diff --git a/projects-appendix/modules/ROOT/attachments/project_template.ipynb b/projects-appendix/modules/ROOT/attachments/project_template.ipynb deleted file mode 100644 index 550d20aed..000000000 --- a/projects-appendix/modules/ROOT/attachments/project_template.ipynb +++ /dev/null @@ -1,190 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "be02a957-7133-4d02-818e-fedeb3cecb05", - "metadata": {}, - "source": [ - "# Project X -- [First Name] [Last Name]" - ] - }, - { - "cell_type": "markdown", - "id": "a1228853-dd19-4ab2-89e0-0394d7d72de3", - "metadata": {}, - "source": [ - "**TA Help:** John Smith, Alice Jones\n", - "\n", - "- Help with figuring out how to write a function.\n", - " \n", - "**Collaboration:** Friend1, Friend2\n", - " \n", - "- Helped figuring out how to load the dataset.\n", - "- Helped debug error with my plot." - ] - }, - { - "cell_type": "markdown", - "id": "6180e742-8e39-4698-98ff-5b00c8cf8ea0", - "metadata": {}, - "source": [ - "## Question 1" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "49445606-d363-41b4-b479-e319a9a84c01", - "metadata": {}, - "outputs": [], - "source": [ - "# code here" - ] - }, - { - "cell_type": "markdown", - "id": "b456e57c-4a12-464b-999a-ef2df5af80c1", - "metadata": {}, - "source": [ - "Markdown notes and sentences and analysis written here." - ] - }, - { - "cell_type": "markdown", - "id": "fc601975-35ed-4680-a4e1-0273ee3cc047", - "metadata": {}, - "source": [ - "## Question 2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a16336a1-1ef0-41e8-bc7c-49387db27497", - "metadata": {}, - "outputs": [], - "source": [ - "# code here" - ] - }, - { - "cell_type": "markdown", - "id": "14dc22d4-ddc3-41cc-a91a-cb0025bc0c80", - "metadata": {}, - "source": [ - "Markdown notes and sentences and analysis written here." - ] - }, - { - "cell_type": "markdown", - "id": "8e586edd-ff26-4ce2-8f6b-2424b26f2929", - "metadata": {}, - "source": [ - "## Question 3" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bbe0f40d-9655-4653-9ca8-886bdb61cb91", - "metadata": {}, - "outputs": [], - "source": [ - "# code here" - ] - }, - { - "cell_type": "markdown", - "id": "47c6229f-35f7-400c-8366-c442baa5cf47", - "metadata": {}, - "source": [ - "Markdown notes and sentences and analysis written here." - ] - }, - { - "cell_type": "markdown", - "id": "da22f29c-d245-4d2b-9fc1-ca14cb6087d9", - "metadata": {}, - "source": [ - "## Question 4" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8cffc767-d1c8-4d64-b7dc-f0d2ee8a80d1", - "metadata": {}, - "outputs": [], - "source": [ - "# code here" - ] - }, - { - "cell_type": "markdown", - "id": "0d552245-b4d6-474a-9cc9-fa7b8e674d55", - "metadata": {}, - "source": [ - "Markdown notes and sentences and analysis written here." - ] - }, - { - "cell_type": "markdown", - "id": "88c9cdac-3e92-498f-83fa-e089bfc44ac8", - "metadata": {}, - "source": [ - "## Question 5" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d370d7c9-06db-42b9-b75f-240481a5c491", - "metadata": {}, - "outputs": [], - "source": [ - "# code here" - ] - }, - { - "cell_type": "markdown", - "id": "9fbf00fb-2418-460f-ae94-2a32b0c28952", - "metadata": {}, - "source": [ - "Markdown notes and sentences and analysis written here." - ] - }, - { - "cell_type": "markdown", - "id": "f76442d6-d02e-4f26-b9d6-c3183e1d6929", - "metadata": {}, - "source": [ - "## Pledge\n", - "\n", - "By submitting this work I hereby pledge that this is my own, personal work. 
I've acknowledged in the designated place at the top of this file all sources that I used to complete said work, including but not limited to: online resources, books, and electronic communications. I've noted all collaboration with fellow students and/or TA's. I did not copy or plagiarize another's work.\n", - "\n", - "> As a Boilermaker pursuing academic excellence, I pledge to be honest and true in all that I do. Accountable together – We are Purdue." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "f2022-s2023", - "language": "python", - "name": "f2022-s2023" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.5" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/projects-appendix/modules/ROOT/attachments/think_summer_project_template.ipynb b/projects-appendix/modules/ROOT/attachments/think_summer_project_template.ipynb deleted file mode 100644 index 411122592..000000000 --- a/projects-appendix/modules/ROOT/attachments/think_summer_project_template.ipynb +++ /dev/null @@ -1,190 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "be02a957-7133-4d02-818e-fedeb3cecb05", - "metadata": {}, - "source": [ - "# Project X -- [First Name] [Last Name]" - ] - }, - { - "cell_type": "markdown", - "id": "a1228853-dd19-4ab2-89e0-0394d7d72de3", - "metadata": {}, - "source": [ - "**TA Help:** John Smith, Alice Jones\n", - "\n", - "- Help with figuring out how to write a function.\n", - " \n", - "**Collaboration:** Friend1, Friend2\n", - " \n", - "- Helped figuring out how to load the dataset.\n", - "- Helped debug error with my plot." - ] - }, - { - "cell_type": "markdown", - "id": "6180e742-8e39-4698-98ff-5b00c8cf8ea0", - "metadata": {}, - "source": [ - "## Question 1" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "49445606-d363-41b4-b479-e319a9a84c01", - "metadata": {}, - "outputs": [], - "source": [ - "# code here" - ] - }, - { - "cell_type": "markdown", - "id": "b456e57c-4a12-464b-999a-ef2df5af80c1", - "metadata": {}, - "source": [ - "Markdown notes and sentences and analysis written here." - ] - }, - { - "cell_type": "markdown", - "id": "fc601975-35ed-4680-a4e1-0273ee3cc047", - "metadata": {}, - "source": [ - "## Question 2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a16336a1-1ef0-41e8-bc7c-49387db27497", - "metadata": {}, - "outputs": [], - "source": [ - "# code here" - ] - }, - { - "cell_type": "markdown", - "id": "14dc22d4-ddc3-41cc-a91a-cb0025bc0c80", - "metadata": {}, - "source": [ - "Markdown notes and sentences and analysis written here." - ] - }, - { - "cell_type": "markdown", - "id": "8e586edd-ff26-4ce2-8f6b-2424b26f2929", - "metadata": {}, - "source": [ - "## Question 3" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bbe0f40d-9655-4653-9ca8-886bdb61cb91", - "metadata": {}, - "outputs": [], - "source": [ - "# code here" - ] - }, - { - "cell_type": "markdown", - "id": "47c6229f-35f7-400c-8366-c442baa5cf47", - "metadata": {}, - "source": [ - "Markdown notes and sentences and analysis written here." 
- ] - }, - { - "cell_type": "markdown", - "id": "da22f29c-d245-4d2b-9fc1-ca14cb6087d9", - "metadata": {}, - "source": [ - "## Question 4" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8cffc767-d1c8-4d64-b7dc-f0d2ee8a80d1", - "metadata": {}, - "outputs": [], - "source": [ - "# code here" - ] - }, - { - "cell_type": "markdown", - "id": "0d552245-b4d6-474a-9cc9-fa7b8e674d55", - "metadata": {}, - "source": [ - "Markdown notes and sentences and analysis written here." - ] - }, - { - "cell_type": "markdown", - "id": "88c9cdac-3e92-498f-83fa-e089bfc44ac8", - "metadata": {}, - "source": [ - "## Question 5" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d370d7c9-06db-42b9-b75f-240481a5c491", - "metadata": {}, - "outputs": [], - "source": [ - "# code here" - ] - }, - { - "cell_type": "markdown", - "id": "9fbf00fb-2418-460f-ae94-2a32b0c28952", - "metadata": {}, - "source": [ - "Markdown notes and sentences and analysis written here." - ] - }, - { - "cell_type": "markdown", - "id": "f76442d6-d02e-4f26-b9d6-c3183e1d6929", - "metadata": {}, - "source": [ - "## Pledge\n", - "\n", - "By submitting this work I hereby pledge that this is my own, personal work. I've acknowledged in the designated place at the top of this file all sources that I used to complete said work, including but not limited to: online resources, books, and electronic communications. I've noted all collaboration with fellow students and/or TA's. I did not copy or plagiarize another's work.\n", - "\n", - "> As a Boilermaker pursuing academic excellence, I pledge to be honest and true in all that I do. Accountable together – We are Purdue." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "think-summer", - "language": "python", - "name": "think-summer" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.5" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/projects-appendix/modules/ROOT/examples/10100-2022-projects.csv b/projects-appendix/modules/ROOT/examples/10100-2022-projects.csv deleted file mode 100644 index ba5dc1353..000000000 --- a/projects-appendix/modules/ROOT/examples/10100-2022-projects.csv +++ /dev/null @@ -1,14 +0,0 @@ -Project,Release date,Due date -xref:fall2022/10100/10100-2022-project01.adoc[Project 1: Getting acquainted with Jupyter Lab],August 22,September 9 -xref:fall2022/10100/10100-2022-project02.adoc[Project 2: Introduction to R: part I],August 25,September 9 -xref:fall2022/10100/10100-2022-project03.adoc[Project 3: Introduction to R: part II],September 8,September 16 -xref:fall2022/10100/10100-2022-project04.adoc[Project 4: Introduction to R: part III],September 15,September 23 -xref:fall2022/10100/10100-2022-project05.adoc[Project 5: Tapply],September 22,September 30 -xref:fall2022/10100/10100-2022-project06.adoc[Project 6: Vectorized operations in R],September 29,October 7 -xref:fall2022/10100/10100-2022-project07.adoc[Project 7: Review: part I],October 6,October 21 -xref:fall2022/10100/10100-2022-project08.adoc[Project 8: Review: part II],October 20,October 28 -xref:fall2022/10100/10100-2022-project09.adoc[Project 9: Base R functions],October 27,November 4 -xref:fall2022/10100/10100-2022-project10.adoc[Project 10: Functions in R: part I],November 3,November 11 -"xref:fall2022/10100/10100-2022-project11.adoc[Project 11: Functions in 
R: part II]",November 10,November 18 -"xref:fall2022/10100/10100-2022-project12.adoc[Project 12: Lists & Sapply]",November 17,December 2 -xref:fall2022/10100/10100-2022-project13.adoc[Project 13: Review: part III],December 1,December 9 diff --git a/projects-appendix/modules/ROOT/examples/10200-2023-projects.csv b/projects-appendix/modules/ROOT/examples/10200-2023-projects.csv deleted file mode 100644 index 92e2679c9..000000000 --- a/projects-appendix/modules/ROOT/examples/10200-2023-projects.csv +++ /dev/null @@ -1,15 +0,0 @@ -Project,Release date,Due date -xref:spring2023/10200/10200-2023-project01.adoc[Project 1: Introduction to Python: part I],January 9,January 20 -xref:spring2023/10200/10200-2023-project02.adoc[Project 2: Introduction to Python: part II],January 19,January 27 -xref:spring2023/10200/10200-2023-project03.adoc[Project 3: Introduction to Python: part III],January 26, February 3 -xref:spring2023/10200/10200-2023-project04.adoc[Project 4: Scientific computing & pandas: part I],February 2,February 10 -xref:spring2023/10200/10200-2023-project05.adoc[Project 5: Functions: part I],February 9,February 17 -xref:spring2023/10200/10200-2023-project06.adoc[Project 6: Functions: part II],February 16,February 24 -xref:spring2023/10200/10200-2023-project07.adoc[Project 7: Scientific computing & pandas: part II],February 23,March 3 -xref:spring2023/10200/10200-2023-project08.adoc[Project 8: Scientific computing & pandas: part III],March 2,March 10 -xref:spring2023/10200/10200-2023-project09.adoc[Project 9: Scientific computing & pandas: part IV],March 9,March 24 -xref:spring2023/10200/10200-2023-project10.adoc[Project 10: Importing and using packages],March 23,March 31 -"xref:spring2023/10200/10200-2023-project11.adoc[Project 11: Classes, dunder methods, attributes, methods, etc.: part I]",March 30,April 7 -"xref:spring2023/10200/10200-2023-project12.adoc[Project 12: Classes, dunder methods, attributes, methods, etc.: part II]",April 6,April 14 -xref:spring2023/10200/10200-2023-project13.adoc[Project 13: Data wrangling and matplotlib: part I],April 13,April 21 -xref:spring2023/10200/10200-2023-project14.adoc[Project 14: Data wrangling and matplotlib: part II],April 20,April 28 \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/examples/10200-2024-projects.csv b/projects-appendix/modules/ROOT/examples/10200-2024-projects.csv deleted file mode 100644 index 84e589af2..000000000 --- a/projects-appendix/modules/ROOT/examples/10200-2024-projects.csv +++ /dev/null @@ -1,15 +0,0 @@ -Project,Release date,Due date -xref:spring2024/10200/10200-2024-project01.adoc[Project 1: Getting acquainted with Jupyter Lab],8-Jan,19-Jan -xref:spring2024/10200/10200-2024-project02.adoc[Project 2: Python tuples/lists/data frames/matplotlib],11-Jan,26-Jan -xref:spring2024/10200/10200-2024-project03.adoc[Project 3: Looping through files],25-Jan,2-Feb -xref:spring2024/10200/10200-2024-project04.adoc[Project 4: Looping through data frames],1-Feb,9-Feb -xref:spring2024/10200/10200-2024-project05.adoc[Project 5: Writing functions for analyzing data],8-Feb,16-Feb -xref:spring2024/10200/10200-2024-project06.adoc[Project 6: More practice with functions],Feb 15,Feb 23 -xref:spring2024/10200/10200-2024-project07.adoc[Project 7: Even more practice with functions],22-Feb,1-Mar -xref:spring2024/10200/10200-2024-project08.adoc[Project 8: Another project with functions],Feb 29,Mar 8 -xref:spring2024/10200/10200-2024-project09.adoc[Project 9: Deeper dive into functions and analysis of data frames],7-Mar,22-Mar 
-xref:spring2024/10200/10200-2024-project10.adoc[Project 10: Introduction to numpy],21-Mar,29-Mar -xref:spring2024/10200/10200-2024-project11.adoc[Project 11: Introduction to classes],28-Mar,5-Apr -xref:spring2024/10200/10200-2024-project12.adoc[Project 12: Deeper dive into classes],4-Apr,12-Apr -xref:spring2024/10200/10200-2024-project13.adoc[Project 13: Introduction to flask],11-Apr,19-Apr -xref:spring2024/10200/10200-2024-project14.adoc[Project 14: Feedback about Spring 2024],18-Apr,26-Apr \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/examples/19000-s2022-projects.csv b/projects-appendix/modules/ROOT/examples/19000-s2022-projects.csv deleted file mode 100644 index 3ab3173a2..000000000 --- a/projects-appendix/modules/ROOT/examples/19000-s2022-projects.csv +++ /dev/null @@ -1,15 +0,0 @@ -Project,Release date,Due date -xref:spring2022/19000/19000-s2022-project01.adoc[Project 1: Introduction to Python: part I],January 6,January 21 -xref:spring2022/19000/19000-s2022-project02.adoc[Project 2: Introduction to Python: part II],January 20,January 28 -xref:spring2022/19000/19000-s2022-project03.adoc[Project 3: Introduction to Python: part III],January 27,February 4 -xref:spring2022/19000/19000-s2022-project04.adoc[Project 4: Scientific computing & pandas: part I],February 3,February 11 -xref:spring2022/19000/19000-s2022-project05.adoc[Project 5: Functions: part I],February 10,February 18 -xref:spring2022/19000/19000-s2022-project06.adoc[Project 6: Functions: part II],February 17,February 25 -xref:spring2022/19000/19000-s2022-project07.adoc[Project 7: Scientific computing & pandas: part II],February 24,March 4 -xref:spring2022/19000/19000-s2022-project08.adoc[Project 8: Scientific computing & pandas: part III],March 3,March 11 -xref:spring2022/19000/19000-s2022-project09.adoc[Project 9: Scientific computing & pandas: part IV],March 17,March 25 -xref:spring2022/19000/19000-s2022-project10.adoc[Project 10: Importing and using packages],March 24,April 1 -"xref:spring2022/19000/19000-s2022-project11.adoc[Project 11: Classes, dunder methods, attributes, methods, etc.: part I]",March 31,April 8 -"xref:spring2022/19000/19000-s2022-project12.adoc[Project 12: Classes, dunder methods, attributes, methods, etc.: part II]",April 7,April 15 -xref:spring2022/19000/19000-s2022-project13.adoc[Project 13: Data wrangling and matplotlib: part I],April 14,April 22 -xref:spring2022/19000/19000-s2022-project14.adoc[Project 14: Data wrangling and matplotlib: part II],April 21,April 29 diff --git a/projects-appendix/modules/ROOT/examples/20100-2022-projects.csv b/projects-appendix/modules/ROOT/examples/20100-2022-projects.csv deleted file mode 100644 index 1d452abb5..000000000 --- a/projects-appendix/modules/ROOT/examples/20100-2022-projects.csv +++ /dev/null @@ -1,14 +0,0 @@ -Project,Release date,Due date -xref:fall2022/20100/20100-2022-project01.adoc[Project 1: Review: Jupyter Lab],August 22,September 9 -xref:fall2022/20100/20100-2022-project02.adoc[Project 2: Navigating UNIX: part I],August 25,September 9 -xref:fall2022/20100/20100-2022-project03.adoc[Project 3: Navigating UNIX: part II],September 8,September 16 -xref:fall2022/20100/20100-2022-project04.adoc[Project 4: Pattern matching in UNIX & R],September 15,September 23 -xref:fall2022/20100/20100-2022-project05.adoc[Project 5: awk and bash scripts: part I],September 22,September 30 -xref:fall2022/20100/20100-2022-project06.adoc[Project 6: awk & bash scripts: part II],September 29,October 7 
-xref:fall2022/20100/20100-2022-project07.adoc[Project 7: awk & bash scripts: part III],October 6,October 21 -xref:fall2022/20100/20100-2022-project08.adoc[Project 8: SQL: part I],October 20,October 28 -xref:fall2022/20100/20100-2022-project09.adoc[Project 9: SQL: part II],October 27,November 4 -xref:fall2022/20100/20100-2022-project10.adoc[Project 10: SQL: part III],November 3,November 11 -xref:fall2022/20100/20100-2022-project11.adoc[Project 11: SQL: part IV],November 10,November 18 -xref:fall2022/20100/20100-2022-project12.adoc[Project 12: SQL: part V],November 17,December 2 -xref:fall2022/20100/20100-2022-project13.adoc[Project 13: SQL: part VI],December 1,December 9 \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/examples/20200-2023-projects.csv b/projects-appendix/modules/ROOT/examples/20200-2023-projects.csv deleted file mode 100644 index 94ca4326d..000000000 --- a/projects-appendix/modules/ROOT/examples/20200-2023-projects.csv +++ /dev/null @@ -1,15 +0,0 @@ -Project,Release date,Due date -xref:spring2023/20200/20200-2023-project01.adoc[Project 1: Introduction to XML],January 9,January 20 -xref:spring2023/20200/20200-2023-project02.adoc[Project 2: Web scraping in Python: part I],January 19,January 27 -xref:spring2023/20200/20200-2023-project03.adoc[Project 3: Web scraping in Python: part II],January 26, February 3 -xref:spring2023/20200/20200-2023-project04.adoc[Project 4: Web scraping in Python: part III],February 2,February 10 -xref:spring2023/20200/20200-2023-project05.adoc[Project 5: Web scraping in Python: part IV],February 9,February 17 -xref:spring2023/20200/20200-2023-project06.adoc[Project 6: Web scraping in Python: part V],February 16,February 24 -xref:spring2023/20200/20200-2023-project07.adoc[Project 7: Plotting in Python: part I],February 23,March 3 -xref:spring2023/20200/20200-2023-project08.adoc[Project 8: Plotting in Python: part II],March 2,March 10 -xref:spring2023/20200/20200-2023-project09.adoc[Project 9: Plotting in Python: part III],March 9,March 24 -xref:spring2023/20200/20200-2023-project10.adoc[Project 10: Plotting with ggplot: part I],March 23,March 31 -xref:spring2023/20200/20200-2023-project11.adoc[Project 11: Plotting with ggplot: part II],March 30,April 7 -xref:spring2023/20200/20200-2023-project12.adoc[Project 12: Tidyverse and data.table: part I],April 6,April 14 -xref:spring2023/20200/20200-2023-project13.adoc[Project 13: Tidyverse and data.table: part II],April 13,April 21 -xref:spring2023/20200/20200-2023-project14.adoc[Project 14: Tidyverse and data.table: part IV],April 20,April 28 \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/examples/20200-2024-projects.csv b/projects-appendix/modules/ROOT/examples/20200-2024-projects.csv deleted file mode 100644 index 2ca993ae4..000000000 --- a/projects-appendix/modules/ROOT/examples/20200-2024-projects.csv +++ /dev/null @@ -1,15 +0,0 @@ -Project,Release date,Due date -xref:spring2024/20200/20200-2024-project01.adoc[Project 1: Review: Jupyter Lab],8-Jan,19-Jan -xref:spring2024/20200/20200-2024-project02.adoc[Project 2: Introduction to web scraping with BeautifulSoup],11-Jan,26-Jan -xref:spring2024/20200/20200-2024-project03.adoc[Project 3: Introduction to web scraping with XPath],25-Jan,2-Feb -xref:spring2024/20200/20200-2024-project04.adoc[Project 4: Analyzing more than one hundred thousand XML files at once],1-Feb,9-Feb -xref:spring2024/20200/20200-2024-project05.adoc[Project 5: Extracting information about No Starch Press books from the OReilly website using 
Selenium],8-Feb,16-Feb -xref:spring2024/20200/20200-2024-project06.adoc[Project 6: Data Visualization],Feb 15,Feb 23 -xref:spring2024/20200/20200-2024-project07.adoc[Project 7: Learning Dash],22-Feb,1-Mar -xref:spring2024/20200/20200-2024-project08.adoc[Project 8: Introduction to Spark SQL],Feb 29,Mar 8 -xref:spring2024/20200/20200-2024-project09.adoc[Project 9: More Spark SQL and also streaming Spark SQL],7-Mar,22-Mar -xref:spring2024/20200/20200-2024-project10.adoc[Project 10: Introduction to Machine Learning],21-Mar,29-Mar -xref:spring2024/20200/20200-2024-project11.adoc[Project 11: More information about Machine Learning],28-Mar,5-Apr -xref:spring2024/20200/20200-2024-project12.adoc[Project 12: Introduction to containerization],4-Apr,12-Apr -xref:spring2024/20200/20200-2024-project13.adoc[Project 13: More information about containerization],11-Apr,19-Apr -xref:spring2024/20200/20200-2024-project14.adoc[Project 14: Feedback about Spring 2024],18-Apr,26-Apr \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/examples/29000-s2022-projects.csv b/projects-appendix/modules/ROOT/examples/29000-s2022-projects.csv deleted file mode 100644 index 78706a4d6..000000000 --- a/projects-appendix/modules/ROOT/examples/29000-s2022-projects.csv +++ /dev/null @@ -1,15 +0,0 @@ -Project,Release date,Due date -xref:spring2022/29000/29000-s2022-project01.adoc[Project 1: Introduction to XML],January 6,January 21 -xref:spring2022/29000/29000-s2022-project02.adoc[Project 2: Web scraping in Python: part I],January 20,January 28 -xref:spring2022/29000/29000-s2022-project03.adoc[Project 3: Web scraping in Python: part II],January 27,February 4 -xref:spring2022/29000/29000-s2022-project04.adoc[Project 4: Web scraping in Python: part III],February 3,February 11 -xref:spring2022/29000/29000-s2022-project05.adoc[Project 5: Web scraping in Python: part IV],February 10,February 18 -xref:spring2022/29000/29000-s2022-project06.adoc[Project 6: Plotting in Python: part I],February 17,February 25 -xref:spring2022/29000/29000-s2022-project07.adoc[Project 7: Plotting in Python: part II],February 24,March 4 -xref:spring2022/29000/29000-s2022-project08.adoc[Project 8: Writing Python scripts: part I],March 3,March 11 -xref:spring2022/29000/29000-s2022-project09.adoc[Project 9: Writing Python scripts: part II],March 17,March 25 -xref:spring2022/29000/29000-s2022-project10.adoc[Project 10: Plotting with ggplot: part I],March 24,April 1 -xref:spring2022/29000/29000-s2022-project11.adoc[Project 11: Plotting with ggplot: part II],March 31,April 8 -xref:spring2022/29000/29000-s2022-project12.adoc[Project 12: Tidyverse and data.table: part I],April 7,April 15 -xref:spring2022/29000/29000-s2022-project13.adoc[Project 13: Tidyverse and data.table: part II],April 14,April 22 -xref:spring2022/29000/29000-s2022-project14.adoc[Project 14: Tidyverse and data.table: part III],April 21,April 29 \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/examples/30100-2022-projects.csv b/projects-appendix/modules/ROOT/examples/30100-2022-projects.csv deleted file mode 100644 index a8eadc627..000000000 --- a/projects-appendix/modules/ROOT/examples/30100-2022-projects.csv +++ /dev/null @@ -1,14 +0,0 @@ -Project,Release date,Due date -"xref:fall2022/30100/30100-2022-project01.adoc[Project 1: Review: Jupyter Lab]",August 22,September 9 -"xref:fall2022/30100/30100-2022-project02.adoc[Project 2: Python documentation: part I]",August 25,September 9 -"xref:fall2022/30100/30100-2022-project03.adoc[Project 3: Python 
documentation: part II]",September 8,September 16 -"xref:fall2022/30100/30100-2022-project04.adoc[Project 4: Review: part I]",September 15,September 23 -xref:fall2022/30100/30100-2022-project05.adoc[Project 5: Testing in Python: part I],September 22,September 30 -"xref:fall2022/30100/30100-2022-project06.adoc[Project 6: Testing in Python: part II]",September 29,October 7 -xref:fall2022/30100/30100-2022-project07.adoc[Project 7: Review: part II],October 6,October 21 -xref:fall2022/30100/30100-2022-project08.adoc[Project 8: Virtual environments & packages: part I],October 20,October 28 -xref:fall2022/30100/30100-2022-project09.adoc[Project 9: Virtual environments & packages: part II],October 27,November 4 -xref:fall2022/30100/30100-2022-project10.adoc[Project 10: Virtual environments & packages: part III & APIs: part I],November 3,November 11 -xref:fall2022/30100/30100-2022-project11.adoc[Project 11: APIs: part II],November 10,November 18 -xref:fall2022/30100/30100-2022-project12.adoc[Project 12: APIs: part III],November 17,December 2 -xref:fall2022/30100/30100-2022-project13.adoc[Project 13: APIs: part IV],December 1,December 9 \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/examples/30200-2023-projects.csv b/projects-appendix/modules/ROOT/examples/30200-2023-projects.csv deleted file mode 100644 index 2c8d67763..000000000 --- a/projects-appendix/modules/ROOT/examples/30200-2023-projects.csv +++ /dev/null @@ -1,15 +0,0 @@ -Project,Release date,Due date -"xref:spring2023/30200/30200-2023-project01.adoc[Project 1: Review: UNIX, terminology, etc.]",January 9,January 20 -"xref:spring2023/30200/30200-2023-project02.adoc[Project 2: Concurrency, parallelism, cores, threads: part I]",January 19,January 27 -"xref:spring2023/30200/30200-2023-project03.adoc[Project 3: Concurrency, parallelism, cores, threads: part II]",January 26, February 3 -"xref:spring2023/30200/30200-2023-project04.adoc[Project 4: Concurrency, parallelism, cores, threads: part III]",February 2,February 10 -xref:spring2023/30200/30200-2023-project05.adoc[Project 5: High performance computing on Brown with SLURM: part I],February 9,February 17 -xref:spring2023/30200/30200-2023-project06.adoc[Project 6: High performance computing on Brown with SLURM: part II],February 16,February 24 -xref:spring2023/30200/30200-2023-project07.adoc[Project 7: High performance computer on Brown with SLURM: part III],February 23,March 3 -xref:spring2023/30200/30200-2023-project08.adoc[Project 8: PyTorch & JAX: part I],March 2,March 10 -xref:spring2023/30200/30200-2023-project09.adoc[Project 9: PyTorch & JAX: part II],March 9,March 24 -xref:spring2023/30200/30200-2023-project10.adoc[Project 10: High performance computing on Brown with SLURM: part IV -- GPUs],March 23,March 31 -xref:spring2023/30200/30200-2023-project11.adoc[Project 11: PyTorch & JAX: part III],March 30,April 7 -xref:spring2023/30200/30200-2023-project12.adoc[Project 12: PyTorch & JAX: part IV],April 6,April 14 -xref:spring2023/30200/30200-2023-project13.adoc[Project 13: ETL fun & review: part I],April 13,April 21 -xref:spring2023/30200/30200-2023-project14.adoc[Project 14: ETL fun & review: part II],April 20,April 28 \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/examples/30200-2024-projects.csv b/projects-appendix/modules/ROOT/examples/30200-2024-projects.csv deleted file mode 100644 index 37a628d62..000000000 --- a/projects-appendix/modules/ROOT/examples/30200-2024-projects.csv +++ /dev/null @@ -1,15 +0,0 @@ -Project,Release date,Due 
date -xref:spring2024/30200/30200-2024-project01.adoc[Project 1],8-Jan,19-Jan -xref:spring2024/30200/30200-2024-project02.adoc[Project 2],11-Jan,26-Jan -xref:spring2024/30200/30200-2024-project03.adoc[Project 3],25-Jan,2-Feb -xref:spring2024/30200/30200-2024-project04.adoc[Project 4],1-Feb,9-Feb -xref:spring2024/30200/30200-2024-project05.adoc[Project 5],8-Feb,16-Feb -xref:spring2024/30200/30200-2024-project06.adoc[Project 6],Feb 15,Feb 23 -xref:spring2024/30200/30200-2024-project07.adoc[Project 7],22-Feb,1-Mar -xref:spring2024/30200/30200-2024-project08.adoc[Project 8],Feb 29,Mar 8 -xref:spring2024/30200/30200-2024-project09.adoc[Project 9],7-Mar,22-Mar -xref:spring2024/30200/30200-2024-project10.adoc[Project 10],21-Mar,29-Mar -xref:spring2024/30200/30200-2024-project11.adoc[Project 11],28-Mar,5-Apr -xref:spring2024/30200/30200-2024-project12.adoc[Project 12],4-Apr,12-Apr -xref:spring2024/30200/30200-2024-project13.adoc[Project 13],11-Apr,19-Apr -xref:spring2024/30200/30200-2024-project14.adoc[Project 14],18-Apr,26-Apr \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/examples/39000-s2022-projects.csv b/projects-appendix/modules/ROOT/examples/39000-s2022-projects.csv deleted file mode 100644 index 8ae22472f..000000000 --- a/projects-appendix/modules/ROOT/examples/39000-s2022-projects.csv +++ /dev/null @@ -1,15 +0,0 @@ -Project,Release date,Due date -"xref:spring2022/39000/39000-s2022-project01.adoc[Project 1: Review: UNIX, terminology, etc.]",January 6,January 21 -"xref:spring2022/39000/39000-s2022-project02.adoc[Project 2: Concurrency, parallelism, cores, threads: part I]",January 20,January 28 -"xref:spring2022/39000/39000-s2022-project03.adoc[Project 3: Concurrency, parallelism, cores, threads: part II]",January 27,February 4 -"xref:spring2022/39000/39000-s2022-project04.adoc[Project 4: Concurrency, parallelism, cores, threads: part III]",February 3,February 11 -xref:spring2022/39000/39000-s2022-project05.adoc[Project 5: High performance computing on Brown with SLURM: part I],February 10,February 18 -xref:spring2022/39000/39000-s2022-project06.adoc[Project 6: High performance computing on Brown with SLURM: part II],February 17,February 25 -xref:spring2022/39000/39000-s2022-project07.adoc[Project 7: High performance computer on Brown with SLURM: part III],February 24,March 4 -xref:spring2022/39000/39000-s2022-project08.adoc[Project 8: PyTorch & JAX: part I],March 3,March 11 -xref:spring2022/39000/39000-s2022-project09.adoc[Project 9: PyTorch & JAX: part II],March 17,March 25 -xref:spring2022/39000/39000-s2022-project10.adoc[Project 10: High performance computing on Brown with SLURM: part IV -- GPUs],March 24,April 1 -xref:spring2022/39000/39000-s2022-project11.adoc[Project 11: PyTorch & JAX: part III],March 31,April 8 -xref:spring2022/39000/39000-s2022-project12.adoc[Project 12: PyTorch & JAX: part IV],April 7,April 15 -xref:spring2022/39000/39000-s2022-project13.adoc[Project 13: ETL fun & review: part I],April 14,April 22 -xref:spring2022/39000/39000-s2022-project14.adoc[Project 14: ETL fun & review: part II],April 21,April 29 diff --git a/projects-appendix/modules/ROOT/examples/40100-2022-projects.csv b/projects-appendix/modules/ROOT/examples/40100-2022-projects.csv deleted file mode 100644 index aaf44f432..000000000 --- a/projects-appendix/modules/ROOT/examples/40100-2022-projects.csv +++ /dev/null @@ -1,14 +0,0 @@ -Project,Release date,Due date -"xref:fall2022/40100/40100-2022-project01.adoc[Project 1: Review: Jupyter Lab]",August 22,September 9 
-"xref:fall2022/40100/40100-2022-project02.adoc[Project 2: SQLite deepish dive: part I]",August 25,September 9 -"xref:fall2022/40100/40100-2022-project03.adoc[Project 3: SQLite deepish dive: part II]",September 8,September 16 -"xref:fall2022/40100/40100-2022-project04.adoc[Project 4: SQLite deepish dive: part III]",September 15,September 23 -"xref:fall2022/40100/40100-2022-project05.adoc[Project 5: SQLite deepish dive: part IV]",September 22,September 30 -"xref:fall2022/40100/40100-2022-project06.adoc[Project 6: Working with images: part I]",September 29,October 7 -xref:fall2022/40100/40100-2022-project07.adoc[Project 7: Working with images: part II],October 6,October 21 -xref:fall2022/40100/40100-2022-project08.adoc[Project 8: Working with images: part III],October 20,October 28 -xref:fall2022/40100/40100-2022-project09.adoc[Project 9: Working with images: part IV],October 27,November 4 -xref:fall2022/40100/40100-2022-project10.adoc[Project 10: Web scraping and mixed topics: part I],November 3,November 11 -xref:fall2022/40100/40100-2022-project11.adoc[Project 11: Web scraping and mixed topics: part II],November 10,November 18 -xref:fall2022/40100/40100-2022-project12.adoc[Project 12: Web scraping and mixed topics: part III],November 17,December 2 -xref:fall2022/40100/40100-2022-project13.adoc[Project 13: Web scraping and mixed topics: part IV],December 1,December 9 diff --git a/projects-appendix/modules/ROOT/examples/40200-2023-projects.csv b/projects-appendix/modules/ROOT/examples/40200-2023-projects.csv deleted file mode 100644 index 203dfb6ab..000000000 --- a/projects-appendix/modules/ROOT/examples/40200-2023-projects.csv +++ /dev/null @@ -1,15 +0,0 @@ -Project,Release date,Due date -"xref:spring2023/40200/40200-2023-project01.adoc[Project 1: Review JAX]",January 9,January 20 -"xref:spring2023/40200/40200-2023-project02.adoc[Project 2: Building a dashboard: part I]",January 19,January 27 -"xref:spring2023/40200/40200-2023-project03.adoc[Project 3: Building a dashboard: part II]",January 26, February 3 -"xref:spring2023/40200/40200-2023-project04.adoc[Project 4: Building a dashboard: part III]",February 2,February 10 -xref:spring2023/40200/40200-2023-project05.adoc[Project 5: Building a dashboard: part IV],February 9,February 17 -xref:spring2023/40200/40200-2023-project06.adoc[Project 6: Building a dashboard: part V],February 16,February 24 -xref:spring2023/40200/40200-2023-project07.adoc[Project 7: Building a dashboard: part VI],February 23,March 3 -xref:spring2023/40200/40200-2023-project08.adoc[Project 8: Building a dashboard: part VII],March 2,March 10 -xref:spring2023/40200/40200-2023-project09.adoc[Project 9: Building a dashboard: part VIII],March 9,March 24 -xref:spring2023/40200/40200-2023-project10.adoc[Project 10: Building a dashboard: part IX],March 23,March 31 -xref:spring2023/40200/40200-2023-project11.adoc[Project 11: Containers: part I],March 30,April 7 -xref:spring2023/40200/40200-2023-project12.adoc[Project 12: Containers: part II],April 6,April 14 -xref:spring2023/40200/40200-2023-project13.adoc[Project 13: Containers: part III],April 13,April 21 -xref:spring2023/40200/40200-2023-project14.adoc[Project 14: Containers: part IV],April 20,April 28 \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/examples/40200-2024-projects.csv b/projects-appendix/modules/ROOT/examples/40200-2024-projects.csv deleted file mode 100644 index 3e0ab71c8..000000000 --- a/projects-appendix/modules/ROOT/examples/40200-2024-projects.csv +++ /dev/null @@ -1,15 +0,0 @@ 
-Project,Release date,Due date -xref:spring2024/40200/40200-2024-project01.adoc[Project 1],8-Jan,19-Jan -xref:spring2024/40200/40200-2024-project02.adoc[Project 2],11-Jan,26-Jan -xref:spring2024/40200/40200-2024-project03.adoc[Project 3],25-Jan,2-Feb -xref:spring2024/40200/40200-2024-project04.adoc[Project 4],1-Feb,9-Feb -xref:spring2024/40200/40200-2024-project05.adoc[Project 5],8-Feb,16-Feb -xref:spring2024/40200/40200-2024-project06.adoc[Project 6],Feb 15,Feb 23 -xref:spring2024/40200/40200-2024-project07.adoc[Project 7],22-Feb,1-Mar -xref:spring2024/40200/40200-2024-project08.adoc[Project 8],Feb 29,Mar 8 -xref:spring2024/40200/40200-2024-project09.adoc[Project 9],7-Mar,22-Mar -xref:spring2024/40200/40200-2024-project10.adoc[Project 10],21-Mar,29-Mar -xref:spring2024/40200/40200-2024-project11.adoc[Project 11],28-Mar,5-Apr -xref:spring2024/40200/40200-2024-project12.adoc[Project 12],4-Apr,12-Apr -xref:spring2024/40200/40200-2024-project13.adoc[Project 13],11-Apr,19-Apr -xref:spring2024/40200/40200-2024-project14.adoc[Project 14],18-Apr,26-Apr \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/images/f24-101-OH.png b/projects-appendix/modules/ROOT/images/f24-101-OH.png deleted file mode 100644 index baf688047..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-101-OH.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/f24-101-p1-1.png b/projects-appendix/modules/ROOT/images/f24-101-p1-1.png deleted file mode 100644 index 5725b1061..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-101-p1-1.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/f24-101-p1-2.png b/projects-appendix/modules/ROOT/images/f24-101-p1-2.png deleted file mode 100644 index 5408aefc7..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-101-p1-2.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/f24-101-p1-3.png b/projects-appendix/modules/ROOT/images/f24-101-p1-3.png deleted file mode 100644 index 25af69c7c..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-101-p1-3.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/f24-101-p1-4.png b/projects-appendix/modules/ROOT/images/f24-101-p1-4.png deleted file mode 100644 index 953d09ff8..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-101-p1-4.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/f24-101-p10-1.png b/projects-appendix/modules/ROOT/images/f24-101-p10-1.png deleted file mode 100644 index 5375cc4c5..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-101-p10-1.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/f24-201-OH.png b/projects-appendix/modules/ROOT/images/f24-201-OH.png deleted file mode 100644 index c321b39a1..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-201-OH.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/f24-201-p1-1.png b/projects-appendix/modules/ROOT/images/f24-201-p1-1.png deleted file mode 100644 index 12dd5bdb8..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-201-p1-1.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/f24-301-OH.png b/projects-appendix/modules/ROOT/images/f24-301-OH.png deleted file mode 100644 index 850d087e7..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-301-OH.png and /dev/null differ diff --git 
a/projects-appendix/modules/ROOT/images/f24-301-p11-1.PNG b/projects-appendix/modules/ROOT/images/f24-301-p11-1.PNG deleted file mode 100644 index 543e10762..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-301-p11-1.PNG and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/f24-301-p5-1.png b/projects-appendix/modules/ROOT/images/f24-301-p5-1.png deleted file mode 100644 index 401c70ccb..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-301-p5-1.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/f24-301-p7-1-2.PNG b/projects-appendix/modules/ROOT/images/f24-301-p7-1-2.PNG deleted file mode 100644 index bbd1a2b9f..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-301-p7-1-2.PNG and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/f24-301-p7-1.PNG b/projects-appendix/modules/ROOT/images/f24-301-p7-1.PNG deleted file mode 100644 index 13cae4be7..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-301-p7-1.PNG and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/f24-301-p8-1.png b/projects-appendix/modules/ROOT/images/f24-301-p8-1.png deleted file mode 100644 index 075de94f7..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-301-p8-1.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/f24-301-p8-2.png b/projects-appendix/modules/ROOT/images/f24-301-p8-2.png deleted file mode 100644 index 36926d4f2..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-301-p8-2.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/f24-401-OH.png b/projects-appendix/modules/ROOT/images/f24-401-OH.png deleted file mode 100644 index 4e0bbec28..000000000 Binary files a/projects-appendix/modules/ROOT/images/f24-401-OH.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure01.webp b/projects-appendix/modules/ROOT/images/figure01.webp deleted file mode 100644 index f07a29780..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure01.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure02.webp b/projects-appendix/modules/ROOT/images/figure02.webp deleted file mode 100644 index 2460ca5ec..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure02.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure03.webp b/projects-appendix/modules/ROOT/images/figure03.webp deleted file mode 100644 index 064c14c82..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure03.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure04.webp b/projects-appendix/modules/ROOT/images/figure04.webp deleted file mode 100644 index e836fd479..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure04.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure05.webp b/projects-appendix/modules/ROOT/images/figure05.webp deleted file mode 100644 index a6298c950..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure05.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure06.webp b/projects-appendix/modules/ROOT/images/figure06.webp deleted file mode 100644 index 4c543c1ed..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure06.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure07.webp b/projects-appendix/modules/ROOT/images/figure07.webp deleted file 
mode 100644 index 206ad2fb9..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure07.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure08.webp b/projects-appendix/modules/ROOT/images/figure08.webp deleted file mode 100644 index df664269e..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure08.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure09.webp b/projects-appendix/modules/ROOT/images/figure09.webp deleted file mode 100644 index 3928998ac..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure09.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure10.webp b/projects-appendix/modules/ROOT/images/figure10.webp deleted file mode 100644 index 1e9910f81..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure10.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure11.webp b/projects-appendix/modules/ROOT/images/figure11.webp deleted file mode 100644 index 9ea314a0e..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure11.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure12.webp b/projects-appendix/modules/ROOT/images/figure12.webp deleted file mode 100644 index 905bc1de7..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure12.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure13.webp b/projects-appendix/modules/ROOT/images/figure13.webp deleted file mode 100644 index c9690ef1d..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure13.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure14.webp b/projects-appendix/modules/ROOT/images/figure14.webp deleted file mode 100644 index 7773bc4ba..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure14.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure15.webp b/projects-appendix/modules/ROOT/images/figure15.webp deleted file mode 100644 index 7a1fc82cb..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure15.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure16.webp b/projects-appendix/modules/ROOT/images/figure16.webp deleted file mode 100644 index 7eef43f50..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure16.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure17.webp b/projects-appendix/modules/ROOT/images/figure17.webp deleted file mode 100644 index 0a899198f..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure17.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure18.webp b/projects-appendix/modules/ROOT/images/figure18.webp deleted file mode 100644 index c0f15eb3e..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure18.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure19.webp b/projects-appendix/modules/ROOT/images/figure19.webp deleted file mode 100644 index 4e8335939..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure19.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure20.webp b/projects-appendix/modules/ROOT/images/figure20.webp deleted file mode 100644 index 5625a90a2..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure20.webp and /dev/null differ diff --git 
a/projects-appendix/modules/ROOT/images/figure21.webp b/projects-appendix/modules/ROOT/images/figure21.webp deleted file mode 100644 index 08b955b56..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure21.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure22.webp b/projects-appendix/modules/ROOT/images/figure22.webp deleted file mode 100644 index ec1850e8e..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure22.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure23.webp b/projects-appendix/modules/ROOT/images/figure23.webp deleted file mode 100644 index 516ce478a..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure23.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure24.webp b/projects-appendix/modules/ROOT/images/figure24.webp deleted file mode 100644 index 69b38477d..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure24.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure25.webp b/projects-appendix/modules/ROOT/images/figure25.webp deleted file mode 100644 index 3b0daa1b4..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure25.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure26.webp b/projects-appendix/modules/ROOT/images/figure26.webp deleted file mode 100644 index a8c6c507f..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure26.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure27.webp b/projects-appendix/modules/ROOT/images/figure27.webp deleted file mode 100644 index fe0db74b3..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure27.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure28.webp b/projects-appendix/modules/ROOT/images/figure28.webp deleted file mode 100644 index 79de2ddf5..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure28.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure29.webp b/projects-appendix/modules/ROOT/images/figure29.webp deleted file mode 100644 index cf915d268..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure29.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure30.webp b/projects-appendix/modules/ROOT/images/figure30.webp deleted file mode 100644 index 120209141..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure30.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure31.webp b/projects-appendix/modules/ROOT/images/figure31.webp deleted file mode 100644 index 923057bdb..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure31.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure32.webp b/projects-appendix/modules/ROOT/images/figure32.webp deleted file mode 100644 index 4d482bd62..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure32.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/figure33.webp b/projects-appendix/modules/ROOT/images/figure33.webp deleted file mode 100644 index 3a67633f3..000000000 Binary files a/projects-appendix/modules/ROOT/images/figure33.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/stat19000project2figure1.png b/projects-appendix/modules/ROOT/images/stat19000project2figure1.png deleted file mode 100644 index 
44821c03a..000000000 Binary files a/projects-appendix/modules/ROOT/images/stat19000project2figure1.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/stat19000project2figure2.png b/projects-appendix/modules/ROOT/images/stat19000project2figure2.png deleted file mode 100644 index a98bb800c..000000000 Binary files a/projects-appendix/modules/ROOT/images/stat19000project2figure2.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/stat19000project2figure3.png b/projects-appendix/modules/ROOT/images/stat19000project2figure3.png deleted file mode 100644 index 7d1cf3064..000000000 Binary files a/projects-appendix/modules/ROOT/images/stat19000project2figure3.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure01.webp b/projects-appendix/modules/ROOT/images/think-summer-figure01.webp deleted file mode 100644 index eed3513bf..000000000 Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure01.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure02.webp b/projects-appendix/modules/ROOT/images/think-summer-figure02.webp deleted file mode 100644 index bab93e4ce..000000000 Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure02.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure03.webp b/projects-appendix/modules/ROOT/images/think-summer-figure03.webp deleted file mode 100644 index 04205a7dd..000000000 Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure03.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure04.webp b/projects-appendix/modules/ROOT/images/think-summer-figure04.webp deleted file mode 100644 index e38ea94a1..000000000 Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure04.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure05.webp b/projects-appendix/modules/ROOT/images/think-summer-figure05.webp deleted file mode 100644 index 0e3c82cc7..000000000 Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure05.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure06.webp b/projects-appendix/modules/ROOT/images/think-summer-figure06.webp deleted file mode 100644 index d4f90f050..000000000 Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure06.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure07.webp b/projects-appendix/modules/ROOT/images/think-summer-figure07.webp deleted file mode 100644 index 54c103603..000000000 Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure07.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure08.webp b/projects-appendix/modules/ROOT/images/think-summer-figure08.webp deleted file mode 100644 index 60a41529b..000000000 Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure08.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure09.webp b/projects-appendix/modules/ROOT/images/think-summer-figure09.webp deleted file mode 100644 index 99ccc491e..000000000 Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure09.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure10.webp 
b/projects-appendix/modules/ROOT/images/think-summer-figure10.webp deleted file mode 100644 index 02ab97a54..000000000 Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure10.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure11.webp b/projects-appendix/modules/ROOT/images/think-summer-figure11.webp deleted file mode 100644 index 72a17da10..000000000 Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure11.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure12.webp b/projects-appendix/modules/ROOT/images/think-summer-figure12.webp deleted file mode 100644 index 283622b96..000000000 Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure12.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/images/think-summer-figure13.webp b/projects-appendix/modules/ROOT/images/think-summer-figure13.webp deleted file mode 100644 index 085e2218c..000000000 Binary files a/projects-appendix/modules/ROOT/images/think-summer-figure13.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/nav.adoc b/projects-appendix/modules/ROOT/nav.adoc deleted file mode 100644 index aeadd332a..000000000 --- a/projects-appendix/modules/ROOT/nav.adoc +++ /dev/null @@ -1,488 +0,0 @@ -* Fall 2024 -** xref:fall2024/logistics/office_hours.adoc[Course Office Hours] -** xref:fall2024/logistics/syllabus.adoc[Course Syllabus] -** https://datamine.purdue.edu/events/[Outside Events] -** https://www.piazza.com[Piazza] -** https://ondemand.anvil.rcac.purdue.edu[Anvil] -** https://www.gradescope.com[Gradescope] -** xref:fall2024/10100/10100-2024-projects.adoc[TDM 10100] -*** xref:fall2024/10100/10100-2024-project1.adoc[Project 1] -*** xref:fall2024/10100/10100-2024-project2.adoc[Project 2] -*** xref:fall2024/10100/10100-2024-project3.adoc[Project 3] -*** xref:fall2024/10100/10100-2024-project4.adoc[Project 4] -*** xref:fall2024/10100/10100-2024-project5.adoc[Project 5] -*** xref:fall2024/10100/10100-2024-project6.adoc[Project 6] -*** xref:fall2024/10100/10100-2024-project7.adoc[Project 7] -*** xref:fall2024/10100/10100-2024-project8.adoc[Project 8] -*** xref:fall2024/10100/10100-2024-project9.adoc[Project 9] -*** xref:fall2024/10100/10100-2024-project10.adoc[Project 10] -*** xref:fall2024/10100/10100-2024-project11.adoc[Project 11] -*** xref:fall2024/10100/10100-2024-project12.adoc[Project 12] -*** xref:fall2024/10100/10100-2024-project13.adoc[Project 13] -*** xref:fall2024/10100/10100-2024-project14.adoc[Project 14] -** xref:fall2024/20100/20100-2024-projects.adoc[TDM 20100] -*** xref:fall2024/20100/20100-2024-project1.adoc[Project 1] -*** xref:fall2024/20100/20100-2024-project2.adoc[Project 2] -*** xref:fall2024/20100/20100-2024-project3.adoc[Project 3] -*** xref:fall2024/20100/20100-2024-project4.adoc[Project 4] -*** xref:fall2024/20100/20100-2024-project5.adoc[Project 5] -*** xref:fall2024/20100/20100-2024-project6.adoc[Project 6] -*** xref:fall2024/20100/20100-2024-project7.adoc[Project 7] -*** xref:fall2024/20100/20100-2024-project8.adoc[Project 8] -*** xref:fall2024/20100/20100-2024-project9.adoc[Project 9] -*** xref:fall2024/20100/20100-2024-project10.adoc[Project 10] -*** xref:fall2024/20100/20100-2024-project11.adoc[Project 11] -*** xref:fall2024/20100/20100-2024-project12.adoc[Project 12] -*** xref:fall2024/20100/20100-2024-project13.adoc[Project 13] -*** xref:fall2024/20100/20100-2024-project14.adoc[Project 14] -** 
xref:fall2024/30100/30100-2024-projects.adoc[TDM 30100] -*** xref:fall2024/30100/30100-2024-project1.adoc[Project 1] -*** xref:fall2024/30100/30100-2024-project2.adoc[Project 2] -*** xref:fall2024/30100/30100-2024-project3.adoc[Project 3] -*** xref:fall2024/30100/30100-2024-project4.adoc[Project 4] -*** xref:fall2024/30100/30100-2024-project5.adoc[Project 5] -*** xref:fall2024/30100/30100-2024-project6.adoc[Project 6] -*** xref:fall2024/30100/30100-2024-project7.adoc[Project 7] -*** xref:fall2024/30100/30100-2024-project8.adoc[Project 8] -*** xref:fall2024/30100/30100-2024-project9.adoc[Project 9] -*** xref:fall2024/30100/30100-2024-project10.adoc[Project 10] -*** xref:fall2024/30100/30100-2024-project11.adoc[Project 11] -*** xref:fall2024/30100/30100-2024-project12.adoc[Project 12] -*** xref:fall2024/30100/30100-2024-project13.adoc[Project 13] -*** xref:fall2024/30100/30100-2024-project14.adoc[Project 14] -** xref:fall2024/40100/40100-2024-projects.adoc[TDM 40100] -*** xref:fall2024/40100/40100-2024-project1.adoc[Project 1] -*** xref:fall2024/40100/40100-2024-project2.adoc[Project 2] -*** xref:fall2024/40100/40100-2024-project3.adoc[Project 3] -*** xref:fall2024/40100/40100-2024-project4.adoc[Project 4] -*** xref:fall2024/40100/40100-2024-project5.adoc[Project 5] -*** xref:fall2024/40100/40100-2024-project6.adoc[Project 6] -*** xref:fall2024/40100/40100-2024-project7.adoc[Project 7] -*** xref:fall2024/40100/40100-2024-project8.adoc[Project 8] -*** xref:fall2024/40100/40100-2024-project9.adoc[Project 9] -*** xref:fall2024/40100/40100-2024-project10.adoc[Project 10] -*** xref:fall2024/40100/40100-2024-project11.adoc[Project 11] -*** xref:fall2024/40100/40100-2024-project12.adoc[Project 12] -*** xref:fall2024/40100/40100-2024-project13.adoc[Project 13] -*** xref:fall2024/40100/40100-2024-project14.adoc[Project 14] - -* Project Archive -** Fall 2020 -*** STAT 19000 -**** xref:fall2020/19000/19000-f2020-project01.adoc[Project 1] -**** xref:fall2020/19000/19000-f2020-project02.adoc[Project 2] -**** xref:fall2020/19000/19000-f2020-project03.adoc[Project 3] -**** xref:fall2020/19000/19000-f2020-project04.adoc[Project 4] -**** xref:fall2020/19000/19000-f2020-project05.adoc[Project 5] -**** xref:fall2020/19000/19000-f2020-project06.adoc[Project 6] -**** xref:fall2020/19000/19000-f2020-project07.adoc[Project 7] -**** xref:fall2020/19000/19000-f2020-project08.adoc[Project 8] -**** xref:fall2020/19000/19000-f2020-project09.adoc[Project 9] -**** xref:fall2020/19000/19000-f2020-project10.adoc[Project 10] -**** xref:fall2020/19000/19000-f2020-project11.adoc[Project 11] -**** xref:fall2020/19000/19000-f2020-project12.adoc[Project 12] -**** xref:fall2020/19000/19000-f2020-project13.adoc[Project 13] -**** xref:fall2020/19000/19000-f2020-project14.adoc[Project 14] -**** xref:fall2020/19000/19000-f2020-project15.adoc[Project 15] -*** STAT 29000 -**** xref:fall2020/29000/29000-f2020-project01.adoc[Project 1] -**** xref:fall2020/29000/29000-f2020-project02.adoc[Project 2] -**** xref:fall2020/29000/29000-f2020-project03.adoc[Project 3] -**** xref:fall2020/29000/29000-f2020-project04.adoc[Project 4] -**** xref:fall2020/29000/29000-f2020-project05.adoc[Project 5] -**** xref:fall2020/29000/29000-f2020-project06.adoc[Project 6] -**** xref:fall2020/29000/29000-f2020-project07.adoc[Project 7] -**** xref:fall2020/29000/29000-f2020-project08.adoc[Project 8] -**** xref:fall2020/29000/29000-f2020-project09.adoc[Project 9] -**** xref:fall2020/29000/29000-f2020-project10.adoc[Project 10] -**** 
xref:fall2020/29000/29000-f2020-project11.adoc[Project 11] -**** xref:fall2020/29000/29000-f2020-project12.adoc[Project 12] -**** xref:fall2020/29000/29000-f2020-project13.adoc[Project 13] -**** xref:fall2020/29000/29000-f2020-project14.adoc[Project 14] -**** xref:fall2020/29000/29000-f2020-project15.adoc[Project 15] -*** STAT 39000 -**** xref:fall2020/39000/39000-f2020-project01.adoc[Project 1] -**** xref:fall2020/39000/39000-f2020-project02.adoc[Project 2] -**** xref:fall2020/39000/39000-f2020-project03.adoc[Project 3] -**** xref:fall2020/39000/39000-f2020-project04.adoc[Project 4] -**** xref:fall2020/39000/39000-f2020-project05.adoc[Project 5] -**** xref:fall2020/39000/39000-f2020-project06.adoc[Project 6] -**** xref:fall2020/39000/39000-f2020-project07.adoc[Project 7] -**** xref:fall2020/39000/39000-f2020-project08.adoc[Project 8] -**** xref:fall2020/39000/39000-f2020-project09.adoc[Project 9] -**** xref:fall2020/39000/39000-f2020-project10.adoc[Project 10] -**** xref:fall2020/39000/39000-f2020-project11.adoc[Project 11] -**** xref:fall2020/39000/39000-f2020-project12.adoc[Project 12] -**** xref:fall2020/39000/39000-f2020-project13.adoc[Project 13] -**** xref:fall2020/39000/39000-f2020-project14.adoc[Project 14] -**** xref:fall2020/39000/39000-f2020-project15.adoc[Project 15] -** Spring 2021 -*** STAT 19000 -**** xref:spring2021/19000/19000-s2021-project01.adoc[Project 1] -**** xref:spring2021/19000/19000-s2021-project02.adoc[Project 2] -**** xref:spring2021/19000/19000-s2021-project03.adoc[Project 3] -**** xref:spring2021/19000/19000-s2021-project04.adoc[Project 4] -**** xref:spring2021/19000/19000-s2021-project05.adoc[Project 5] -**** xref:spring2021/19000/19000-s2021-project06.adoc[Project 6] -**** xref:spring2021/19000/19000-s2021-project07.adoc[Project 7] -**** xref:spring2021/19000/19000-s2021-project08.adoc[Project 8] -**** xref:spring2021/19000/19000-s2021-project09.adoc[Project 9] -**** xref:spring2021/19000/19000-s2021-project10.adoc[Project 10] -**** xref:spring2021/19000/19000-s2021-project11.adoc[Project 11] -**** xref:spring2021/19000/19000-s2021-project12.adoc[Project 12] -**** xref:spring2021/19000/19000-s2021-project13.adoc[Project 13] -**** xref:spring2021/19000/19000-s2021-project14.adoc[Project 14] -**** xref:spring2021/19000/19000-s2021-project15.adoc[Project 15] -*** STAT 29000 -**** xref:spring2021/29000/29000-s2021-project01.adoc[Project 1] -**** xref:spring2021/29000/29000-s2021-project02.adoc[Project 2] -**** xref:spring2021/29000/29000-s2021-project03.adoc[Project 3] -**** xref:spring2021/29000/29000-s2021-project04.adoc[Project 4] -**** xref:spring2021/29000/29000-s2021-project05.adoc[Project 5] -**** xref:spring2021/29000/29000-s2021-project06.adoc[Project 6] -**** xref:spring2021/29000/29000-s2021-project07.adoc[Project 7] -**** xref:spring2021/29000/29000-s2021-project08.adoc[Project 8] -**** xref:spring2021/29000/29000-s2021-project09.adoc[Project 9] -**** xref:spring2021/29000/29000-s2021-project10.adoc[Project 10] -**** xref:spring2021/29000/29000-s2021-project11.adoc[Project 11] -**** xref:spring2021/29000/29000-s2021-project12.adoc[Project 12] -**** xref:spring2021/29000/29000-s2021-project13.adoc[Project 13] -**** xref:spring2021/29000/29000-s2021-project14.adoc[Project 14] -**** xref:spring2021/29000/29000-s2021-project15.adoc[Project 15] -*** STAT 39000 -**** xref:spring2021/39000/39000-s2021-project01.adoc[Project 1] -**** xref:spring2021/39000/39000-s2021-project02.adoc[Project 2] -**** xref:spring2021/39000/39000-s2021-project03.adoc[Project 3] 
-**** xref:spring2021/39000/39000-s2021-project04.adoc[Project 4] -**** xref:spring2021/39000/39000-s2021-project05.adoc[Project 5] -**** xref:spring2021/39000/39000-s2021-project06.adoc[Project 6] -**** xref:spring2021/39000/39000-s2021-project07.adoc[Project 7] -**** xref:spring2021/39000/39000-s2021-project08.adoc[Project 8] -**** xref:spring2021/39000/39000-s2021-project09.adoc[Project 9] -**** xref:spring2021/39000/39000-s2021-project10.adoc[Project 10] -**** xref:spring2021/39000/39000-s2021-project11.adoc[Project 11] -**** xref:spring2021/39000/39000-s2021-project12.adoc[Project 12] -**** xref:spring2021/39000/39000-s2021-project13.adoc[Project 13] -**** xref:spring2021/39000/39000-s2021-project14.adoc[Project 14] -**** xref:spring2021/39000/39000-s2021-project15.adoc[Project 15] -** Fall 2021 -*** xref:fall2021/19000/19000-f2021-projects.adoc[STAT 19000] -**** xref:fall2021/logistics/19000-f2021-officehours.adoc[Office Hours] -**** xref:fall2021/19000/19000-f2021-project01.adoc[Project 1] -**** xref:fall2021/19000/19000-f2021-project02.adoc[Project 2] -**** xref:fall2021/19000/19000-f2021-project03.adoc[Project 3] -**** xref:fall2021/19000/19000-f2021-project04.adoc[Project 4] -**** xref:fall2021/19000/19000-f2021-project05.adoc[Project 5] -**** xref:fall2021/19000/19000-f2021-project06.adoc[Project 6] -**** xref:fall2021/19000/19000-f2021-project07.adoc[Project 7] -**** xref:fall2021/19000/19000-f2021-project08.adoc[Project 8] -**** xref:fall2021/19000/19000-f2021-project09.adoc[Project 9] -**** xref:fall2021/19000/19000-f2021-project10.adoc[Project 10] -**** xref:fall2021/19000/19000-f2021-project11.adoc[Project 11] -**** xref:fall2021/19000/19000-f2021-project12.adoc[Project 12] -**** xref:fall2021/19000/19000-f2021-project13.adoc[Project 13] -*** xref:fall2021/29000/29000-f2021-projects.adoc[STAT 29000] -**** xref:fall2021/logistics/29000-f2021-officehours.adoc[Office Hours] -**** xref:fall2021/29000/29000-f2021-project01.adoc[Project 1] -**** xref:fall2021/29000/29000-f2021-project02.adoc[Project 2] -**** xref:fall2021/29000/29000-f2021-project03.adoc[Project 3] -**** xref:fall2021/29000/29000-f2021-project04.adoc[Project 4] -**** xref:fall2021/29000/29000-f2021-project05.adoc[Project 5] -**** xref:fall2021/29000/29000-f2021-project06.adoc[Project 6] -**** xref:fall2021/29000/29000-f2021-project07.adoc[Project 7] -**** xref:fall2021/29000/29000-f2021-project08.adoc[Project 8] -**** xref:fall2021/29000/29000-f2021-project09.adoc[Project 9] -**** xref:fall2021/29000/29000-f2021-project10.adoc[Project 10] -**** xref:fall2021/29000/29000-f2021-project11.adoc[Project 11] -**** xref:fall2021/29000/29000-f2021-project12.adoc[Project 12] -**** xref:fall2021/29000/29000-f2021-project13.adoc[Project 13] -*** xref:fall2021/39000/39000-f2021-projects.adoc[STAT 39000] -**** xref:fall2021/logistics/39000-f2021-officehours.adoc[Office Hours] -**** xref:fall2021/39000/39000-f2021-project01.adoc[Project 1] -**** xref:fall2021/39000/39000-f2021-project02.adoc[Project 2] -**** xref:fall2021/39000/39000-f2021-project03.adoc[Project 3] -**** xref:fall2021/39000/39000-f2021-project04.adoc[Project 4] -**** xref:fall2021/39000/39000-f2021-project05.adoc[Project 5] -**** xref:fall2021/39000/39000-f2021-project06.adoc[Project 6] -**** xref:fall2021/39000/39000-f2021-project07.adoc[Project 7] -**** xref:fall2021/39000/39000-f2021-project08.adoc[Project 8] -**** xref:fall2021/39000/39000-f2021-project09.adoc[Project 9] -**** xref:fall2021/39000/39000-f2021-project10.adoc[Project 10] -**** 
xref:fall2021/39000/39000-f2021-project11.adoc[Project 11] -**** xref:fall2021/39000/39000-f2021-project12.adoc[Project 12] -**** xref:fall2021/39000/39000-f2021-project13.adoc[Project 13] -** Spring 2022 -*** xref:spring2022/19000/19000-s2022-projects.adoc[STAT 19000] -**** xref:spring2022/19000/19000-s2022-project01.adoc[Project 1] -**** xref:spring2022/19000/19000-s2022-project02.adoc[Project 2] -**** xref:spring2022/19000/19000-s2022-project03.adoc[Project 3] -**** xref:spring2022/19000/19000-s2022-project04.adoc[Project 4] -**** xref:spring2022/19000/19000-s2022-project05.adoc[Project 5] -**** xref:spring2022/19000/19000-s2022-project06.adoc[Project 6] -**** xref:spring2022/19000/19000-s2022-project07.adoc[Project 7] -**** xref:spring2022/19000/19000-s2022-project08.adoc[Project 8] -**** xref:spring2022/19000/19000-s2022-project09.adoc[Project 9] -**** xref:spring2022/19000/19000-s2022-project10.adoc[Project 10] -**** xref:spring2022/19000/19000-s2022-project11.adoc[Project 11] -**** xref:spring2022/19000/19000-s2022-project12.adoc[Project 12] -**** xref:spring2022/19000/19000-s2022-project13.adoc[Project 13] -**** xref:spring2022/19000/19000-s2022-project14.adoc[Project 14] -*** xref:spring2022/29000/29000-s2022-projects.adoc[STAT 29000] -**** xref:spring2022/29000/29000-s2022-project01.adoc[Project 1] -**** xref:spring2022/29000/29000-s2022-project02.adoc[Project 2] -**** xref:spring2022/29000/29000-s2022-project03.adoc[Project 3] -**** xref:spring2022/29000/29000-s2022-project04.adoc[Project 4] -**** xref:spring2022/29000/29000-s2022-project05.adoc[Project 5] -**** xref:spring2022/29000/29000-s2022-project06.adoc[Project 6] -**** xref:spring2022/29000/29000-s2022-project07.adoc[Project 7] -**** xref:spring2022/29000/29000-s2022-project08.adoc[Project 8] -**** xref:spring2022/29000/29000-s2022-project09.adoc[Project 9] -**** xref:spring2022/29000/29000-s2022-project10.adoc[Project 10] -**** xref:spring2022/29000/29000-s2022-project11.adoc[Project 11] -**** xref:spring2022/29000/29000-s2022-project12.adoc[Project 12] -**** xref:spring2022/29000/29000-s2022-project13.adoc[Project 13] -**** xref:spring2022/29000/29000-s2022-project14.adoc[Project 14] -*** xref:spring2022/39000/39000-s2022-projects.adoc[STAT 39000] -**** xref:spring2022/39000/39000-s2022-project01.adoc[Project 1] -**** xref:spring2022/39000/39000-s2022-project02.adoc[Project 2] -**** xref:spring2022/39000/39000-s2022-project03.adoc[Project 3] -**** xref:spring2022/39000/39000-s2022-project04.adoc[Project 4] -**** xref:spring2022/39000/39000-s2022-project05.adoc[Project 5] -**** xref:spring2022/39000/39000-s2022-project06.adoc[Project 6] -**** xref:spring2022/39000/39000-s2022-project07.adoc[Project 7] -**** xref:spring2022/39000/39000-s2022-project08.adoc[Project 8] -**** xref:spring2022/39000/39000-s2022-project09.adoc[Project 9] -**** xref:spring2022/39000/39000-s2022-project10.adoc[Project 10] -**** xref:spring2022/39000/39000-s2022-project11.adoc[Project 11] -**** xref:spring2022/39000/39000-s2022-project12.adoc[Project 12] -**** xref:spring2022/39000/39000-s2022-project13.adoc[Project 13] -**** xref:spring2022/39000/39000-s2022-project14.adoc[Project 14] -** Fall 2022 -*** xref:fall2022/10100/10100-2022-projects.adoc[TDM 101] -**** xref:fall2022/logistics/10100-2022-officehours.adoc[Office Hours] -**** xref:fall2022/10100/10100-2022-project01.adoc[Project 1] -**** xref:fall2022/10100/10100-2022-project02.adoc[Project 2] -**** xref:fall2022/10100/10100-2022-project03.adoc[Project 3] -**** 
xref:fall2022/10100/10100-2022-project04.adoc[Project 4] -**** xref:fall2022/10100/10100-2022-project05.adoc[Project 5] -**** xref:fall2022/10100/10100-2022-project06.adoc[Project 6] -**** xref:fall2022/10100/10100-2022-project07.adoc[Project 7] -**** xref:fall2022/10100/10100-2022-project08.adoc[Project 8] -**** xref:fall2022/10100/10100-2022-project09.adoc[Project 9] -**** xref:fall2022/10100/10100-2022-project10.adoc[Project 10] -**** xref:fall2022/10100/10100-2022-project11.adoc[Project 11] -**** xref:fall2022/10100/10100-2022-project12.adoc[Project 12] -**** xref:fall2022/10100/10100-2022-project13.adoc[Project 13] -*** xref:fall2022/20100/20100-2022-projects.adoc[TDM 201] -**** xref:fall2022/logistics/20100-2022-officehours.adoc[Office Hours] -**** xref:fall2022/20100/20100-2022-project01.adoc[Project 1] -**** xref:fall2022/20100/20100-2022-project02.adoc[Project 2] -**** xref:fall2022/20100/20100-2022-project03.adoc[Project 3] -**** xref:fall2022/20100/20100-2022-project04.adoc[Project 4] -**** xref:fall2022/20100/20100-2022-project05.adoc[Project 5] -**** xref:fall2022/20100/20100-2022-project06.adoc[Project 6] -**** xref:fall2022/20100/20100-2022-project07.adoc[Project 7] -**** xref:fall2022/20100/20100-2022-project08.adoc[Project 8] -**** xref:fall2022/20100/20100-2022-project09.adoc[Project 9] -**** xref:fall2022/20100/20100-2022-project10.adoc[Project 10] -**** xref:fall2022/20100/20100-2022-project11.adoc[Project 11] -**** xref:fall2022/20100/20100-2022-project12.adoc[Project 12] -**** xref:fall2022/20100/20100-2022-project13.adoc[Project 13] -*** xref:fall2022/30100/30100-2022-projects.adoc[TDM 301] -**** xref:fall2022/logistics/30100-2022-officehours.adoc[Office Hours] -**** xref:fall2022/30100/30100-2022-project01.adoc[Project 1] -**** xref:fall2022/30100/30100-2022-project02.adoc[Project 2] -**** xref:fall2022/30100/30100-2022-project03.adoc[Project 3] -**** xref:fall2022/30100/30100-2022-project04.adoc[Project 4] -**** xref:fall2022/30100/30100-2022-project05.adoc[Project 5] -**** xref:fall2022/30100/30100-2022-project06.adoc[Project 6] -**** xref:fall2022/30100/30100-2022-project07.adoc[Project 7] -**** xref:fall2022/30100/30100-2022-project08.adoc[Project 8] -**** xref:fall2022/30100/30100-2022-project09.adoc[Project 9] -**** xref:fall2022/30100/30100-2022-project10.adoc[Project 10] -**** xref:fall2022/30100/30100-2022-project11.adoc[Project 11] -**** xref:fall2022/30100/30100-2022-project12.adoc[Project 12] -**** xref:fall2022/30100/30100-2022-project13.adoc[Project 13] -*** xref:fall2022/40100/40100-2022-projects.adoc[TDM 401] -**** xref:fall2022/logistics/40100-2022-officehours.adoc[Office Hours] -**** xref:fall2022/40100/40100-2022-project01.adoc[Project 1] -**** xref:fall2022/40100/40100-2022-project02.adoc[Project 2] -**** xref:fall2022/40100/40100-2022-project03.adoc[Project 3] -**** xref:fall2022/40100/40100-2022-project04.adoc[Project 4] -**** xref:fall2022/40100/40100-2022-project05.adoc[Project 5] -**** xref:fall2022/40100/40100-2022-project06.adoc[Project 6] -**** xref:fall2022/40100/40100-2022-project07.adoc[Project 7] -**** xref:fall2022/40100/40100-2022-project08.adoc[Project 8] -**** xref:fall2022/40100/40100-2022-project09.adoc[Project 9] -**** xref:fall2022/40100/40100-2022-project10.adoc[Project 10] -**** xref:fall2022/40100/40100-2022-project11.adoc[Project 11] -**** xref:fall2022/40100/40100-2022-project12.adoc[Project 12] -**** xref:fall2022/40100/40100-2022-project13.adoc[Project 13] -** Spring 2023 -*** 
xref:spring2023/10200/10200-2023-projects.adoc[TDM 102] -**** xref:spring2023/logistics/TA/office_hours.adoc[Office Hours] -**** xref:spring2023/10200/10200-2023-project01.adoc[Project 1] -**** xref:spring2023/10200/10200-2023-project02.adoc[Project 2] -**** xref:spring2023/10200/10200-2023-project03.adoc[Project 3] -**** xref:spring2023/10200/10200-2023-project04.adoc[Project 4] -**** xref:spring2023/10200/10200-2023-project05.adoc[Project 5] -**** xref:spring2023/10200/10200-2023-project06.adoc[Project 6] -**** xref:spring2023/10200/10200-2023-project07.adoc[Project 7] -**** xref:spring2023/10200/10200-2023-project08.adoc[Project 8] -**** xref:spring2023/10200/10200-2023-project09.adoc[Project 9] -**** xref:spring2023/10200/10200-2023-project10.adoc[Project 10] -**** xref:spring2023/10200/10200-2023-project11.adoc[Project 11] -**** xref:spring2023/10200/10200-2023-project12.adoc[Project 12] -**** xref:spring2023/10200/10200-2023-project13.adoc[Project 13] -*** xref:spring2023/20200/20200-2023-projects.adoc[TDM 202] -**** xref:spring2023/logistics/TA/office_hours.adoc[Office Hours] -**** xref:spring2023/20200/20200-2023-project01.adoc[Project 1] -**** xref:spring2023/20200/20200-2023-project02.adoc[Project 2] -**** xref:spring2023/20200/20200-2023-project03.adoc[Project 3] -**** xref:spring2023/20200/20200-2023-project04.adoc[Project 4] -**** xref:spring2023/20200/20200-2023-project05.adoc[Project 5] -**** xref:spring2023/20200/20200-2023-project06.adoc[Project 6] -**** xref:spring2023/20200/20200-2023-project07.adoc[Project 7] -**** xref:spring2023/20200/20200-2023-project08.adoc[Project 8] -**** xref:spring2023/20200/20200-2023-project09.adoc[Project 9] -**** xref:spring2023/20200/20200-2023-project10.adoc[Project 10] -**** xref:spring2023/20200/20200-2023-project11.adoc[Project 11] -**** xref:spring2023/20200/20200-2023-project12.adoc[Project 12] -**** xref:spring2023/20200/20200-2023-project13.adoc[Project 13] -*** xref:spring2023/30200/30200-2023-projects.adoc[TDM 302] -**** xref:spring2023/logistics/TA/office_hours.adoc[Office Hours] -**** xref:spring2023/30200/30200-2023-project01.adoc[Project 1] -**** xref:spring2023/30200/30200-2023-project02.adoc[Project 2] -**** xref:spring2023/30200/30200-2023-project03.adoc[Project 3] -**** xref:spring2023/30200/30200-2023-project04.adoc[Project 4] -**** xref:spring2023/30200/30200-2023-project05.adoc[Project 5] -**** xref:spring2023/30200/30200-2023-project06.adoc[Project 6] -**** xref:spring2023/30200/30200-2023-project07.adoc[Project 7] -**** xref:spring2023/30200/30200-2023-project08.adoc[Project 8] -**** xref:spring2023/30200/30200-2023-project09.adoc[Project 9] -**** xref:spring2023/30200/30200-2023-project10.adoc[Project 10] -**** xref:spring2023/30200/30200-2023-project11.adoc[Project 11] -**** xref:spring2023/30200/30200-2023-project12.adoc[Project 12] -**** xref:spring2023/30200/30200-2023-project13.adoc[Project 13] -*** xref:spring2023/40200/40200-2023-projects.adoc[TDM 402] -**** xref:spring2023/logistics/TA/office_hours.adoc[Office Hours] -**** xref:spring2023/40200/40200-2023-project01.adoc[Project 1] -**** xref:spring2023/40200/40200-2023-project02.adoc[Project 2] -**** xref:spring2023/40200/40200-2023-project03.adoc[Project 3] -**** xref:spring2023/40200/40200-2023-project04.adoc[Project 4] -**** xref:spring2023/40200/40200-2023-project05.adoc[Project 5] -**** xref:spring2023/40200/40200-2023-project06.adoc[Project 6] -**** xref:spring2023/40200/40200-2023-project07.adoc[Project 7] -**** 
xref:spring2023/40200/40200-2023-project08.adoc[Project 8] -**** xref:spring2023/40200/40200-2023-project09.adoc[Project 9] -**** xref:spring2023/40200/40200-2023-project10.adoc[Project 10] -**** xref:spring2023/40200/40200-2023-project11.adoc[Project 11] -**** xref:spring2023/40200/40200-2023-project12.adoc[Project 12] -**** xref:spring2023/40200/40200-2023-project13.adoc[Project 13] -** Fall 2023 -*** xref:fall2023/10100/10100-2023-projects.adoc[TDM 101] -**** xref:fall2023/logistics/office_hours_101.adoc[Office Hours] -**** xref:fall2023/10100/10100-2023-project01.adoc[Project 1] -**** xref:fall2023/10100/10100-2023-project02.adoc[Project 2] -**** xref:fall2023/10100/10100-2023-project03.adoc[Project 3] -**** xref:fall2023/10100/10100-2023-project04.adoc[Project 4] -**** xref:fall2023/10100/10100-2023-project05.adoc[Project 5] -**** xref:fall2023/10100/10100-2023-project06.adoc[Project 6] -**** xref:fall2023/10100/10100-2023-project07.adoc[Project 7] -**** xref:fall2023/10100/10100-2023-project08.adoc[Project 8] -**** xref:fall2023/10100/10100-2023-project09.adoc[Project 9] -**** xref:fall2023/10100/10100-2023-project10.adoc[Project 10] -**** xref:fall2023/10100/10100-2023-project11.adoc[Project 11] -**** xref:fall2023/10100/10100-2023-project12.adoc[Project 12] -**** xref:fall2023/10100/10100-2023-project13.adoc[Project 13] -*** xref:fall2023/20100/20100-2023-projects.adoc[TDM 201] -**** xref:fall2023/logistics/office_hours_201.adoc[Office Hours] -**** xref:fall2023/20100/20100-2023-project01.adoc[Project 1] -**** xref:fall2023/20100/20100-2023-project02.adoc[Project 2] -**** xref:fall2023/20100/20100-2023-project03.adoc[Project 3] -**** xref:fall2023/20100/20100-2023-project04.adoc[Project 4] -**** xref:fall2023/20100/20100-2023-project05.adoc[Project 5] -**** xref:fall2023/20100/20100-2023-project06.adoc[Project 6] -**** xref:fall2023/20100/20100-2023-project07.adoc[Project 7] -**** xref:fall2023/20100/20100-2023-project08.adoc[Project 8] -**** xref:fall2023/20100/20100-2023-project09.adoc[Project 9] -**** xref:fall2023/20100/20100-2023-project10.adoc[Project 10] -**** xref:fall2023/20100/20100-2023-project11.adoc[Project 11] -**** xref:fall2023/20100/20100-2023-project12.adoc[Project 12] -**** xref:fall2023/20100/20100-2023-project13.adoc[Project 13] -*** xref:fall2023/30100/30100-2023-projects.adoc[TDM 301] -**** xref:fall2023/logistics/office_hours_301.adoc[Office Hours] -**** xref:fall2023/30100/30100-2023-project01.adoc[Project 1] -**** xref:fall2023/30100/30100-2023-project02.adoc[Project 2] -**** xref:fall2023/30100/30100-2023-project03.adoc[Project 3] -**** xref:fall2023/30100/30100-2023-project04.adoc[Project 4] -**** xref:fall2023/30100/30100-2023-project05.adoc[Project 5] -**** xref:fall2023/30100/30100-2023-project06.adoc[Project 6] -**** xref:fall2023/30100/30100-2023-project07.adoc[Project 7] -**** xref:fall2023/30100/30100-2023-project08.adoc[Project 8] -**** xref:fall2023/30100/30100-2023-project09.adoc[Project 9] -**** xref:fall2023/30100/30100-2023-project10.adoc[Project 10] -**** xref:fall2023/30100/30100-2023-project11.adoc[Project 11] -**** xref:fall2023/30100/30100-2023-project12.adoc[Project 12] -**** xref:fall2023/30100/30100-2023-project13.adoc[Project 13] -*** xref:fall2023/40100/40100-2023-projects.adoc[TDM 401] -**** xref:fall2023/logistics/office_hours_401.adoc[Office Hours] -**** xref:fall2023/40100/40100-2023-project01.adoc[Project 1] -**** xref:fall2023/40100/40100-2023-project02.adoc[Project 2] -**** 
xref:fall2023/40100/40100-2023-project03.adoc[Project 3] -**** xref:fall2023/40100/40100-2023-project04.adoc[Project 4] -**** xref:fall2023/40100/40100-2023-project05.adoc[Project 5] -**** xref:fall2023/40100/40100-2023-project06.adoc[Project 6] -**** xref:fall2023/40100/40100-2023-project07.adoc[Project 7] -**** xref:fall2023/40100/40100-2023-project08.adoc[Project 8] -**** xref:fall2023/40100/40100-2023-project09.adoc[Project 9] -**** xref:fall2023/40100/40100-2023-project10.adoc[Project 10] -**** xref:fall2023/40100/40100-2023-project11.adoc[Project 11] -**** xref:fall2023/40100/40100-2023-project12.adoc[Project 12] -**** xref:fall2023/40100/40100-2023-project13.adoc[Project 13] -** Spring 2024 -*** xref:spring2024/10200/10200-2024-projects.adoc[TDM 10200] -**** xref:spring2024/10200/10200-2024-project01.adoc[Project 1] -**** xref:spring2024/10200/10200-2024-project02.adoc[Project 2] -**** xref:spring2024/10200/10200-2024-project03.adoc[Project 3] -**** xref:spring2024/10200/10200-2024-project04.adoc[Project 4] -**** xref:spring2024/10200/10200-2024-project05.adoc[Project 5] -**** xref:spring2024/10200/10200-2024-project06.adoc[Project 6] -**** xref:spring2024/10200/10200-2024-project07.adoc[Project 7] -**** xref:spring2024/10200/10200-2024-project08.adoc[Project 8] -**** xref:spring2024/10200/10200-2024-project09.adoc[Project 9] -**** xref:spring2024/10200/10200-2024-project10.adoc[Project 10] -**** xref:spring2024/10200/10200-2024-project11.adoc[Project 11] -**** xref:spring2024/10200/10200-2024-project12.adoc[Project 12] -**** xref:spring2024/10200/10200-2024-project13.adoc[Project 13] -**** xref:spring2024/10200/10200-2024-project14.adoc[Project 14] -*** xref:spring2024/20200/20200-2024-projects.adoc[TDM 20200] -**** xref:spring2024/20200/20200-2024-project01.adoc[Project 1] -**** xref:spring2024/20200/20200-2024-project02.adoc[Project 2] -**** xref:spring2024/20200/20200-2024-project03.adoc[Project 3] -**** xref:spring2024/20200/20200-2024-project04.adoc[Project 4] -**** xref:spring2024/20200/20200-2024-project05.adoc[Project 5] -**** xref:spring2024/20200/20200-2024-project06.adoc[Project 6] -**** xref:spring2024/20200/20200-2024-project07.adoc[Project 7] -**** xref:spring2024/20200/20200-2024-project08.adoc[Project 8] -**** xref:spring2024/20200/20200-2024-project09.adoc[Project 9] -**** xref:spring2024/20200/20200-2024-project10.adoc[Project 10] -**** xref:spring2024/20200/20200-2024-project11.adoc[Project 11] -**** xref:spring2024/20200/20200-2024-project12.adoc[Project 12] -**** xref:spring2024/20200/20200-2024-project13.adoc[Project 13] -**** xref:spring2024/20200/20200-2024-project14.adoc[Project 14] -*** xref:spring2024/30200_40200/30200-2024-projects.adoc[TDM 30200] -*** xref:spring2024/30200_40200/40200-2024-projects.adoc[TDM 40200] -** Think Summer 2024 -*** xref:summer2024/summer-2024-account-setup.adoc[Account Setup] -*** xref:summer2024/summer-2024-project-template.adoc[Project Template] -*** xref:summer2024/summer-2024-project-introduction.adoc[Introduction] -*** xref:summer2024/summer-2024-day1-notes.adoc[Day 1 Notes] -*** xref:summer2024/summer-2024-day2-notes.adoc[Day 2 Notes] -*** xref:summer2024/summer-2024-day3-notes.adoc[Day 3 Notes] -*** xref:summer2024/summer-2024-day4-notes.adoc[Day 4 Notes] -*** xref:summer2024/summer-2024-day5-notes.adoc[Day 5 Notes] -*** xref:summer2024/summer-2024-project-01.adoc[Project 1] -*** xref:summer2024/summer-2024-project-02.adoc[Project 2] -*** xref:summer2024/summer-2024-project-03.adoc[Project 3] -*** 
xref:summer2024/summer-2024-project-04.adoc[Project 4] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2018/tdm201819projects.adoc b/projects-appendix/modules/ROOT/pages/fall2018/tdm201819projects.adoc deleted file mode 100644 index 4424f189e..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2018/tdm201819projects.adoc +++ /dev/null @@ -1,2269 +0,0 @@ -= TDM Fall 2018 STAT 19000 Projects - -== Project 1 - -Question 1. - -Use the airline data stored in this directory: - -`/depot/statclass/data/dataexpo2009` - -In the year 2005, find: - -a. the number of flights that occurred, on every day of the year, and - -b. find the day of the year on which the most flights occur. - -Solution: - -We switch to the directory for the airline data - -`cd /depot/statclass/data/dataexpo2009` - -a. The number of flights that occurred, on every day of the year, can be obtained by extracting the 1st, 2nd, and 3rd fields, sorting the data, and then summarizing the data using the uniq command with the -c flag - -`sort 2005.csv | cut -d, -f1-3 | sort | uniq -c` - -The first few lines of the output are: - -[source,bash] ----- -16477 2005,10,1 -19885 2005,10,10 -19515 2005,10,11 -19701 2005,10,12 -19883 2005,10,13 ----- - -and the last few lines of the output are: - -[source,bash] ----- -20051 2005,9,6 -19629 2005,9,7 -19968 2005,9,8 -19938 2005,9,9 - 1 Year,Month,DayofMonth ----- - -b. The day of the year on which the most flights occur can be found by sorting the results above, in numerical order, using sort -n and then (if desired, although it is optional) we can extract the last line of the output using tail -n1 - -`sort 2005.csv | cut -d, -f1-3 | sort | uniq -c | sort -n | tail -n1` - -and we conclude that the most flights occur on August 5: - -`21041 2005,8,5` - - -Question 2. - -Again considering the year 2005, did United or Delta have more flights? - -Solution: - -We can extract the 9th field, which is the carrier (i.e., the airline company) and then, in the same way as above, we can sort the data, and then we can summarize the data using uniq -c - -This yields the number of flights for each carrier. We can either read the number of United or Delta flights with our eyeballs, or we can use the grep command, searching for both the pattern UA and DL to isolate (only) the number of flights for United and Delta, respectively. - -`sort 2005.csv | cut -d, -f9 | sort | uniq -c | grep "UA\|DL"` - -The output is - -[source,bash] ----- -658302 DL -485918 UA ----- - -so Delta has more flights than United in 2005. - - -Question 3. - -Consider the June 2017 taxi cab data, which is located in this folder: - -`/depot/statclass/data/taxi2018` - -What is the distribution of the number of passengers in the taxi cab rides? In other words, make a list of the number of rides that have 1 passenger; that have 2 passengers; etc. - -Solution: - -Now we change directories to consider the taxi cab data - -`cd ../taxi2018` - -The ".." in the previous command just indicates that we want to go up one level to - -`/depot/statclass/data` - -and then, from that point, we want to go into the taxi cab directory. 
If this sounds complicated, then (instead) it is safe to use the longer version: - -`cd /depot/statclass/data/taxi2018` - -The number of passengers is given in the 4th column, `passenger_count` - -We use a method that is similar to the one from the first three questions, we extract the 4th column, sort the data, and then summarizing the data using the uniq command with the -c flag - -`sort yellow_tripdata_2017-06.csv | cut -d, -f4 | sort | uniq -c` - -and the distribution of the number of passengers is: - -[source,bash] ----- - 1 - 548 0 -6933189 1 -1385066 2 - 406162 3 - 187979 4 - 455753 5 - 288220 6 - 26 7 - 30 8 - 20 9 - 1 passenger_count ----- - -Notice that we have some extraneous information, i.e., there is one blank line and also one line for the passenger_count (from the header) - - -== Project 2 - -Question 1. - -Use the airline data stored in this directory: - -`/depot/statclass/data/dataexpo2009` - -a. What was the average arrival delay (in minutes) for flights in 2005? - -b. What was the average departure delay (in minutes) for flights in 2005? - -cd. Now revise your solution to 1ab, to account for the delays (of both types) in the full set of data, across all years. - - -Question 2. - -Revise your solutions to 1abcd to only include flights that took place on the weekends. - -Question 3. - -Consider the June 2017 taxi cab data, which is located in this folder: - -`/depot/statclass/data/taxi2018` - -What is the average distance of a taxi cab ride in New York City in June 2017? - - -== Project 3 - -Use R to revisit these questions. They can each be accomplished with 1 line of code. - -Question 1. - -As in Project 1, question 2: In the year 2005, did United or Delta have more flights? - -Question 2. - -As in Project 2, question 2a: Restricting attention to weekends (only), what was the average arrival delay (in minutes) for flights in 2005? - -Question 3. - -As in Project 1, question 3: In June 2017, what is the distribution of the number of passengers in the taxi cab rides? - -Question 4. - -As in Project 2, question 3: What is the average distance of a taxi cab ride in New York City in June 2017? - - - - -== Project 4 - -Revisit the map code on the STAT 19000 webpage: - -http://www.stat.purdue.edu/datamine/19000/ - -Goal: Make a map of the State of Indiana, which shows all of Indiana's airports. - -Notes: - -You will need to install the ggmap package, which takes a few minutes to install. - -You can read in the data about the airports from the Data Expo 2009 Supplementary Data: - -http://stat-computing.org/dataexpo/2009/supplemental-data.html - -It will be necessary to extract (only) the airports with "state" equal to "IN" - -It is possible to either dynamically load the longitude and latitude of Indianapolis from Google, - -or to manually specify the longitude and latitude (e.g., by looking them up yourself in Google and entering them). - -After you plot the State of Indiana with all of the airports shown, - -you can print the resulting plot to a pdf file as follows: - -dev.print(pdf, "filename.pdf") - -Please submit your GitHub code in a ".R" file and also the resulting ".pdf" file. - -It is not (yet) necessary to submit your work in RMarkdown. - - - -== Project 5 - -Question 1. - -a. Compute the average distance for the flights on each airline in 2005. - -b. Sort the result from 1a, and make a dotchart to display the results in sorted order. (Please display all of the values in the dotchart.) 
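One possible way to tackle Question 1 is sketched below in R. This is only an illustrative sketch, not the official solution: it assumes the 2005 file from the Data Expo 2009 directory (the same file used in Project 1) has been read into a data frame, and that its `UniqueCarrier` and `Distance` columns hold the carrier codes and the flight distances in miles.

[source,r]
----
# read the 2005 flight data used in the earlier projects
myDF <- read.csv("/depot/statclass/data/dataexpo2009/2005.csv")

# 1a. average flight distance for each airline (carrier) in 2005
avgDist <- tapply(myDF$Distance, myDF$UniqueCarrier, mean, na.rm=TRUE)

# 1b. sort the averages and display all of them in a dotchart
dotchart(sort(avgDist), xlab="average distance (miles)")
----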
- -Hint: You can use: - -`?dotchart` - -if you want to read more about how to make a dotchart about the data. - - -Question 2. - -a. Compute the average total amount of the cost of taxi rides in June 2017, for each pickup location ID. You can see which variables have the total amount of the cost of the ride, as well as the pickup location ID, if you look at the data dictionary for the yellow taxi cab rides, which you can download here: `http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml` - -b. Sort the result from 2a, and make a dotchart to display the results in sorted order. (Please ONLY display the results with value bigger than 80.) - -Question 3. - -Put the two questions above -- including your comments -- into an RMarkdown file. Submit the .Rmd file itself and either the html or pdf output, when you submit your project in GitHub. - - - -== Project 6 - -Consider the election donation data: - -https://www.fec.gov/data/advanced/?tab=bulk-data - -from "Contributions by individuals" for 2017-18. Download this data. - - -Unzip the file (in the terminal). - -Use the cat command to concatenate all of the files in the by_date folder into one large file (in the terminal). - -Read the data dictionary: - -https://www.fec.gov/campaign-finance-data/contributions-individuals-file-description/ - - -Hint: When working with a file that is not comma separated, you can use the read.delim command in R, and *be sure to specify* the character that separates the various pieces of data on a row. -To do this, you can read the help file for read.delim by typing: ?read.delim -(Look for the "field separator character".) - -Also there is no header, so also use header=F - - -Question 1. - -Rank the states according to how many times that their citizens contributed (i.e., total number of donations). Which 5 states made the largest numbers of contributions? - -Question 2. - -Use awk in the terminal to verify your solution to question 1. - -Question 3. - -Now (instead) rank the states according to how much money their citizens contributed (i.e., total amount of donations). Which 5 states contributed the largest amount of money? - -(Optional!!) challenge question: Use awk in the terminal to verify your solution to question 3. -This can be done with 1 line of awk code, but you need to use arrays in awk, -as demonstrated (for instance) on Andrey's solution on this page: - -https://unix.stackexchange.com/questions/242946/using-awk-to-sum-the-values-of-a-column-based-on-the-values-of-another-column/242949 - -Submit your solutions in RMarkdown. -For question 2 (and for the optional challenge question), it is OK to just -put your code into your comments in RMarkdown, -so that the TA's can see how you solved question 2, -but (of course) the awk code does not run in RMarkdown! -You are just showing the awk code to the TA's in this way! - - -== Project 7 - -Consider the Lahman baseball database available at: -http://www.seanlahman.com/baseball-archive/statistics/ - -Download the 2017 comma-delimited version and unzip it. -Inside the "core" folder of the unzipped file, you will find many csv files. - -If you want to better understand the contents of the files, -there is a helpful readme file available here: -http://www.seanlahman.com/files/database/readme2017.txt - -Question 1. - -Use the Batting.csv file (inside the "core" folder) to discover who is a member of the 40-40 club, namely, who has hit 40 home runs and also has (simultaneously) stolen 40 bases in the same season. 
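As a rough sketch (not necessarily the intended solution), assuming `Batting.csv` from the unzipped "core" folder is in your working directory, the members of the 40-40 club can be found with a single subsetting line once the table is loaded:

[source,r]
----
# read the Batting table from the Lahman "core" folder
myBatting <- read.csv("Batting.csv")

# keep only the player-seasons with 40 or more HR and 40 or more SB
myBatting[which(myBatting$HR >= 40 & myBatting$SB >= 40), ]
----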
-Hint: There are multiple ways to solve this question. It is not necessary to use a tapply function. This can be done with one line of code. - -Question 2. - -Make a plot that depicts the total number of home runs per year (across all players on all teams). The plot should have the years as the labels for the x-axis, and should have the number of home runs as the labels for the y-axis. -Hints: Use the tapply function. Save the results of the tapply function in a vector v. If do this, then names(v) will have a list of the years. The plot command has options that include xlab and ylab, so that you can put intelligent labels on the axes, for instance, you can label the x-axis as "years" and the y-axis as "HR". - -Question 3. - -a. Try this example: Store the Batting table into a data frame called myBatting. Store the People table into a date frame called myPeople. Merge the two data frames into a new data frame, using the "merge" function: `myDF <- merge(myBatting, myPeople, by="playerID")` - -b. Use the paste command to paste the first and last name columns from myDF into a new vector. Save this new vector as a new column in the data frame myDF. - -c. Return to question 1, and resolve it. Now we can see the person's full name instead of their playerID. - - - -Fun Side Project (to accompany Project 7) - -Not required, but fun! - -read `Teams.csv` file into a `data.frame` called myDF - -break the data.frame into smaller data frames, -according to the `teamID`, using this code: - -`by(myDF, myDF$teamID, function(x) {plot(x$W)} )` - -For each team, this draws 1 plot of the number of wins per year. The number of wins will be on the y-axis of the plots. - -For an improved version, we can add the years on the x-axis, as follows: - -`by(myDF, myDF$teamID, function(x) {plot(x$year, x$W)} )` - -Change your working directory in R to a new folder, using the menu option: - -`Session -> Set Working Directory -> Choose Directory` - -We are going to make 149 new plots! - -After changing the directory, try this code, which makes 149 separate pdf files: - -`by(myDF, myDF$teamID, function(x) {pdf(as.character(x$teamID[1])); plot(x$year, x$W); dev.off()} )` - - -== SQL Example 1 - -We only need to install this package 1 time. - -`install.packages("RMySQL")` - -No need to run the line above, if you already ran it. - -We need to run this library every time we load R. - -[source,r] ----- -library("RMySQL") -myconnection <- dbConnect(dbDriver("MySQL"), - host="mydb.ics.purdue.edu", - username="mdw_guest", - password="MDW_csp2018", - dbname="mdw") - -easyquery <- function(x) { - fetch(dbSendQuery(myconnection, x), n=-1) -} ----- - -Here are the players from the Boston Red Sox in the year 2008 - -[source,r] ----- -myDF <- easyquery("SELECT m.playerID, b.yearID, b.teamID, - m.nameFirst, m.nameLast - FROM Batting b JOIN Master m - ON b.playerID = m.playerID - WHERE b.teamID = 'BOS' - AND b.yearID = 2008;") -myDF ----- - -== SQL Example 2 - -We only need to install this package 1 time. - -`install.packages("RMySQL")` - -No need to run the line above, if you already ran it. - -We need to run this library every time we load R. 
- -[source,r] ----- -library("RMySQL") -myconnection <- dbConnect(dbDriver("MySQL"), - host="mydb.ics.purdue.edu", - username="mdw_guest", - password="MDW_csp2018", - dbname="mdw") - -easyquery <- function(x) { - fetch(dbSendQuery(myconnection, x), n=-1) -} ----- - -Here are the total number of home runs hit by each player in their entire career - -[source,r] ----- -myDF <- easyquery("SELECT m.nameFirst, m.nameLast, - b.playerID, SUM(b.HR) - FROM Batting b JOIN Master m - ON m.playerID = b.playerID - GROUP BY b.playerID;") - -myDF ----- - -Here are the players who hit more than 600 home runs in their careers - -`myDF[ myDF$"SUM(b.HR)" >= 600, ]` - -== SQL Example 3 - -We only need to install this package 1 time. - -`install.packages("RMySQL")` - -No need to run the line above, if you already ran it. - -We need to run this library every time we load R. - -[source,r] ----- -library("RMySQL") -myconnection <- dbConnect(dbDriver("MySQL"), - host="mydb.ics.purdue.edu", - username="mdw_guest", - password="MDW_csp2018", - dbname="mdw") - -easyquery <- function(x) { - fetch(dbSendQuery(myconnection, x), n=-1) -} ----- - -Here is basic version for the players who have more than 60 Home Runs during one season. - -[source,r] ----- -myDF <- easyquery("SELECT b.playerID, b.yearID, b.HR - FROM Batting b - WHERE b.HR >= 60;") - -myDF ----- - -Here is an improved version, which includes the Batting and the Master table, so that we can have the players' full names. - -[source,r] ----- -myDF <- easyquery("SELECT m.nameFirst, m.nameLast, - b.playerID, b.yearID, b.HR - FROM Master m JOIN Batting b - ON m.playerID = b.playerID - WHERE b.HR >= 60;") - -myDF ----- - -== SQL Example 4 - -We only need to install this package 1 time. - -`install.packages("RMySQL")` - -No need to run the line above, if you already ran it. - -We need to run this library every time we load R. - -[source,r] ----- -library("RMySQL") -myconnection <- dbConnect(dbDriver("MySQL"), - host="mydb.ics.purdue.edu", - username="mdw_guest", - password="MDW_csp2018", - dbname="mdw") - -easyquery <- function(x) { - fetch(dbSendQuery(myconnection, x), n=-1) -} ----- - -Here is basic version for the 40-40 club question. (Same question as last week.) - -[source,r] ----- -myDF <- easyquery("SELECT b.playerID, b.yearID, b.SB, b.HR - FROM Batting b - WHERE b.SB >= 40 AND b.HR >= 40;") - -myDF ----- - -Here is an improved version, which includes the Batting and the Master table, so that we can have the players' full names. - -[source,r] ----- -myDF <- easyquery("SELECT m.nameFirst, m.nameLast, - b.yearID, b.SB, b.HR - FROM Master m JOIN Batting b - ON m.playerID = b.playerID - WHERE b.SB >= 40 AND b.HR >= 40;") - -myDF ----- - -Here is a further improved version, which includes the Batting, Master, and Teams table, so that we can have the players' full names, and the teams that they played on. - -[source,r] ----- -myDF <- easyquery("SELECT m.nameFirst, m.nameLast, - b.yearID, b.SB, b.HR, t.name - FROM Master m JOIN Batting b - ON m.playerID = b.playerID - JOIN Teams t - ON b.yearID = t.yearID - AND b.teamID = t.teamID - WHERE b.SB >= 40 AND b.HR >= 40;") -myDF ----- - - - - -== Project 8 - -Question 1. - -Modify SQL Example 2 to find the Pitcher who has the most Strikeouts in his career. - -Hint: You need to use a "Pitching p" table instead of a "Batting b" table. - -Hint: The strikeouts are in column "SO" of the Pitching table. - -Hint: This pitcher is named "Nolan Ryan"... but you need to use SQL to figure that out. 
- -I am just trying to give you a way to know when you are correct. - -Please momentarily forget that I am giving you the answer at the start! - -Question 2. - -Which years was Nolan Ryan a pitcher? - -For this project, to make your life easier, it is OK to just submit a regular R file, rather than an RMarkdown file. - - -== Project 9 - -(Please remember that you have a "ReadMe" file, posted on Piazza last week, which tells you about all of the tables, including the table that tells you where the students went to school.) - -1. Find the first and last names of all players who attended Purdue. - -2. Find all of the pitchers who have pitched 300 or more strikeouts during a single season. - -In the output, give their first and last name and the year in which this achievement occurred. -(You can just modify Example 3.) - -3a. Modify Example 5 to find out which pitchers were able to achieve 300 or more strikeouts AND 20 or more wins during the same season. - -3b. Consider the years in which this achievement occurred. Use R to find the list of distinct years in which this achievement occurred at least once. - -Background discussion: - -If you look at the example for the 40-40 club (in Example 4), it works because each time that a player achieved 40 (or more) HR's and 40 (or more) SB's during the same season, he was only playing for one team. A player never got traded to a new team, in any of those years. Some complications will arise if a player switches teams (i.e., gets traded) during the season. For this reason, we introduce Example 5. - -Here are some notes about Example 5: - -If we incorporate the SUM function into a condition, for instance, `WHERE SUM(b.SB) >= 40` the query will not work. Instead, if the condition has a `SUM` inside it, we change `WHERE` to `HAVING`. See Example 5 as a perfect example of this. We can also return the results in a given order, using: `ORDER BY` for instance, `ORDER BY by.yearID` if we want to get the results (say) in order by the year. - -== SQL Example 5 - -We only need to install this package 1 time. - -`install.packages("RMySQL")` - - No need to run the line above, if you already ran it. - -We need to run this library every time we load R. - -`library("RMySQL")` - -[source,r] ----- -myconnection <- dbConnect(dbDriver("MySQL"), - host="mydb.ics.purdue.edu", - username="mdw_guest", - password="MDW_csp2018", - dbname="mdw") - -easyquery <- function(x) { - fetch(dbSendQuery(myconnection, x), n=-1) -} ----- - -Here is basic version for the 30-30 club question. -(Same question as last week.) - - -[source,r] ----- -myDF <- easyquery("SELECT b.playerID, b.yearID, SUM(b.SB), SUM(b.HR) - FROM Batting b - GROUP BY b.playerID, b.yearID - HAVING SUM(b.SB) >= 30 AND SUM(b.HR) >= 30 - ORDER BY b.yearID;") -myDF ----- - -Here is an improved version, which includes the Batting and the Master table, -so that we can have the players' full names. - - -[source,r] ----- -myDF <- easyquery("SELECT m.nameFirst, m.nameLast, - b.yearID, SUM(b.SB), SUM(b.HR) - FROM Master m JOIN Batting b - ON m.playerID = b.playerID - GROUP BY b.playerID, b.yearID - HAVING SUM(b.SB) >= 30 AND SUM(b.HR) >= 30 - ORDER BY b.yearID;") -myDF ----- - - -== Project 10 - -Use the results of the National Park Service scraping example to answer the following two questions: - -1. Which states have at least 20 NPS properties? - -2. One zip code has 13 properties in the same zip code! What are the names of those 13 properties? 
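A minimal sketch of one way to answer both questions in R is given below. It assumes that `myDF` is the data frame built at the very end of the case study that follows, so that its `states`, `zips`, and `mynames` columns contain the scraped state, zip code, and property name for each NPS property.

[source,r]
----
# question 1: states that appear in at least 20 NPS property records
stateCounts <- table(myDF$states)
names(stateCounts[stateCounts >= 20])

# question 2: find the zip code with the most properties,
# then list the names of the properties located in that zip code
zipCounts <- sort(table(myDF$zips), decreasing=TRUE)
busiestZip <- names(zipCounts)[1]
myDF$mynames[!is.na(myDF$zips) & myDF$zips == busiestZip]
----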
- -If you want to learn XPath (as demonstrated in the case study) to scrape data from a website of your choice, you can make up the grades from 1 or 2 of the previous projects. If you scrape at least 500 pieces of data from the XML of a page,you can replace the grade from 1 previous project. If you scrape at least 1000 pieces of data from the XML of a page, you can replace the grade from 2 previous projects. Your project plan will require written approval from Dr Ward, and it will require you to scrape the data from XML itself (not just download the data). - -case study: scraping National Park Service data - -[source,r] ----- -# This is a short project to download the data about the -# properties in the National Park Service (NPS). -# They are all online through the office NPS webpage: -# https://www.nps.gov/findapark/index.htm -# (Please note that some parks extend into more than one state.) - -# At the end of the project, when we export the data, -# we do not want to use comma-separated values (i.e., a csv file) -# because there are also some commas in our data. -# So we will use tabs as our delimiter at the end of this process. - -# We will use the RCurl package to download the NPS files. -# Normally we could just parse the XML (or html) content -# on-the-fly, without downloading the files, but in this case, -# it wasn't working on about 10 of the files, and somehow -# when I downloaded the files, it worked completely. -# I tried this several times, and just going ahead and downloading -# the files seems to be the most consistent solution. -install.packages("RCurl") -library(RCurl) - -# We will use the XML package to parse the html (or XML) data -install.packages("XML") -library(XML) - -# We will use the xlsx package to export the results at the end, -# into an xlsx file, for viewing in Microsoft Excel, if desired. -install.packages("xlsx") -library(xlsx) - -# To see the list of the parks, we can go here: -# https://www.nps.gov/findapark/index.htm -# in any browser. -# In most browsers, if you navigate to a page and then type: -# Control-U (i.e., the Control Key and the letter U Key at once) -# on a Windows or UNIX machine, -# or if you type Command-U (i.e., the Command Key and the letter U Key at once) -# on an Apple Macintosh machine, -# then you can see the code for the way that the webpage is created. - -# This webpage that I mentioned: -# https://www.nps.gov/findapark/index.htm -# has 1489 lines of code. Wow. - - -# From (roughly) lines 206 through 756, we see that the -# data for the parks are wrapped in a "div" (on line 206) -# and then in a "select" (on line 208) -# and then in an "optgroup" and then an "option". -# We want to extract the "value" of each "option". -# (We skip the "label" on line 205 because it ends on line 205 too.) -# So we do the following: - -myparks <- xpathSApply(htmlParse(getURL("https://www.nps.gov/findapark/index.htm")), "//*/div/select/optgroup/option", xmlGetAttr, "value") -myparks - -# If the line of code (above) doesn't work, -# then perhaps you forgot to actually run the three "library" commands -# near the start of the file. - -# We did a lot of things with 1 line of code. -# The "getURL" temporarily downloads all of the code from this webpage. -# We do not save the webpage, but rather, we send it to the htmlParse command. -# Once the page is parsed, we send the parsed results to the xpathSApply command. 
-# The pattern we want to look for is: -# "//*/div/select/optgroup/option" -# The star means that anything is OK before this chunk of the pattern, -# but we definitely want our pattern to end with /div/select/optgroup/option -# and then we get the xmlGetAttr attribute called "value" -# which is one of the parks. - -# When we check the results, we got 498 results: -length(myparks) - -# For the Abraham Lincoln Birthplace, we want to run the following command, -# so that we are prepared to download the webpage. -# After downloading it, we will extract information from the parsed page: -system("mkdir ~/Desktop/myparks/") -download.file("https://www.nps.gov/abli/index.htm", "~/Desktop/myparks/abli.htm") -htmlParse("~/Desktop/myparks/abli.htm") - -# but we want to do that for each park. -# So we build the following function: -myparser <- function(x) { - download.file(paste("https://www.nps.gov/", x, "/index.htm", sep=""), paste("~/Desktop/myparks/", x, ".htm", sep="")) - htmlParse(paste("~/Desktop/myparks/", x, ".htm", sep="")) -} - -# Now, we apply this function to each element of "myparks" -# and we save the results in a variable called "mydocs": -mydocs <- sapply(myparks, myparser) - -# The webpage for the Abraham Lincoln Birthplace is now parsed and stored here: -mydocs[[1]] -# The webpage for Zion National Park is now parsed and stored here: -mydocs[[498]] - -# Next we look at the source for the Abraham Lincoln Birthplace: -# https://www.nps.gov/abli/index.htm -# We load that webpage in any browser and then type: -# Control-U if we are on a Windows or UNIX machine, or -# Command-U if we are on a Mac. - -# Then we can search in this page (using Control-F on Windows or UNIX, -# or using Command-F on a Mac) for any pattern we want. -# If we search for "itemprop" -# we find the information about the address: - -# They are all within a "span" tag, with different "itemprop" attributes: -# The street address has attribute: "streetAddress" -# The city has attribute: "addressLocality" -# The state has attribute: "addressRegion" -# The zip code has attribute: "postalCode" -# The telephone has attribute: "telephone" - -# So, for instance, we can find all of these as follows: -xpathSApply(mydocs[[1]], "//*/span[@itemprop='streetAddress']", xmlValue) -xpathSApply(mydocs[[1]], "//*/span[@itemprop='addressLocality']", xmlValue) -xpathSApply(mydocs[[1]], "//*/span[@itemprop='addressRegion']", xmlValue) -xpathSApply(mydocs[[1]], "//*/span[@itemprop='postalCode']", xmlValue) -xpathSApply(mydocs[[1]], "//*/span[@itemprop='telephone']", xmlValue) - -# Then the title stuff: - -xpathSApply(mydocs[[1]], "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue) -xpathSApply(mydocs[[1]], "//*/span[@class='Hero-designation']", xmlValue) -xpathSApply(mydocs[[1]], "//*/span[@class='Hero-location']", xmlValue) - -# and, finally, the social media links: - -paste(xpathSApply(mydocs[[1]], "//*/div/ul/li[@class='col-xs-6 col-sm-12 col-md-6']/a", xmlGetAttr, "href"),collapse=",") - - -# Here are the versions for the entire data set: - -streets <- sapply(mydocs, function(x) xpathSApply(x, "//*/span[@itemprop='streetAddress']", xmlValue)) -cities <- sapply(mydocs, function(x) xpathSApply(x, "//*/span[@itemprop='addressLocality']", xmlValue)) -states <- sapply(mydocs, function(x) xpathSApply(x, "//*/span[@itemprop='addressRegion']", xmlValue)) -zips <- sapply(mydocs, function(x) xpathSApply(x, "//*/span[@itemprop='postalCode']", xmlValue)) -phones <- sapply(mydocs, function(x) xpathSApply(x, "//*/span[@itemprop='telephone']", 
xmlValue)) - -mynames <- sapply(mydocs, function(x) xpathSApply(x, "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue)) -mytypes <- sapply(mydocs, function(x) xpathSApply(x, "//*/span[@class='Hero-designation']", xmlValue)) -mylocations <- sapply(mydocs, function(x) xpathSApply(x, "//*/span[@class='Hero-location']", xmlValue)) - -mylinks <- sapply(mydocs, function(x) paste(xpathSApply(x, "//*/div/ul/li[@class='col-xs-6 col-sm-12 col-md-6']/a", xmlGetAttr, "href"),collapse=",")) - -# with some cleaning up: - -streets <- sapply(streets, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE) -cities <- sapply(cities, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE) -states <- sapply(states, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE) -zips <- sapply(zips, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE) -phones <- sapply(phones, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE) -mynames <- sapply(mynames, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE) -mytypes <- sapply(mytypes, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE) -mylocations <- sapply(mylocations, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE) -mylinks <- sapply(mylinks, function(x) ifelse(length(x)==0,NA,sub("^\\s+","",sub("\\s+$","",x))), simplify=FALSE) - -myDF <- data.frame( -streets=do.call(rbind,streets), -cities=do.call(rbind,cities), -states=do.call(rbind,states), -zips=do.call(rbind,zips), -phones=do.call(rbind,phones), -mynames=do.call(rbind,mynames), -mytypes=do.call(rbind,mytypes), -mylocations=do.call(rbind,mylocations), -mylinks=do.call(rbind,mylinks) -) ----- - -== Project 11: - -The names in the election data are in CAPITAL LETTERS! - -When asking about names in the questions, we assume that you are using the names from the election data, available on Scholar. - -You might want to practice on a smaller data set: `/depot/statclass/data/election2018/itsmall.txt` - -The full data is available here: `/depot/statclass/data/election2018/itcont.txt` - -We are assuming that you are using unique names from column 8, i.e., that you have already removed duplicates of any names of the donors. - -Hint: Save column 8 (which contains the donor names) into a new variable. Then extract the unique values from the column using the "unique" command. - -Answer these questions using the full data given above. BUT, for convenience, you might want to *start* by using the smaller data set to practice. - -Please note that we can read the data into R using the command: - -`myDF <- read.csv("/depot/statclass/data/election2018/itsmall.txt", header=F, sep="|")` - -or, for the full data set: - -`myDF <- read.csv("/depot/statclass/data/election2018/itcont.txt", header=F, sep="|")` - -1. Find the number of (unique) donor names who have your first name, - embedded somewhere in the donor's name (not necessarily as the - first or last name--any location is OK). - -2. a. How many donors have a consecutive repeated letter in their name? b. How many donors have a consecutive repeated vowel in their name? c. How many donors have a consecutive repeated consonant in their name? - -3. Just for fun: Come up with an interesting question about text patterns, and answer it yourself, using regular expressions. 
Of course you can compare questions and answers with another member of The Data Mine. Have fun! - -[source,bash] ----- - -Regular expressions enable us to find patterns in text. -Here are a handful of examples of regular expressions. - -The best way to learn them in earnest is to just read some documentation about regular expressions and then try them! - -Here is an example: -v <- c("me", "you", "mark", "laura", "kale", "emma", "err", "eat", "queue", "kangaroo", "kangarooooo", "kangarooooooooo") - -The elements of v that contain the letter "m": -v[grep("m", v)] - -containing the phrase "me": -v[grep("me", v)] - -containing the letter "a": -v[grep("a", v)] - -containing the letter "e": -v[grep("e", v)] - -containing the letter "k": -v[grep("k", v)] - -containing the letter "k" at the start of the word: -v[grep("^k", v)] - -containing the letter "k" at the end of the word: -v[grep("k$", v)] - -containing the letter "a" at the end of the word: -v[grep("a$", v)] - -containing the letter "o" at the end of the word: -v[grep("o$", v)] - -containing the letter "o" anywhere in the word: -v[grep("o", v)] - -containing the letter "o" two times in a row, anywhere in the word: -v[grep("o{2}", v)] - -containing the letter "o" three times in a row, anywhere in the word: -v[grep("o{3}", v)] - -containing the letter "o" two to five times in a row, anywhere in the word: -v[grep("o{2,5}", v)] - -containing the letter "q" followed by "ue": -v[grep("q(ue){1}", v)] - -containing the letter "q" followed by "ue" two times: -v[grep("q(ue){2}", v)] - -containing the letter "q" followed by "ue" three times: -v[grep("q(ue){3}", v)] - -containing the letter "e" followed by "m" or "r": -v[grep("e(m|r)", v)] - -again, same idea, but different way, to find words -containing the letter "e" followed by "m" or "r": -v[grep("e[mr]", v)] - -containing the letter "e" followed by "ma" or "rr": -v[grep("e(ma|rr)", v)] - -containing a repeated letter: -v[grep("([a-z])\\1", v)] -In this example, the \\1 refers to whatever was found in the first match -(which is just given in parentheses for convenience) - -Here is a summary of regular expressions: - -https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285 - -You are welcome to use any source or reference for regular expressions that you like. - -We need to use double backslash for back-references, in R. -We gave a demonstration of this, in the last example given above. -In general, in R, when writing a backslash in a regular expression, a double backslash is usually needed. ----- - - -== Project 12 - -There was no project 12 - -== Project 13 - -There was no project 13 - -== Project 14 - -Remind ourselves how to use bash and awk tools (previously we did this in the terminal). - -We will do it in Jupyter Notebooks this semester: `http://notebook.scholar.rcac.purdue.edu/` - -1. a. Start a new Jupyter Notebook with type "bash" (instead of "R"). We are going to put bash code directly inside the Jupyter Notebook. (In the past, we only wrote bash code directly inside the terminal.) b. Look at the first 10 lines of the 2007 flight data, which is found at: `/depot/statclass/data/dataexpo2009/2007.csv` All of the flights in those first 10 lines are on the same carrier. Which carrier is it? Remember that you can check: `http://stat-computing.org/dataexpo/2009/the-data.html` Now we are going to put awk code directly inside the Jupyter Notebook. (In the past, we only wrote awk code directly inside the terminal.) - -2. 
Save the information about every flight departing from Indianapolis since January 1, 2000 into a common file, named `MyIndyFlights.csv` - -Hint 1: You only need the files 2000.csv, 2001.csv, ..., 2008.csv You can work on all of those files at once, using 2*.csv because the "*" is like a wildcard, that matches any pattern. - -Hint 2: You can use awk to do this. For comparison, ONLY as an example, we can extract all flights -that are on Delta airlines in 1998 as follows: -`cat /depot/statclass/data/dataexpo2009/1998.csv | awk -F, '{ if($9 == "DL") {print $0} }' >MyDeltaFlights.csv` - -== Project 14 Solutions - - -[source,bash] ----- -# 1. The head of the file with the 2007 flights is: -head /depot/statclass/data/dataexpo2009/2007.csv - -# We see that the UniqueCarrier is found in column 9. -# One way to extract the UniqueCarrier is with the cut command -# using a comma as the delimiter and retrieving (cut out) the 9th column: -cut -d, -f9 /depot/statclass/data/dataexpo2009/2007.csv | head -n11 -# We only displayed the head, because we only want the first 10 flights. -# We specified -n11 because this prints the first 11 lines of the file, -# namely, the header itself, and the first 10 flights. -# We can check the data dictionary, available at: http://stat-computing.org/dataexpo/2009/ -# The information about the carrier codes is found there, -# by clicking on the link for supplemental data sources: http://stat-computing.org/dataexpo/2009/supplemental-data.html -and then choosing the carriers file: http://stat-computing.org/dataexpo/2009/carriers.csv -# The carrier code "WN" for each of these first ten flights is Southwest. - -# 2. We save the information about the Indianapolis flights by using awk. -# First we recall how to see the information about all such flights. -# Here are the first 10 lines of that data. -cat /depot/statclass/data/dataexpo2009/2*.csv | head -# Then we change the "head" to the "awk" command. -# We use comma as the field separator -# (this is the same as the role of the delimiter from cut) -# We modify the example from the project assignment, -# so that we focus on the 17th field (which are the Origin airports) -# and we save the resulting data into a file called MyIndyFlights.csv - -cat /depot/statclass/data/dataexpo2009/2*.csv | awk -F, '{ if($17 == "IND") {print $0} }' >MyIndyFlights.csv -# Some of you were not working in your home directory when you ran this commmand. -# If you want to be sure to save the file into your home directory, -# remember that you can explicitly specify your home directory using a tilde, as follows: -cat /depot/statclass/data/dataexpo2009/2*.csv | awk -F, '{ if($17 == "IND") {print $0} }' >~/MyIndyFlights.csv -# It is not required that you check things, -# but if you want to check that things worked properly, you can use the wc command -# which gives the number of lines, words, and bytes in the resulting file: -wc MyIndyFlights.csv -# or, even more explicitly, -wc ~/MyIndyFlights.csv -# An alternative is to check the head and the tail: -head MyIndyFlights.csv -tail MyIndyFlights.csv -# or, even more explicitly, -head ~/MyIndyFlights.csv -tail ~/MyIndyFlights.csv ----- - -== Project 15 - -Remind ourselves how to use R tools (previously we did this in the terminal). We will do it in Jupyter Notebooks this semester. - -Question 1 - -a. Start a new Jupyter Notebook with type "R" - -b. Import the flight data from the file MyIndyFlights.csv in a data frame. You just created this file in Project 14. 
It contains all of the flights that departed from Indianapolis since January 1, 2000. (There should be 356561 flights altogether, and there is no header.) Hint: When you import the data, if you use the read.csv command, there is no header, so be sure to use header=FALSE. - -c. What are the five most popular destinations for travelers who depart Indianapolis since January 1, 2000? List each of these 5 destinations, and the number of flights to each one. - - -Question 2 - -a. Consider the year 2005 (only). Tabulate the number of flights per day. - -b. On each of the most popular five days, how many flights are there? - -c. On each of the least popular five days, how many flights are there? - -Hint: You might be surprised to see the wide range of the number of flights per day! - -== Project 15 Solutions - - -[source,R] ----- - -# 1. We first import the flight data from the file MyIndyFlights.csv - -myDF <- read.csv("MyIndyFlights.csv", header=F) - -# or, if you prefer to explicitly state that the file -# is in your home directory, you can add the tilde for your home: - -myDF <- read.csv("~/MyIndyFlights.csv", header=F) - -# We check that there are 356561 flights altogether: - -dim(myDF) - -# The five most popular destinations for travelers -# who depart Indianapolis since January 1, 2000 are: -tail(sort(table(myDF[[18]])),n=5) - -# We used the 18th column, which has the Destination airports. -# We tabulated the results, using the table command, -# and then we sorted the results. -# Finally, at the end, we took the tail of the results, -# using n=5, since we wanted to see the largest 5 values. - -# 2a. We load the 2005 data: - -myDF <- read.csv("/depot/statclass/data/dataexpo2009/2005.csv") - -# To get the number of flights per day, -# we can first paste together the Month and Day columns. -# We check the head, to make sure that this worked: - -head(paste(myDF$Month, myDF$DayofMonth)) - -# It is also possible, for instance, to separate the -# month and the day by separators, such as a slash: - -head(paste(myDF$Month, myDF$DayofMonth, sep="/")) - -# or a dash: - -head(paste(myDF$Month, myDF$DayofMonth, sep="-")) - -# Now we can tabulate the number of flights per day, -# using the table command: - -table(paste(myDF$Month, myDF$DayofMonth, sep="/")) - -# 2b. To find the most popular five days, -# we can sort the table, and then consider the tail, -# using the n=5 option, -# since we only want the 5 most popular dates. - -tail(sort(table(paste(myDF$Month, myDF$DayofMonth, sep="/"))),n=5) - -# 2c. We just change tail to head, -# to find the 5 least popular dates: - -head(sort(table(paste(myDF$Month, myDF$DayofMonth, sep="/"))),n=5) - ----- - - -== Project 16 - -Project 16 needs to be saved as a `.ipynb` file. This is different from the previous two assignments where the file was uploaded directly from each students Github page. Students need to download it from this link. Thanks! - -https://raw.githubusercontent.com/TheDataMine/STAT-19000/master/Assignments/hw16.ipynb - -Question 1 - -Consider the flights from 2005 in the Data Expo 2009 data set. The actual departure times, as you know, are given in the DepTime column. In this question, we want to categorize the departure times according to the hour of departure. For instance, any time in the 4 o'clock in the (very early morning) hour should be classified together. These are the times between 0400 and 0459 (because the times are given in military time). 
One way to do this is to divide each of the times by 100, and then to take the "floor" of the results, and then make a "table" of the results. For practice (just to understand things), give this a try with the head of the DepTime, one step at a time, to make sure that you understand what is happening. Then: a. Classify all of the 2005 departure times, according to the hour of departure, using this method. b. During which hour of the day did the most flights depart? - -Question 2 - -a. Here is another way to solve the question above. Read the documentation for the "cut" command. For the "breaks" parameter, use: -seq(0, 2900, by=100) -and be sure to set the parameter "right" to be FALSE. - -b. Check that you get the same result as in question 1, using this method. - -c. Why did we choose to use 2900 instead of (say) 2400 in this method? - -== Project 16 Solutions - - -[source,R] ----- -# 1a. We read the data from the 2005 flights into a data frame - -myDF <- read.csv("/depot/statclass/data/dataexpo2009/2005.csv") - -# Then we divide each time by 100 and take the floor: - -table(floor(myDF$DepTime/100)) - -# and we get: - -# 0 1 2 3 4 5 6 7 8 9 10 -# 21747 7092 2027 458 1610 114469 430723 440532 469386 447705 432526 -# 11 12 13 14 15 16 17 18 19 20 21 -# 446432 443252 440903 416661 441021 424299 457678 431613 390398 321680 235810 -# 22 23 24 25 26 27 28 -# 128382 58386 1711 301 56 7 1 - -# 1b. The most flights departed during 8 AM to 9 AM; - -sort(table(floor(myDF$DepTime/100))) - -# 28 27 26 25 3 4 24 2 1 0 23 -# 1 7 56 301 458 1610 1711 2027 7092 21747 58386 -# 5 22 21 20 19 14 16 6 18 10 7 -# 114469 128382 235810 321680 390398 416661 424299 430723 431613 432526 440532 -# 13 15 12 11 9 17 8 -# 440903 441021 443252 446432 447705 457678 469386 - -# 2a. 
We cut the DepTime column, using the breaks of 0000 through 2900 - -table(cut(myDF$DepTime, breaks=seq(0000,2900,by=100), right=FALSE)) - -# and we get: - -# [0,100) [100,200) [200,300) [300,400) -# 21747 7092 2027 458 -# [400,500) [500,600) [600,700) [700,800) -# 1610 114469 430723 440532 -# [800,900) [900,1e+03) [1e+03,1.1e+03) [1.1e+03,1.2e+03) -# 469386 447705 432526 446432 -# [1.2e+03,1.3e+03) [1.3e+03,1.4e+03) [1.4e+03,1.5e+03) [1.5e+03,1.6e+03) -# 443252 440903 416661 441021 -# [1.6e+03,1.7e+03) [1.7e+03,1.8e+03) [1.8e+03,1.9e+03) [1.9e+03,2e+03) -# 424299 457678 431613 390398 -# [2e+03,2.1e+03) [2.1e+03,2.2e+03) [2.2e+03,2.3e+03) [2.3e+03,2.4e+03) -# 321680 235810 128382 58386 -# [2.4e+03,2.5e+03) [2.5e+03,2.6e+03) [2.6e+03,2.7e+03) [2.7e+03,2.8e+03) -# 1711 301 56 7 -# [2.8e+03,2.9e+03) -# 1 - -# or if you want to re-format the output, you can write, for instance: - -table(cut(myDF$DepTime, breaks=seq(0000,2900,by=100), dig.lab=4, right=FALSE)) - -# [0,100) [100,200) [200,300) [300,400) [400,500) [500,600) -# 21747 7092 2027 458 1610 114469 -# [600,700) [700,800) [800,900) [900,1000) [1000,1100) [1100,1200) -# 430723 440532 469386 447705 432526 446432 -# [1200,1300) [1300,1400) [1400,1500) [1500,1600) [1600,1700) [1700,1800) -# 443252 440903 416661 441021 424299 457678 -# [1800,1900) [1900,2000) [2000,2100) [2100,2200) [2200,2300) [2300,2400) -# 431613 390398 321680 235810 128382 58386 -# [2400,2500) [2500,2600) [2600,2700) [2700,2800) [2800,2900) -# 1711 301 56 7 1 - -# We just sort the command above, and we see that -# the most flights departed during 8 AM to 9 AM - -sort(table(cut(myDF$DepTime, breaks=seq(0000,2900,by=100), dig.lab=4, right=FALSE))) - -# [2800,2900) [2700,2800) [2600,2700) [2500,2600) [300,400) [400,500) -# 1 7 56 301 458 1610 -# [2400,2500) [200,300) [100,200) [0,100) [2300,2400) [500,600) -# 1711 2027 7092 21747 58386 114469 -# [2200,2300) [2100,2200) [2000,2100) [1900,2000) [1400,1500) [1600,1700) -# 128382 235810 321680 390398 416661 424299 -# [600,700) [1800,1900) [1000,1100) [700,800) [1300,1400) [1500,1600) -# 430723 431613 432526 440532 440903 441021 -# [1200,1300) [1100,1200) [900,1000) [1700,1800) [800,900) -# 443252 446432 447705 457678 469386 - -# 2b. We do get the same results as in question 1. - -# 2c. We choose to use 2900 instead of (say) 2400 in this method -# because some flights departed after midnight. -# The time stamps are between 0000 and 2400 -# (this is like military time, between 00:00 and 24:00). -# Some flights have delays until after midnight, -# and they are recorded in a surprising way, -# e.g., 24:30 for 30 minutes past midnight, -# or 26:10 for 2 hours and 10 minutes past midnight. -# In our data set, it happens that all of the ranges of the times -# are between 0000 and 2900. I just checked the max to find that out. -# So that's why we use 2900 as an upper boundary, instead of 2400. - -max(myDF$DepTime, na.rm=T) - ----- - - -== Project 17 - -Please download this template and use it to submit your solutions to GitHub: - -https://raw.githubusercontent.com/TheDataMine/STAT-19000/master/Assignments/hw17.ipynb - -Recall the 2018 election data, available here: `/depot/statclass/data/election2018/itcont.txt` - -and the data dictionary for this data, which is available here: - -https://www.fec.gov/campaign-finance-data/contributions-individuals-file-description - -Question 1 - -a. 
Use the system command in R to read the data for the first 100,000 donations and store this data into a file called: shortfile.txt (We use .txt instead of .csv because the file is not comma delimited.) - -b. Use the read.csv command to read this data into a data frame in R, called: myDF (Hint: check the help for read.csv: ?read.csv to remind yourself about the "sep" and the "header" parameters for read.csv. In particular, this data has "|" as the separator between the data elements, and it does not have a header.) - -c. Check the dimension of the resulting data frame. It should be 100,000 rows and 21 columns. - -Question 2 - -a. Split the data for these 100,000 donations according to the State from which the donation was given. Store the resulting data in a list called: myresult (Hint: Check the data dictionary for the meanings of the columns, since we do not have column headers.) (Another hint: Remember that we can refer to a column of data in a data frame by its number, for instance, myDF[[8]] is the name of the donor.) - -b. Check the names of myresult: names(myresult) We see the the first element of the list does not have a name. This is a pain! To solve this, you can give it a name, for instance, by writing: names(myresult)[1] <- "unknown" (or any other kind of name that you want, to indicate that the name is unknown) - -Question 3 - -a. Find the mean donation amount, according to each state. - -b. What is the mean donation from Hoosiers (i.e., for people from Indiana)? - -c. Find the standard deviation of the donation amount, according to each state. - -d. Find the number of donations, according to each state. - -e. For a sanity check, make sure that the number of donations in 3d adds up to 100,000 altogether. - -Example - -[source,R] ----- -# Remember that we can make system calls from R. -# For instance, we can take the first 50000 lines of a file -# and store them into a new file called shortfile.csv -# To do this, we use the "system" command in R. -# it basically enables us to run terminal commands -# while we are still working in R. - -# This is an especially handy technique, -# because the operating system itself is much faster than R. - -system("head -n50000 /depot/statclass/data/dataexpo2009/2005.csv >shortfile.csv") - -# Now we can read this (much shorter!) file into R. - -myDF <- read.csv("shortfile.csv") - -# It has data about only 49,999 flights because the header itself -# counts as one of the 50,000 lines that we extracted. - -dim(myDF) - -# We can check to make sure that the read.csv worked, -# by examining the head of myDF: - -head(myDF) - -# Within myDF, we can break the data into pieces, -# according to (say) the Origin airport. -# The split command can easily do that for us. -# We give the split command 2 pieces of data: -# 1. The data that should be split, and -# 2. The way that the data is classified into pieces. -# So, for instance, we can split the DepDelays -# into pieces, based on the Origin. - -myresult <- split(myDF$DepDelay, myDF$Origin) - -# If we check the length of the result, it is 93: - -length(myresult) - -# because there are DepDelays from 93 airports. - -# The type of data is a "list". - -class(myresult) - -# We have not (yet) worked with lists, -# but they are a lot like data frames. -# The difference is that each column can have a different length. 
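
# As a quick toy illustration (the names toy_list, a, and b below are
# made up just for this aside, and are not part of the flight data):
# a list may hold pieces of different lengths, which a data frame cannot.

toy_list <- list(a = c(10, 20, 30), b = c("x", "y"))

# the lengths of the two pieces are 3 and 2, respectively:

sapply(toy_list, length)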
- -# For example, here are the first six columns -# of the list: - -head(myresult) - -# The flights to Albuquerque are found in the second column: - -myresult$ABQ - -# or we can get this data by just asking directly for the second column, -# without knowing the name of the column: - -myresult[[2]] - -# Now we can use the power of the apply functions that R provides. -# You are already familiar with the tapply function. -# Another very commonly used apply function is called "sapply". - -# We use sapply to apply a function to each part of a collection of data. - -# For example, remember that myresult has 93 parts: - -length(myresult) - -# We can take the mean of the data in each element of myresult -# by applying the function "mean" to each element, as follows: - -sapply(myresult, mean) - -# Unfortunately, many of the results are NA's, so we can use na.rm=T - -sapply(myresult, mean, na.rm=T) - -# We can apply many functions to myresult in this way. - -# For instance, here is the variance of each part of the data in myresult: - -sapply(myresult, var, na.rm=T) - -# or the standard deviation: - -sapply(myresult, sd, na.rm=T) - -# Here is the number of flights from each Origin airport: - -sapply(myresult, length) - -# If we add up the number of flights, we better get 49,999: - -sum(sapply(myresult, length)) - -# It is worthwhile to experience with sapply. -# For instance, for something fun to try, -# you can (simultaneously) make a plot of the DepDelays -# from each of the 93 airports, as follows: - -sapply(myresult, plot) - -# This runs the "plot" function on each piece of data, -# in other words, on the data from each Origin airport. - -# You can see the first 6 DepDelays from each Origin airport, as follows: - -sapply(myresult, head) - -# This is taking the "head" of each part of the data. ----- - -== Project 17 Solutions - -[source,R] ----- -# 1a. We first store the first 100,000 donations into a file -# called shortfile.txt -# using the system command - -system("head -n100000 /depot/statclass/data/election2018/itcont.txt >~/shortfile.txt") - -# 1b. Now we import this data into the read.csv file - -myDF <- read.csv("~/shortfile.txt", header=F, sep="|") - -# 1c. The resulting data frame has 100000 rows and 21 columns, as it should! - -dim(myDF) - -# 2a. Now we split the data for the donations according to the State -# from which the donation was given - -myresult <- split(myDF$V15, myDF$V10) - -# 2b. We check the names of myresult: - -names(myresult) - -# and the first element of the list does not have a name. -# so we give it a name, for instance, by writing: - -names(myresult)[1] <- "unknown" - -# 3a. The mean donation amount, from each state, -# can be found using the sapply command: - -sapply(myresult, mean, na.rm=T) - -# 3b. The mean donation from Indiana can be found -# by extracting the entry with the name "IN" - -sapply(myresult, mean, na.rm=T)["IN"] - -# and we get: IN: 367.914678899083 - -# 3c. The standard deviation of the donation amount for each state is: - -sapply(myresult, sd, na.rm=T) - -# 3d. The number of donations per state can be found by -# checking the length of the vector of donations from each state: - -sapply(myresult, length) - -# 3e. 
For our sanity check, we see that yes, indeed, -# the total number of donations is 100,000: - -sum(sapply(myresult, length)) - ----- - - -== Project 18 - -Here is the Project 18 template: - -https://raw.githubusercontent.com/TheDataMine/STAT-19000/master/Assignments/hw18.ipynb - -Consider the election data stored at: `/depot/statclass/data/election2018/itcont.txt` - -The data set is very large. You might choose to analyze a smaller portion of the data initially, and then to run your code on the full data set, once you have the code working correctly. - -Sometimes there will be warnings in Jupyter Notebooks, and you need to scroll past the warnings, to see the results of your analysis. This is a known issue with Jupyter Notebooks, and other people are experiencing it too: - -https://github.com/IRkernel/IRkernel/issues/590 - -Recall that the data dictionary for the data is found here: - -https://www.fec.gov/campaign-finance-data/contributions-individuals-file-description - -Question 1 - -a. The first column contains the "Filer identification number" for various committees. Which of these committees received the largest monetary amount of donations? - -b. Use the tapply function to make a matrix whose rows correspond to states, whose columns correspond to the "filer identification numbers" of committees, and whose entries contain the total amount of the donations given to the committees, by donors from each individual state. (Hint: Wrap the states and the filer identification numbers into a list.) Print the block of first 10 rows and 10 columns, so that the TA's can see the results of your work. - -Question 2 - -For this question, be sure to take into account the city and state (together). - -a. Identify the six cities that made the largest number of donations. - -b. Identify the six cities that made the largest monetary amount of funding donated. - -Question 3 - -a. Split the data (using the split command) about the donations, according to the day when the transaction was made. Once this split is accomplished, use the sapply function to find the following: - -b. On which day was the total monetary amount of donations the largest? - -c. On which day was the largest number of donations made? - - -[source,R] ----- -# Examples that might help with Project 18 (but are using the airline data set) -# We can read in the 2005 flight data: -myDF <- read.csv("/depot/statclass/data/dataexpo2009/2005.csv") -# and verify that we got it read in properly, using the head: -head(myDF) -# We can find the mean DepDelays, according to the Origin and Destination (simultaneously). -# This puts the Origins on the rows and the Destinations on the columns. -tapply(myDF$DepDelay, list(myDF$Origin, myDF$Dest), mean, na.rm=T) -# If you just want to see the first 10 rows and columns, -# you can save the results to a variable: -myresult <- tapply(myDF$DepDelay, list(myDF$Origin, myDF$Dest), mean, na.rm=T) -# and then load the rows and columns that you want to see: -myresult[1:10,1:10] -# Many are NA because you can't always get from one city to another. -# You can lookup specify Origins and Destinations as follows: -myresult[c("DEN","ORD","JFK"),c("BOS","IAD","ATL")] -# Those are flights from Origin "DEN" or "ORD" or "JFK" to Destinations "BOS" or "IAD" or "ATL" -# Here is another example: -# We can split all of the data about the DepDelays, according to the date. 
-# To do this, I first need to make a column that contains the dates, -# since the airport data doesn't have such a column (yet): -myDF$completedates <- paste(myDF$Month, myDF$DayofMonth, myDF$Year, sep="/") -# Then we split the DepDelays, according to the dates: -mydelays <- split(myDF$DepDelay, myDF$completedates) -# This gives us a list: -class(mydelays) -# Of course the length is 365, because there are 365 days per year: -length(mydelays) -# Here are the delays from Christmas Day: -mydelays["12/25/2005"] -# Now we can easily use the sapply function on -# the DepDelay data, which has already been grouped according to the days. -# Here is the mean DepDelay on each day: -sapply(mydelays, mean, na.rm=T) -# Here is the standard deviation of the DepDelay, on each day: -sapply(mydelays, sd, na.rm=T) -# Here is the length of each piece of the data, -# i.e., the number of pieces of data per day. -# (This is obviously equal to the number of flights per day too, -# because each flights has *some kind* of delay!) -sapply(mydelays, length) -# Project 18 Solutions: -# 1a. The committee C00401224 received $565007473 in donations altogether. -tail(sort(tapply(myDF$V15, myDF$V1, sum, na.rm=T))) -# Here are the top six committees, according to the total monetary donations: -# C00000935 109336606 -# C00571703 114336858 -# C00003418 116712977 -# C00484642 130390881 -# C00504530 133582635 -# C00401224 565007473 - -# 1b. We first build a matrix with the data from the states (column 10) on the rows -# and the data from the committees (column 1) on the columns. -# Each entry have the analogous sum of the sum of the donations. -myresult <- tapply(myDF$V15, list(myDF$V10, myDF$V1), sum, na.rm=T) -# Now we display the results of the first 10 rows and 10 columns: -myresult[1:10,1:10] -# C00000059 C00000422 C00000638 C00000729 C00000885 C00000901 C00000935 C00000984 C00001016 C00001180 -# NA NA NA NA NA NA 174182 NA NA NA -# AA NA NA NA NA NA NA 15336 NA NA NA -# AE NA NA NA NA NA NA 13122 NA NA NA -# AK NA 5148 NA 4384 1985 23135 175850 NA 8674 NA -# AL NA 7152 NA 9722 1868 103106 407595 NA 13518 NA -# AP NA NA NA NA NA NA 4705 NA NA NA -# AR 420 9994 NA 5750 1406 13910 183457 5000 12730 NA -# AS NA NA NA NA NA NA NA NA NA NA -# AZ NA 10074 NA 26778 615 17040 1223310 NA 12488 NA -# CA NA 89498 NA 41752 31705 108253 28039517 5000 256676 NA - -# 2a. We paste together the city and state data using the paste function. -# Then we tabulate the number of such donations, according to these city-state pairs. -# Finally, we sort these counts and print the six largest ones, using the tail function. -tail(sort(table(paste(myDF$V9,myDF$V10)))) - -# 2b. We paste together the city and state data using the paste function. -# Then we add the monetary amount of the donations (from column 15), -# according to these city-state pairs. -# Finally, we sort these total monetary amounts and -# print the six largest ones, using the tail function. -tail(sort(tapply(myDF$V15,paste(myDF$V9,myDF$V10),sum,na.rm=T))) - -# 3a. We split the data about the donation amounts (from column 15), -# according to the day on which the donations were made. -myresult <- split(myDF$V15, myDF$V14) - -# 3b. Now we sum the monetary amount of the donations, for each day: -tail(sort(sapply(myresult, sum, na.rm=T))) - -# 3c. Alternatively, we see how many donations were made on each day, -# by finding the length of the vector that has the donations for that day, -# i.e., by finding how many donations there were for each day. 
-tail(sort(sapply(myresult, length))) - ----- - -== Project 19 - -There was no Project 19 - - -== Project 20 - -Please submit your answers, when you are finished, using GitHub. We put an RMarkdown file into your individual GitHub accounts, for this purpose. - -Notes about scraping data: - -As a gentle reminder about how to access RStudio: - -Log on to Scholar: - -https://desktop.scholar.rcac.purdue.edu - -(or use the ThinLinc client on your computer if you installed it!) - -open the terminal on Scholar and type: - -[source,bash] ----- -module load gcc/5.2.0 -module load rstudio -rstudio & ----- - -Please remember to install and load the XML and the RCurl libraries. - -Using RStudio, we start to learn how to extract data from the web. - -Use the data from the Billboard Hot 100 for question 1. - -Please use the data from the week you were born. For instance, if I solve question 1, I would use the data located here: - -https://www.billboard.com/charts/hot-100/1976-10-13 - -Question 1 - -On the Hot 100 chart, from the day of your birth: - -a. Extract the titles of the songs ranked #2 through #100. - -b. Extract the artists for those 99 songs. - -c. Extract the title of the number 1 song for that day. - -d. Extract the artist for the number 1 song for that day. - -Question 2 - -a. Extract the city where the National Park property for Catoctin Mountain is located. This data is found at: `https://www.nps.gov/cato/index.htm` or in the file: `/depot/statclass/data/parks/cato.htm` - -b. Extract the state where Catoctin Mountain is located. - -c. Extract the zip code where Catoctin Mountain is located. - -Question 3 - -a. Identify three potential websites that you are interested to try to scrape yourself, during the upcoming seminars. Look for websites with data that is (relatively) easy to scrape, for instance: Systematic URL’s that are easy to understand; (relative) consistency in how the data is stored; and make sure that the data is embedded in the page, rather than in csv files that are already prepared for download. (We want to actually scrape some data.) - -b. For each of the three websites that you identified, give a very brief description of the kind of data that you want to scrape. - -== Project 20 Billboard Example - -[source,R] ----- -install.packages("XML") -library(XML) - -# Considering the songs and artists who sang popular songs at the time of my birthday in 1976, we can scrape some data from Billboard Hot 100 chart - -# Here are the songs titles #2 through #100 from my birthday - -# Please notice the double underscore before title: - -xpathSApply(htmlParse(getURL("https://www.billboard.com/charts/hot-100/1976-10-13")), -"//*/div[@class='chart-list-item__title']", xmlValue) - -# Here are the artists of the songs #2 through #100 from my birthday - -# Please notice the double underscore before artist: - -xpathSApply(htmlParse(getURL("https://www.billboard.com/charts/hot-100/1976-10-13")), -"//*/div[@class='chart-list-item__artist']", xmlValue) ----- - -== Project 20 National Park Service Example - -[source,R] ----- - -# We will use the XML package to parse html (or XML) data -install.packages("XML") -library(XML) - -# and the RCurl package if you want to pull the data directly from the web: -install.packages("RCurl") -library(RCurl) - -# To see the list of the parks, we can go here: -# https://www.nps.gov/findapark/index.htm -# if you use Control-U -# (i.e., the Control Key and the letter U Key at once) -# then you can see the code -# for the way that the webpage is created. 
- -# You can use Firefox to open any of the files -# with the data from the state parks; -# they are all found inside this directory: -# /depot/statclass/data/parks/ - -######################################## -# To study a specific park, -# we look at the source for the Abraham Lincoln Birthplace: -# https://www.nps.gov/abli/index.htm -# We load that webpage in a browser and then type Control-U - -# You search the code in a page with Control-F in Firefox - -# Here is the name of the Abraham Lincoln Birthplace: -xpathSApply(htmlParse(getURL("https://www.nps.gov/abli/index.htm")), "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue) - -# Here is the street address: -xpathSApply(htmlParse(getURL("https://www.nps.gov/abli/index.htm")), "//*/span[@itemprop='streetAddress']", xmlValue) - -# Alternatively, we can also do this with the file itself, -# instead of pulling the data from the web: - -xpathSApply(htmlParse("/depot/statclass/data/parks/abli.htm"), "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue) - -xpathSApply(htmlParse("/depot/statclass/data/parks/abli.htm"), "//*/span[@itemprop='streetAddress']", xmlValue) ----- - -== Project 20 Answers - -[source,R] ----- - -install.packages("XML") -library(XML) - -# 1a. here are the song titles, 2 through 100, from (for instance) January 20, 1990: -# but students should use their OWN BIRTHDAYS for this question. -xpathSApply(htmlParse(getURL("https://www.billboard.com/charts/hot-100/2000-01-20")), -"//*/div[@class='chart-list-item__title']", xmlValue) - -# 1b. here are the artists of the songs 2 through 100: -xpathSApply(htmlParse(getURL("https://www.billboard.com/charts/hot-100/2000-01-20")), - "//*/div[@class='chart-list-item__artist']", xmlValue) - -# 1c. here is the title of the number 1 song from that week: -xpathSApply(htmlParse(getURL("https://www.billboard.com/charts/hot-100/2000-01-20")), - "//*/div[@class='chart-number-one__title']", xmlValue) - -# 1d. here is the artist for the number 1 song from that week: -xpathSApply(htmlParse(getURL("https://www.billboard.com/charts/hot-100/2000-01-20")), - "//*/div[@class='chart-number-one__artist']", xmlValue) - -# 2a. Here is the city: -xpathSApply(htmlParse(getURL("https://www.nps.gov/cato/index.htm")), - "//*/span[@itemprop='addressLocality']", xmlValue) -# alternatively: -xpathSApply(htmlParse("/depot/statclass/data/parks/cato.htm"), - "//*/span[@itemprop='addressLocality']", xmlValue) - -# 2b. Here is the state: -xpathSApply(htmlParse(getURL("https://www.nps.gov/cato/index.htm")), - "//*/span[@itemprop='addressRegion']", xmlValue) -# alternatively: -xpathSApply(htmlParse("/depot/statclass/data/parks/cato.htm"), - "//*/span[@itemprop='addressRegion']", xmlValue) - -# 2c. Here is the zip: -xpathSApply(htmlParse(getURL("https://www.nps.gov/cato/index.htm")), - "//*/span[@itemprop='postalCode']", xmlValue) -# alternatively: -xpathSApply(htmlParse("/depot/statclass/data/parks/cato.htm"), - "//*/span[@itemprop='postalCode']", xmlValue) - -# 3a, 3b answers will vary - ----- - - -== Project 21 - - -Please use this template to submit Project 21: - -https://raw.githubusercontent.com/TheDataMine/STAT-19000/master/Assignments/hw21.Rmd - -This project is supposed to be an easy modification of the project example, -since it is almost time for Spring Break! - -1. Modify the NPS example to extract the city location for every National Park. - -2. Same question, for the state location for every National Park. - -3. Same question, for the zip code for every National Park. 
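One possible approach (just a sketch, intentionally close to the Project 21 example below) is to wrap the XPath queries from the Project 20 answers, which rely on the `addressLocality`, `addressRegion`, and `postalCode` itemprop attributes, into a small helper, and then `sapply` that helper over the 4-letter park codes. The names `myfieldextractor` and `mygoodparks` are placeholders for illustration: `mygoodparks` is assumed to be the cleaned vector of 4-letter codes that is built in the example below.

[source,R]
----
library(XML)
library(RCurl)

# helper: build the URL from a 4-letter park code x,
# and extract the span whose itemprop attribute is myfield
myfieldextractor <- function(x, myfield) {
  xpathSApply(htmlParse(getURL(paste0("https://www.nps.gov/", x, "/index.htm"))),
              paste0("//*/span[@itemprop='", myfield, "']"), xmlValue)
}

# 1. city, 2. state, 3. zip code, for every park:
mycities <- sapply(mygoodparks, myfieldextractor, myfield = "addressLocality")
mystates <- sapply(mygoodparks, myfieldextractor, myfield = "addressRegion")
myzips   <- sapply(mygoodparks, myfieldextractor, myfield = "postalCode")
----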
- -Note: Do not worry if some of the results have extra spaces. We can deal with that later! - - - -== Project 21 Example: - -[source,R] ----- - -library(RCurl) -library(XML) - -# The webpage for the National Park Service includes -# only a little information about every NPS property: -# https://www.nps.gov/findapark/index.htm -# Importantly, it has the 4-letter codes for each property. - -# If you type Control-U, then you can see the source for the page. -# Scroll down, and you will see on -# lines 210 through 753 these 4-letter codes -# (It might not be exactly lines 210 through 753 because the NPS -# modifies its webpages, just like any organization does!) - -# Each such NPS property has the 4-letter code as -# an attribute to one of the XML tags. They are all found inside -# of a "select" tag, -# and then inside an "optgroup" tag, -# and then inside an "option" tag. -# You extract this XML value using the xmlGetAttr, like this: - -myparks <- xpathSApply(htmlParse(getURL("https://www.nps.gov/findapark/index.htm")), "//*/div/select/optgroup/option", xmlGetAttr, "value") - -# and then we see the full listing of all 497 of these 4-digit codes here: - -myparks - -# Last week, we already learned how to extract the street address of a park. -# For instance, this is the name of the Abraham Lincoln Birthplace: - -xpathSApply(htmlParse(getURL("https://www.nps.gov/abli/index.htm")), - "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue) - -# Similarly, this is the name of Catoctin Mountain: - -xpathSApply(htmlParse(getURL("https://www.nps.gov/cato/index.htm")), - "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue) - -# Here's the name of the Great Smoky Mountains; -# we just change "abli" or "cato" to "grsm" and we have it! - -xpathSApply(htmlParse(getURL("https://www.nps.gov/grsm/index.htm")), - "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue) - -# In general, we could paste in the 4-digit letter of the park, like this: - -x <- "abli" -xpathSApply(htmlParse(getURL( paste0("https://www.nps.gov/", x, "/index.htm"))), - "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue) - -# where the value of "x" is the park's 4-digit abbreviation. -# Let's try to get these two park names simultaneously now. - -# We build a function to do so: - -mynameextractor <- function(x) {xpathSApply(htmlParse(getURL( paste0("https://www.nps.gov/", x, "/index.htm"))), - "//*/div[@id='HeroBanner']/div/div/div/a", xmlValue)} - -# and then we apply it to each of these 4-letter codes: - -sapply( c("abli", "cato", "grsm"), mynameextractor ) - -# One thing about scraping data from the web is that -# there are always "hiccups" in the process, -# i.e., there are always challenges. -# For instance, we have codes for "cbpo" and "foca" -# but those pages do not actually exist (yet). -# So we need to remove them from our list of 4-letter codes: - -mygoodparks <- myparks[(myparks != "cbpo")&(myparks != "foca")] - -# Now we are ready to apply our function to -# all the NPS properties. We do it first to the "head", -# just to make sure things are working: - -myresults <- sapply( head(mygoodparks), mynameextractor ) - -myresults - -# and if this worked, then we apply it to the full list of parks. -# P.S. Depending on your web connection, and how many -# students do this at one time, you might need to run -# this a few times. 
It did not work quite right for me -# on the first try, but that is the nature of websites, -# i.e., sometimes there are failures and/or service interruptions, -# but it should generally work in just a few minutes! - -myresults <- sapply( mygoodparks, mynameextractor ) - -# Finally, here are the names of all the park properties: - -myresults - - ----- - -== Project 22 - -Here is an *optional* Project 22. You don't need to do it, but if you choose to do it, we will count it as a replacement for your lowest previous project grade. - -In this folder on Scholar: `/depot/statclass/data/examples` there is a program called "challenge", so you can run it by typing in the terminal something like this: `/depot/statclass/data/examples/challenge 111` - -Here is the goal: You can try to make a program (in any language) that converts strings of digits to strings of letters, by substituting - -[source,bash] ----- -1 -> a -2 -> b -3 -> c -...... -26 -> z ----- - -Please notice that we do *not* say - -[source,bash] ----- -01 -> a ----- - -but rather, we say - -[source,bash] ----- -1 -> a ----- - -The program should print the number of ways to do this. - -So, for instance, if you type: - -`/depot/statclass/data/examples/challenge 111` - -It will return the number 3 because there are exactly 3 ways to decode the string 111, namely: - -[source,bash] ----- -ak -ka -aaa ----- - -Makes sense? Here is another example: - -`/depot/statclass/data/examples/challenge 15114` - -will return the number 6 because there are exactly 6 ways to decode the string 15114, namely: - -[source,bash] ----- -aeaad -aean -aekd -oaad -oan -okd ----- - -The challenge (again, only for bonus credit) is to write a program that will produce the same results as the program that I gave you. You are welcome to use any programming language. - -== Project 23 - -Here is the project. We will build on this project in the upcoming work that we will do in April. - -Recall that in Project 20, question 3ab, you identified some websites that you were interested to scrape. Pick only 1 of the websites that is of interest to you, and scrape at least 5 pieces of information from a few pages within that website. (I am being a little nebulous here, because I want you to have the freedom to explore!) For instance, you could pick IMDB as the website and scrape 3 pieces of information from 5 different movies. BUT you can pick any website. It does *NOT* need to be the IMDB website. You can do any website you like. That's the entire assignment for this week! If you are not able to do it for the sites that you mentioned in Project 20, then you can (instead) identify a different website to scrape. - -[source,R] ----- -# We recall that we can scrape information (which is stored in XML format) from the internet, using XPath. -# Remember that we load Scholar in the web interface and open a browser and use Control-U to see the XML code. -# Inside R, we first load the XML library and the RCurl library: - -library(XML) -library(RCurl) - -# Then we just download the webpage and we put the path to the desired web content into the notation of XPath. - -# We already gave some examples of how to scrape XML data from the web, -# back in Project 20 and Project 21. Please feel welcome to read those again and remind yourself. - -# Here are a few more examples to inspire you, about how to scrape and parse some XML code from the internet. - -##################################################### -# Example: IMDB (Internet Movie Database) -# We can scrape information about movies. 
For instance, IMDB is a popular movie website. -# The information about the movie Say Anything is given here: https://www.imdb.com/title/tt0098258/ -# This is Dr Ward's favorite movie, by the way! -# Here is the title and year, which are stored together in the same place. -xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")), - "//*/div[@class='title_wrapper']/h1", xmlValue) - - -# In this XML, if you only want the year 1989 in which the movie was made, -# but do not care about the title, then just go deeper, by -# also including the "span" and "a" tags too: - -xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")), - "//*/div[@class='title_wrapper']/h1/span/a", xmlValue) - -# Here is a completely different place in the XML to find the title: - -xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")), - "//*/div/div[@id='ratingWidget']/p/strong", xmlValue) - - - -# Here is the specific release date: - -xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")), - "//*/a[@title='See more release dates']", xmlValue) - - - -# We can try to extract the Director and the Writer. -# Cameron Crowe was both the Director and the Writer. -# If we do the following search, we get 3 results: -xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")), - "//*/div[@class='credit_summary_item']", xmlValue) -# So we could save this information in a vector, and just extract the first and second elements -v <- xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")), - "//*/div[@class='credit_summary_item']", xmlValue) -# Now we have the DIrector: -v[1] -# and the Writer: -v[2] -# in separate elements. -# There are other ways to do this, once we get more comfortable with XML -# but this is a good start! - - - -# The title is stored lots and lots of places in the webpage. -# It is also sometimes stored in an XML tag itself, rather than in the content of the page. -# For instance, search for the phrase: -# og:title -# in the code in your browser to see this. - -xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")), - "//*/meta[@property='og:title']", xmlGetAttr, "content") - - -# Here's another way, which is just 2 lines later, in the source code for the page. -# We just change "property" to "name" -# and we change "og:title" to "title" and we get the title and year again: - -xpathSApply(htmlParse(getURL("https://www.imdb.com/title/tt0098258/")), - "//*/meta[@name='title']", xmlGetAttr, "content") - -# These are just meant to be illustrative examples to try to help! Have fun! Explore! - - ----- - - - -== Project 24 - -We build on Project 23 as follows: - -[source,R] ----- -# In Project 23, we scraped a few elements of data from a website. -################################################## -# Wrap your code from Project 23 into a function, and then -# scrape at least 100,000 pieces of data from any website: your choice! -################################################## -# Here is an example of how to get started: -# First we load the needed libraries: -library(XML) -library(RCurl) -# Then we wrap our code into a function. 
-mytitlefunc <- function(x) { - xpathSApply(htmlParse(getURL(paste0("https://www.imdb.com/title/tt", x, "/"))), - "//*/div[@class='title_wrapper']/h1", xmlValue) -} - -# Notice that we replaced the website: -# https://www.imdb.com/title/tt0098258/ -# with (instead) some code to build the website as we go: -# paste0("https://www.imdb.com/title/tt", x, "/" -# This uses the value x as the number of the movie. -# Now we can run our function and extract the results for a movie: -mytitlefunc("0098258") -# Our function is vectorized, i.e., we can run it on a vector, -# and it will return the results for each individual movie, -# for instance: -mytitlefunc(c("0110000", "0110001", "0110002", "0110003")) -# We could try to run it on a sequence of numbers, but -# this will not quite work at first. -# For instance, if we try to run it on this sequence: -110000:110003 -# We see that these numbers are only 6 digits, but the URL expects to -# have a total of 7 digits to work. -# So we can use the string print function, -# which is also available in other languages too: -sprintf("%07d", 110000:110003) -# Here we have the "%" which means we are printing a variable, -# and the "0" means we should pad things with leading zeroes if needed, -# and the "7" means that we want 7 digits, and the "d" means digits. -# Now it will work on this input -mytitlefunc(sprintf("%07d", 110000:110003)) -# and we can even change this to (say) 100 pages at a time: -mytitlefunc(sprintf("%07d", 110000:110100)) ----- - -== Hint from Luke Francisco: - -Based on my experience with students during my office hours last week, I thought I would share some things to consider when you are looking for data to use on project 24 if you have not done so already. I should have posted this earlier, but it just now dawned on me that this would make a good Piazza post. - -When you are scraping large amounts of data from the web, you want to focus on replicability. If you look at Dr. Ward's previous two examples with the national parks data and the Billboard music data, you will see what I am talking about. There are several websites in each case (one webpage for each national park and one webpage for the Billboard top songs for each week). If you want to scrape all of this data you will need to give R all of the URL's in order to go find the data. What makes these two examples easy are that the webpages all have the exact same URL's except for one part. For example, the Billboard top songs all have the same URL except for a different date inserted. This allows you to make a vector of dates and insert each date into the URL. If this were not the case and the URL's were totally different for each week, you would have to find the URL for every week since 1980, which would be extremely laborious!!!!! - -You also want to make sure that each website is formatted similarly. Consider the addresses of the national parks - they were all found at the bottom of the page for the corresponding national park with the same HTML formatting. In the case of the Billboard data, the website for every week has the top songs entered in the exact same format - the only things that change are the song titles and artists, which are the data we are interested in. This means your code that pulls the songs on the Billboard charts from this week will also pull the songs on the Blackboard charts from 1980!!!!!! - -Consider this example I used in my office hours last week. Ken Pomeroy publishes statistics for all 353 men's division 1 college basketball teams. 
The data for the most recent season can be found at this link: - -https://kenpom.com/index.php?y=2019 - -If you look at the HTML code of the website and CTRL+F search for Purdue, you will notice that the data for each team, which is the data in each row of the table on the website, is entered with the exact same format. Better yet, change the last four digits of the URL to 2018 and CTRL+F for Purdue again in the HTML code. The data for the 2018 season for every team is also entered in the exact same format. Even better is that you can change the year at the end of the URL to any year after 2002 and you will find a wealth of similarly formatted data. This meets the two criteria: 1.) URL's containing data have extremely similar formats and 2.) Each webpage has identical HTML formatting. This is the type of data you should look for when finding data for your project as it will make your life a whole lot easier! - -Also keep in mind that you want to get data that you can analyze in some way for project 25! - -Sorry for being long, but I hope this was helpful and inspiring. Remember to work smarter, not harder! - -== Project 25 - -Question 1 - -a. Store the 100,000 pieces of data that you scraped in Project 24 into a data frame. - -b. Save that data frame in an xlsx file, for instance, using the write.xlsx function from the library "xlsx". - -Question 2 - -2a,2b,2c. Make 3 questions about the data that you assembled in Project 24. - -Question 3 - -3a,3b,3c. Answer the 3 questions from 2a,2b,2c by making 3 visualizations from the data that you assembled. Be sure to use best practices for data visualization. - -Refer to the selections from the texts: - -The Elements of Graphing Data by William S. Cleveland - -and Creating More Effective Graphs by Naomi B. Robbins - -These selections are archived online here: - -http://llc.stat.purdue.edu/ElementsOfGraphingData.pdf - -http://llc.stat.purdue.edu/CreatingMoreEffectiveGraphs.pdf - -Submit your project in RMarkdown. Please be sure to submit the .Rmd file and also the .xlsx file created in 1b. Of course the graders will be unable to run your code for 1a, because they do not want to scrape all of the data that you scraped. Instead, the graders want to use the data from question 1b, so be sure to submit the .Rmd file and the .xlsx file too. - -== Optional Project 1 - -Remind yourself how to run SQL queries in R, for instance, using the examples from Project 8. - -Question 1 - -Find the largest number of home runs (by an individual batter) each year. - -For instance: - -in 2014 a player hit 40 HR's, - -in 2015 a player hit 47 HR's, - -in 2016 a player hit 47 HR's, - -in 2017 a player hit 59 HR's, and - -in 2018 a player hit 48 HR's. - -(Yes, I have updated the data to include 2018!!) - -Question 2 - -Make a plot that shows this largest number of home runs per year (not just these 5 years, but the annual records back to 1871). - -Question 3 - -Create a question about baseball that you are interested in, and use a SQL query in R to answer the question. Put all of your R code into an RMarkdown file, and give some comments about your code, to explain your method of solution. Submit the RMarkdown (.Rmd) file, and also a pdf file the shows the output (including the code, your explanation, the picture from question 2 that displays the plot, etc.). - -== Optional Project 2 - -Recall how we can work with very large data sets (which are too large to import into R), by using UNIX. 
We did this in some of the earliest problem sets in STAT 19000, during the fall semester. - -Question 1 - -a. How many taxi cab rides occurred (altogether) during 2015? Do not give a breakdown by month. Give the total number of taxi cab rides for the full year 2015. (Hint: Remember to be careful about the headers at the top of each file.) - -b. Give the distribution of the number of passengers in the taxi cab rides throughout (all months of) the year 2015. Do not give a breakdown by month. Give the distribution across the full year 2015. - -Question 2 - -a. Across all years of the airline data, how many flights occurred on each airline? Which airline is the most popular overall, in terms of the number of flights? - -b. Across all years of the airline data, which flight path is the most popular? How many airplane trips occurred on that flight path? - -Question 3 - -Create a question about taxi cab rides or airline flights that you are interested in, and use UNIX to answer the question. - -Put all of your UNIX code into plain text file, and give some comments about your code, to explain your method of solution. Submit the plain text (.txt) file with your code (including your explanations). - -== Optional Project 3 - -Use R to analyze the election data from the 2018 election. Remember to use read.csv to read in the data, and use header=F (since there is no header) and use sep="|" since this symbol separates the data. - -Question 1 - -a. Identify the top 20 employers that donated the most amount of money (altogether). Some of these entries will be strange, e.g., blank entries, NA, self employed, etc. That is OK! - -b. Plot the largest 20 total amounts (from the 20 employers) on a dotchart, in order from largest (at the top) to smallest (at the bottom). - -Question 2 - -a. In which city/state is the average donation amount the largest? (Treat the city and state data together as a pair.) - -b. How many donations were given from this city/state pair? How large were the total amount of donations from this city/state pair? - -Question 3 - -Create a question about the 2018 election data that you are interested in, and use R to answer the question. - -Put all of your R code into an RMarkdown file, and give some comments about your code, to explain your method of solution. Submit the RMarkdown (.Rmd) file, and also a pdf file the shows the output (including the code, your explanation, the picture from question 1b that displays the plot, etc.). - -== Optional Project 4 - -Please submit your project in RMarkdown. - -Read the selection of The Elements of Graphing Data by William Cleveland, and the selection of Creating More Effective Graphs by Naomi Robbins. - -Also read the classic article "How to Display Data Badly" by Howard Wainer: - -http://www.jstor.org.ezproxy.lib.purdue.edu/stable/2683253 - -We referred to both of these in Project 25. - -Question 1 - -a. Find 3 visualizations from the Information Is Beautiful website (http://www.informationisbeautiful.net/) that do a BAD job of portraying data, according to the best practices in the selections mentioned above. Write 1/3 of a page (for each such visualization) about what is done poorly, i.e., write 1 single-spaced page total. - -b. Identify 3 excellent visualizations of data from the Information Is Beautiful website. Write 1/3 of a page (for each such visualization) about what is done well, i.e., write 1 single-spaced page total. 
- -Question 2 - -Consider the poster winner "Congestion in the Sky", from the 2009 Data Expo: http://stat-computing.org/dataexpo/2009/posters/ - -a. Describe at least 3 significant ways that this poster could be improved. For each of these 3 ways, write a 1/3 of a page constructive criticism, specifying what could be improved and how that aspect of the visualization could be done better, i.e., write 1 single-spaced page total. - -b. Which of the posters in the Data Expo 2009 do you think should be the winner? Why? (It is OK if you choose the poster that actually won, or any of the other posters.) Thoroughly justify your answer, using the techniques of effective data visualization, to justify your answer (write 1 single-spaced page total). - -(This entire assignment is 4 single-spaced pages.) - diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project01.adoc deleted file mode 100644 index 1d69e1b48..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project01.adoc +++ /dev/null @@ -1,169 +0,0 @@ -= STAT 19000: Project 1 -- Fall 2020 - -**Motivation:** In this project we are going to jump head first into The Data Mine. We will load datasets into the R environment, and introduce some core programming concepts like variables, vectors, types, etc. As we will be "living" primarily in an IDE called RStudio, we will take some time to learn how to connect to it, configure it, and run code. - -**Context:** This is our first project as a part of The Data Mine. We will get situated, configure the environment we will be using throughout our time with The Data Mine, and jump straight into working with data! - -**Scope:** r, rstudio, Scholar - -.Learning Objectives -**** -- Use Jupyter Notebook to run Python code and create Markdown text. -- Use RStudio to run Python code and compile your final PDF. -- Gain exposure to Python control flow and reading external data. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/open_food_facts/openfoodfacts.tsv` - -== Questions - -=== Question 1 - -Navigate to https://notebook.scholar.rcac.purdue.edu/ and sign in with your Purdue credentials (_without_ BoilerKey). This is an instance of Jupyter Notebook. The main screen will show a series of files and folders that are in your `$HOME` directory. Create a new notebook by clicking on menu:New[f2020-s2021]. - -Change the name of your notebook to "LASTNAME_FIRSTNAME_project01" where "LASTNAME" is your family name, and "FIRSTNAME" is your given name. Try to export your notebook (using the menu:[File] dropdown menu, choosing the option menu:[Download as]), what format options (for example, `.pdf`) are available to you? - -[NOTE] -`f2020-s2021` is the name of our course notebook kernel. A notebook kernel is an engine that runs code in a notebook. ipython kernels run Python code. `f2020-s2021` is an ipython kernel that we've created for our course Python environment, which contains a variety of compatible, pre-installed packages for you to use. When you select `f2020-s2021` as your kernel, all of the packages in our course environment are automatically made available to you. 
- -https://mediaspace.itap.purdue.edu/id/1_4g2lwx5g[Click here for video] - -.Items to submit -==== -- A list of export format options. -==== - -=== Question 2 - -Each "box" in a Jupyter Notebook is called a _cell_. There are two primary types of cells: code, and markdown. By default, a cell will be a code cell. Place the following Python code inside the first cell, and run the cell. What is the output? - -[source,python] ----- -from thedatamine import hello_datamine -hello_datamine() ----- - -[TIP] -You can run the code in the currently selected cell by using the GUI (the buttons), as well as by pressing kbd:[Ctrl+Enter] or kbd:[Ctrl+Return]. - -.Items to submit -==== -- Output from running the provided code. -==== - -=== Question 3 - -Jupyter Notebooks allow you to easily pull up documentation, similar to `?function` in R. To do so, use the `help` function, like this: `help(my_function)`. What is the output from running the help function on `hello_datamine`? Can you modify the code from question (2) to print a customized message? Create a new _markdown_ cell and explain what you did to the code from question (2) to make the message customized. - -[IMPORTANT] -==== -Some Jupyter-only methods to do this are: - -- Click on the function of interest and type `Shift+Tab` or `Shift+Tab+Tab`. -- Run `function?`, for example, `print?`. -==== - -[IMPORTANT] -You can also see the source code of a function in a Jupyter Notebook by typing `function??`, for example, `print??`. - -.Items to submit -==== -- Output from running the `help` function on `hello_datamine`. -- Modified code from question (2) that prints a customized message. -==== - -=== Question 4 - -At this point in time, you've now got the basics of running Python code in Jupyter Notebooks. There is really not a whole lot more to it. For this class, however, we will continue to create RMarkdown documents in addition to the compiled PDFs. You are welcome to use Jupyter Notebooks for personal projects or for testing things out, however, we will still require an RMarkdown file (.Rmd), PDF (generated from the RMarkdown file), and .py file (containing your python code). For example, please move your solutions from Questions 1, 2, 3 from Jupyter Notebooks over to RMarkdown (we discuss RMarkdown below). Let's learn how to run Python code chunks in RMarkdown. - -Sign in to https://rstudio.scholar.rcac.purdue.edu (_with_ BoilerKey). Projects in The Data Mine should all be submitted using our template found https://raw.githubusercontent.com/TheDataMine/the-examples-book/master/files/project_template.Rmd[here] or on Scholar (`/class/datamine/apps/templates/project_template.Rmd`). - -Open the project template and save it into your home directory, in a new RMarkdown file named `project01.Rmd`. Prior to running any Python code, run `datamine_py()` in the R console, just like you did at the beginning of every project from the first semester. - -Code chunks are parts of the RMarkdown file that contains code. You can identify what type of code a code chunk contains by looking at the _engine_ in the curly braces "{" and "}". As you can see, it is possible to mix and match different languages just by changing the engine. Move the solutions for questions 1-3 to your `project01.Rmd`. Make sure to place all Python code in `python` code chunks. Run the `python` code chunks to ensure you get the same results as you got when running the Python code in a Jupyter Notebook. 
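For reference, the engine is just the name inside the curly braces at the top of each chunk. Here is a minimal sketch of how the two kinds of chunks look inside an `.Rmd` file (the chunk bodies are placeholders only):

[source,markdown]
----
```{r}
# R code goes in a chunk whose engine is r
```

```{python}
# Python code goes in a chunk whose engine is python
```
----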
- -[NOTE] -Make sure to run `datamine_py()` in the R console prior to attempting to run any Python code. - -[TIP] -The end result of the `project01.Rmd` should look _similar_ to https://raw.githubusercontent.com/TheDataMine/the-examples-book/master/files/example02.Rmd[this]. - -https://mediaspace.itap.purdue.edu/id/1_nhkygxg9[Click here for video] - -https://mediaspace.itap.purdue.edu/id/1_tdz3wmim[Click here for video] - -.Items to submit -==== -- `project01.Rmd` with the solutions from questions 1-3 (including any Python code in `python` code chunks). -==== - -=== Question 5 - -It is not a Data Mine project without data! [Here] are some examples of reading in data line by line using the `csv` package. How many columns are in the following dataset: `/class/datamine/data/open_food_facts/openfoodfacts.tsv`? Print the first row, the number of columns, and then exit the loop after the first iteration using the `break` keyword. - -[TIP] -You can get the number of elements in a list by using the `len` method. For example: `len(my_list)`. - -[TIP] -You can use the `break` keyword to exit a loop. As soon as `break` is executed, the loop is exited and the code immediately following the loop is run. - -[source,python] ----- -for my_row in my_csv_reader: - print(my_row) - break -print("Exited loop as soon as 'break' was run.") ----- - -[TIP] -`'\t'` represents a tab in Python. - -https://mediaspace.itap.purdue.edu/id/1_ck74xlzq[Click here for video] - -[IMPORTANT] -If you get a Dtype warning, feel free to just ignore it. - -Relevant topics:* [for loops], [break], [print] - -.Items to submit -==== -- Python code used to solve this problem. -- The first row printed, and the number of columns printed. -==== - -=== Question 6 (optional) - -Unlike in R, where many of the tools you need are built-in (`read.csv`, data.frames, etc.), in Python, you will need to rely on packages like `numpy` and `pandas` to do the bulk of your data science work. - -In R it would be really easy to find the mean of the 151st column, `caffeine_100g`: - -[source,r] ----- -myDF <- read.csv("/class/datamine/data/open_food_facts/openfoodfacts.tsv", sep="\t", quote="") -mean(myDF$caffeine_100g, na.rm=T) # 2.075503 ----- - -If you were to try to modify our loop from question (5) to do the same thing, you will run into a myriad of issues, just to try and get the mean of a column. Luckily, it is easy to do using `pandas`: - -[source,python] ----- -import pandas as pd -myDF = pd.read_csv("/class/datamine/data/open_food_facts/openfoodfacts.tsv", sep="\t") -myDF["caffeine_100g"].mean() # 2.0755028571428573 ----- - -Take a look at some of the methods you can perform using pandas https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats[here]. Perform an interesting calculation in R, and replicate your work using `pandas`. Which did you prefer, Python or R? - -https://mediaspace.itap.purdue.edu/id/1_ybx1iukd[Click here for video] - -.Items to submit -==== -- R code used to solve the problem. -- Python code used to solve the problem. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project02.adoc deleted file mode 100644 index f37e3bca0..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project02.adoc +++ /dev/null @@ -1,142 +0,0 @@ -= STAT 19000: Project 2 -- Fall 2020 - -*Introduction to R using 84.51 examples* - -++++ - -++++ - -*Introduction to R using NYC Yellow Taxi Cab examples* - -++++ - -++++ - -**Motivation:** The R environment is a powerful tool to perform data analysis. R is a tool that is often compared to Python. Both have their advantages and disadvantages, and both are worth learning. In this project we will dive in head first and learn the basics while solving data-driven problems. - -**Context:** Last project we set the stage for the rest of the semester. We got some familiarity with our project templates, and modified and ran some R code. In this project, we will continue to use R within RStudio to solve problems. Soon you will see how powerful R is and why it is often a more effective tool to use than spreadsheets. - -**Scope:** r, vectors, indexing, recycling - -.Learning Objectives -**** -- List the differences between lists, vectors, factors, and data.frames, and when to use each. -- Explain and demonstrate: positional, named, and logical indexing. -- Read and write basic (csv) data. -- Explain what "recycling" is in R and predict behavior of provided statements. -- Identify good and bad aspects of simple plots. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/disney/metadata.csv` - -== Questions - -=== Question 1 - -Use the `read.csv` function to load `/class/datamine/data/disney/metadata.csv` into a `data.frame` called `myDF`. Note that `read.csv` _by default_ loads data into a `data.frame`. (We will learn more about the idea of a `data.frame`, but for now, just think of it like a spreadsheet, in which each column has the same type of data.) Print the first few rows of `myDF` using the `head` function (as in Project 1, Question 7). - -.Items to submit -==== -- R code used to solve the problem in an R code chunk. -==== - -=== Question 2 - -We've provided you with R code below that will extract the column `WDWMAXTEMP` of `myDF` into a vector. What is the 1st value in the vector? What is the 50th value in the vector? What type of data is in the vector? (For this last question, use the `typeof` function to find the type of data.) - -[source,r] ----- -our_vec <- myDF$WDWMAXTEMP ----- - -.Items to submit -==== -- R code used to solve the problem in an R code chunk. -- The values of the first and 50th elements in the vector. -- The type of data in the vector (using the `typeof` function). -==== - -=== Question 3 - -Use the `head` function to create a vector called `first50` that contains the first 50 values of the vector `our_vec`. Use the `tail` function to create a vector called `last50` that contains the last 50 values of the vector `our_vec`. - -You can access many elements in a vector at the same time. To demonstrate this, create a vector called `mymix` that contains the sum of each element of `first50` added to the corresponding element of `last50`. - -.Items to submit -==== -- R code used to solve this problem. -- The contents of each of the three vectors. -==== - -=== Question 4 - -In (3), we were able to rapidly add values together from two different vectors.
Both vectors were the same size, hence, it was obvious which elements in each vector were added together. - -Create a new vector called `hot` which contains only the values of `myDF$WDWMAXTEMP` which are greater than or equal to 80 (our vector contains max temperatures for days at Disney World). How many elements are in `hot`? - -Calculate the sum of `hot` and `first50`. Do we get a warning? Read https://excelkingdom.blogspot.com/2018/01/what-recycling-of-vector-elements-in-r.html[this] and then explain what is going on. - -.Items to submit -==== -- R code used to solve this problem. -- 1-2 sentences explaining what is happening when we are adding two vectors of different lengths. -==== - -=== Question 5 - -Plot the `WDWMAXTEMP` vector from `myDF`. - -.Items to submit -==== -- R code used to solve this problem. -- Plot of the `WDWMAXTEMP` vector from `myDF`. -==== - -=== Question 6 - -The following three pieces of code each create a graphic. The first two graphics are created using only core R functions. The third graphic is created using a package called `ggplot`. We will learn more about all of these things later on. For now, pick your favorite graphic, and write 1-2 sentences explaining why it is your favorite, what could be improved, and include any interesting observations (if any). - -[source,r] ----- -dat <- table(myDF$SEASON) -dotchart(dat, main="Seasons", xlab="Number of Days in Each Season") ----- - -image:stat19000project2figure1.png["A plot resembling an abacus, where holiday is listed in a vertical list and the corresponding number of days are on that horizontal line, the further right indicating more days for that holiday classification. The clear winner is Spring.", loading=lazy] - -[source,r] ----- -dat <- tapply(myDF$WDWMEANTEMP, myDF$DAYOFYEAR, mean, na.rm=T) -seasons <- tapply(myDF$SEASON, myDF$DAYOFYEAR, function(x) unique(x)[1]) -pal <- c("#4E79A7", "#F28E2B", "#A0CBE8", "#FFBE7D", "#59A14F", "#8CD17D", "#B6992D", "#F1CE63", "#499894", "#86BCB6", "#E15759", "#FF9D9A", "#79706E", "#BAB0AC", "#1170aa", "#B07AA1") -colors <- factor(seasons) -levels(colors) <- pal -par(oma=c(7,0,0,0), xpd=NA) -barplot(dat, main="Average Temperature", xlab="Jan 1 (Day 0) - Dec 31 (Day 365)", ylab="Degrees in Fahrenheit", col=as.factor(colors), border = NA, space=0) -legend(0, -30, legend=levels(factor(seasons)), lwd=5, col=pal, ncol=3, cex=0.8, box.col=NA) ----- - -image:stat19000project2figure2.png["A filled line plot with colors corresponding to the predominant holiday at the time.", loading=lazy] - -[source,r] ----- -library(ggplot2) -library(tidyverse) -summary_temperatures <- myDF %>% - select(MONTHOFYEAR,WDWMAXTEMP:WDWMEANTEMP) %>% - group_by(MONTHOFYEAR) %>% - summarise_all(mean, na.rm=T) -ggplot(summary_temperatures, aes(x=MONTHOFYEAR)) + - geom_ribbon(aes(ymin = WDWMINTEMP, ymax = WDWMAXTEMP), fill = "#ceb888", alpha=.5) + - geom_line(aes(y = WDWMEANTEMP), col="#5D8AA8") + - geom_point(aes(y = WDWMEANTEMP), pch=21,fill = "#5D8AA8", size=2) + - theme_classic() + - labs(x = 'Month', y = 'Temperature', title = 'Average temperature range' ) + - scale_x_continuous(breaks=1:12, labels=month.abb) ----- - -image:stat19000project2figure3.png["Line plot of temperatures over months including the range. 
Displays a very clear arch, highs in July-August at an average of 82 degrees Fahrenheit and lows at an average of 52 degrees Fahrenheit in January.", loading=lazy] diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project03.adoc deleted file mode 100644 index 329ed3e4b..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project03.adoc +++ /dev/null @@ -1,125 +0,0 @@ -= STAT 19000: Project 3 -- Fall 2020 - -**Motivation:** `data.frame`s are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a `data.frame`. - -**Context:** In the previous project we got our feet wet, and ran our first R code, and learned about accessing data inside vectors. In this project we will continue to reinforce what we've already learned and introduce a new, flexible data structure called `data.frame`s. - -**Scope:** r, data.frames, recycling, factors - -.Learning Objectives -**** -- Explain what "recycling" is in R and predict behavior of provided statements. -- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- List the differences between lists, vectors, factors, and data.frames, and when to use each. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/disney` - -== Questions - -=== Question 1 - -Read the dataset `/class/datamine/data/disney/splash_mountain.csv` into a data.frame called `splash_mountain`. How many columns, or features are in each dataset? How many rows or observations? - -.Items to submit -==== -- R code used to solve the problem. -- How many columns or features in each dataset? -==== - -=== Question 2 - -Splash Mountain is a fan favorite ride at Disney World's Magic Kingdom theme park. `splash_mountain` contains a series of dates and datetimes. For each datetime, `splash_mountain` contains a posted minimum wait time, `SPOSTMIN`, and an actual minimum wait time, `SACTMIN`. What is the average posted minimum wait time for Splash Mountain? What is the standard deviation? Based on the fact that `SPOSTMIN` represents the posted minimum wait time for our ride, does our mean and standard deviation make sense? Explain. (You might look ahead to Question 3 before writing the answer to Question 2.) - -[TIP] -==== -If you got `NA` or `NaN` as a result, see xref:programming-languages:R:mean.adoc[here]. -==== - -.Items to submit -==== -- R code used to solve this problem. -- The results of running the R code. -- 1-2 sentences explaining why or why not the results make sense. -==== - -=== Question 3 - -In (2), we got some peculiar values for the mean and standard deviation. If you read the "attractions" tab in the file `/class/datamine/data/disney/touringplans_data_dictionary.xlsx`, you will find that -999 is used as a value in `SPOSTMIN` and `SACTMIN` to indicate the ride as being closed. Recalculate the mean and standard deviation of `SPOSTMIN`, excluding values that are -999. Does this seem to have fixed our problem? - -.Items to submit -==== -- R code used to solve this problem. -- The result of running the R code. 
-- A statement indicating whether or not the value look reasonable now. -==== - -=== Question 4 - -`SPOSTMIN` and `SACTMIN` aren't the greatest feature/column names. An outsider looking at the data.frame wouldn't be able to immediately get the gist of what they represent. Change `SPOSTMIN` to `posted_min_wait_time` and `SACTMIN` to `actual_wait_time`. - -**Hint:** You can always use hard-coded integers to change names manually, however, if you use `which`, you can get the index of the column name that you would like to change. For data.frames like `splash_mountain`, this is a lot more efficient than manually counting which column is the one with a certain name. - -.Items to submit -==== -- R code used to solve the problem. -- The output from executing `names(splash_mountain)` or `colnames(splash_mountain)`. -==== - -=== Question 5 - -Use the `cut` function to create a new vector called `quarter` that breaks the `date` column up by quarter. Use the `labels` argument in the `factor` function to label the quarters "q1", "q2", ..., "qX" where `X` is the last quarter. Add `quarter` as a column named `quarter` in `splash_mountain`. How many quarters are there? - -[TIP] -==== -If you have 2 years of data, this will result in 8 quarters: "q1", ..., "q8". -==== - -[TIP] -==== -We can generate sequential data using `seq` and `paste0`: - -[source,r] ----- -paste0("item", seq(1, 5)) ----- - -or - -[source,r] ----- -paste0("item", 1:5) ----- -==== - -.Items to submit -==== -- R code used to solve the problem. -- The `head` and `tail` of `splash_mountain`. -- The number of quarters in the new `quarter` column. -==== - -Question 5 is intended to be a little more challenging, so we worked through the _exact_ same steps, with two other data sets. That way, if you work through these, all you will need to do, to solve Question 5, is to follow the example, and change two things, namely, the data set itself (in the `read.csv` file) and also the format of the date. - -This basically steps you through _everything_ in Question 5. - -We hope that these are helpful resources for you! We appreciate you very much and we are here to support you! You would not know how to solve this question on your own--because we are just getting started--but we like to sometimes put in a question like this, in which you get introduced to several new things, and we will dive deeper into these ideas as we push ahead. - -++++ - -++++ - -++++ - -++++ - -=== Question 6 - -Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." or if you disagree with this statement, please consult with us at datamine@purdue.edu for an alternative plan. diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project04.adoc deleted file mode 100644 index 4ba1177d9..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project04.adoc +++ /dev/null @@ -1,168 +0,0 @@ -= STAT 19000: Project 4 -- Fall 2020 - -**Motivation:** Control flow is (roughly) the order in which instructions are executed. We can execute certain tasks or code _if_ certain requirements are met using if/else statements. In addition, we can perform operations many times in a loop using for loops. 
While these are important concepts to grasp, R differs from other programming languages in that operations are usually vectorized and there is little to no need to write loops. - -**Context:** We are gaining familiarity working in RStudio and writing R code. In this project we introduce and practice using control flow in R. - -**Scope:** r, data.frames, recycling, factors, if/else, for - -.Learning objectives -**** -- Explain what "recycling" is in R and predict behavior of provided statements. -- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- List the differences between lists, vectors, factors, and data.frames, and when to use each. -- Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/disney` - -== Questions - -=== Question 1 - -Use `read.csv` to read in the `/class/datamine/data/disney/splash_mountain.csv` data into a `data.frame` called `splash_mountain`. In the previous project we calculated the mean and standard deviation of the `SPOSTMIN` (posted minimum wait time). These are vectorized operations (we will learn more about this next project). Instead of using the `mean` function, use a loop to calculate the mean (average), just like the previous project. Do not use `sum` either. - -[TIP] -==== -Remember, if a value is NA, we don't want to include it. -==== - -[TIP] -==== -Remember, if a value is -999, it means the ride is closed, we don't want to include it. -==== - -[NOTE] -==== -This exercise should make you appreciate the variety of useful functions R has to offer! -==== - -.Items to submit -==== -- R code used to solve the problem w/comments explaining what the code does. -- The mean posted wait time. -==== - -=== Question 2 - -Choose one of the `.csv` files containing data for a ride. Use `read.csv` to load the file into a data.frame named `ride_name` where "ride_name" is the name of the ride you chose. Use a for loop to loop through the ride file and add a new column called `status`. `status` should contain a string whose value is either "open", or "closed". If `SPOSTMIN` or `SACTMIN` is -999, classify the row as "closed". Otherwise, classify the row as "open". After `status` is added to your data.frame, convert the column to a `factor`. - -[TIP] -==== -If you want to access two columns at once from a data.frame, you can do: `splash_mountain[i, c("SPOSTMIN", "SACTMIN")]`. -==== - -[NOTE] -==== -For loops are often [much slower (here is a video to demonstrate)](#r-for-loops-versus-vectorized-functions) than vectorized functions, as we will see in (3) below. -==== - -.Items to submit -==== -- R code used to solve the problem w/comments explaining what the code does. -- The output from running `str` on `ride_name`. -==== - -In this video, we basically go all the way through Question 2 using a video: - -++++ - -++++ - -=== Question 3 - -Typically you want to avoid using for loops (or even apply functions (we will learn more about these later on, don't worry)) when they aren't needed. Instead you can use vectorized operations and indexing. 
Repeat (2) without using any for loops or apply functions (instead use indexing and the `which` function). Which method was faster? - -[TIP] -==== -To have multiple conditions within the `which` statement, use `|` for logical OR and `&` for logical AND. -==== - -[TIP] -==== -You can start by assigning every value in `status` as "open", and then change the correct values to "closed". -==== - -[NOTE] -==== -Here is a [complete example (very much like question 3) with another video](#r-example-safe-versus-contaminated) that shows how we can classify objects. -==== - -[NOTE] -==== -Here is a [complete example with a video](#r-example-for-loops-compared-to-vectorized-functions) that makes a comparison between the concept of a for loop versus the concept for a vectorized function. -==== - -.Items to submit -==== -- R code used to solve the problem w/comments explaining what the code does. -- The output from running `str` on `ride_name`. -==== - -=== Question 4 - -Create a pie chart for open vs. closed for `splash_mountain.csv`. First, use the `table` command to get a count of each `status`. Use the resulting table as input to the `pie` function. Make sure to give your pie chart a title that somehow indicates the ride to the audience. - -.Items to submit -==== -- R code used to solve the problem w/comments explaining what the code does. -- The resulting plot displayed as output in the RMarkdown. -==== - -=== Question 5 - -Loop through the vector of files we've provided below, and create a pie chart of open vs closed for each ride. Place all 6 resulting pie charts on the same image. Make sure to give each pie chart a title that somehow indicates the ride. - -[source,r] ----- -ride_names <- c("splash_mountain", "soarin", "pirates_of_caribbean", "expedition_everest", "flight_of_passage", "rock_n_rollercoaster") -ride_files <- paste0("/class/datamine/data/disney/", ride_names, ".csv") ----- - -[TIP] -==== -To place all of the resulting pie charts in the same image, prior to running the for loop, run `par(mfrow=c(2,3))`. -==== - -This is not exactly the same, but it is a similar example, using the campaign election data: - -[source,r] ----- -mypiechart <- function(x) { - myDF <- read.csv( paste0("/class/datamine/data/election/itcont", x, ".txt"), sep="|") - mystate <- rep("other", times=nrow(myDF)) - mystate[myDF$STATE == "CA"] <- "California" - mystate[myDF$STATE == "TX"] <- "Texas" - mystate[myDF$STATE == "NY"] <- "New York" - myDF$stateclassification <- factor(mystate) - pie(table(myDF$stateclassification)) -} -myyears <- c("1980","1984","1988","1992","1996","2000") -par(mfrow=c(2,3)) -for (i in myyears) { - mypiechart(i) -} ----- - -++++ - -++++ - -Here is another video, which guides students even more closely through Question 5. - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the problem w/comments explaining what the code does. -- The resulting plot displayed as output in the RMarkdown. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project05.adoc deleted file mode 100644 index 24a04054f..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project05.adoc +++ /dev/null @@ -1,152 +0,0 @@ -= STAT 19000: Project 5 -- Fall 2020 - -**Motivation:** As briefly mentioned in project 4, R differs from other programming languages in that _typically_ you will want to avoid using for loops, and instead use vectorized functions and the apply suite. 
In this project we will demonstrate some basic vectorized operations, and how they are better to use than loops. - -**Context:** While it was important to stop and learn about looping and if/else statements, in this project, we will explore the R way of doing things. - -**Scope:** r, data.frames, recycling, factors, if/else, for - -.Learning objectives -**** -- Explain what "recycling" is in R and predict behavior of provided statements. -- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- List the differences between lists, vectors, factors, and data.frames, and when to use each. -- Demonstrate a working knowledge of control flow in r: for loops . -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/fars` - -To get more information on the dataset, see https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/812602[here]. - -== Questions - -=== Question 1 - -The `fars` dataset contains a series of folders labeled by year. In each year folder there is (at least) the files `ACCIDENT.CSV`, `PERSON.CSV`, and `VEHICLE.CSV`. If you take a peek at any `ACCIDENT.CSV` file in any year, you'll notice that the column `YEAR` only contains the last two digits of the year. Add a new `YEAR` column that contains the _full_ year. Use the `rbind` function to create a data.frame called `accidents` that combines the `ACCIDENT.CSV` files from the years 1975 through 1981 (inclusive) into one big dataset. After creating that `accidents` data frame, change the values in the `YEAR` column from two digits to four digits (i.e., paste a 19 onto each year value). - -Here is a video to walk you through the method of solving Question 1. - -++++ - -++++ - -Here is another video, using two functions you have not (yet) learned, namely, `lapply` and `do.call`. You do **not** need to understand these yet. _It is just a glimpse of some powerful functions to come later in the course!_ - -++++ - -++++ - - -.Items to submit -==== -- R code used to solve the problem/comments explaining what the code does. -- The result of `unique(accidents$YEAR)`. -==== - -== Question 2 - -Using the new `accidents` data frame that you created in (1), how many accidents are there in which 1 or more drunk drivers were involved in an accident with a school bus? - -[TIP] -==== -Look at the variables `DRUNK_DR` and `SCH_BUS`. -==== - -Here is a video about a related problem with 3 fatalities (instead of considering drunk drivers). - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the problem/comments explaining what the code does. -- The result/answer itself. -==== - -=== Question 3 - -Again using the `accidents` data frame: For accidents involving 1 or more drunk drivers and a school bus, how many happened in each of the 7 years? Which year had the largest number of these types of accidents? - -Here is a video about the related problem with 3 fatalities (instead of considering drunk drivers), tabulated according to year. - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the problem/comments explaining what the code does. -- The results. -- Which year had the most qualifying accidents. 
-==== - -=== Question 4 - -Again using the `accidents` data frame: Calculate the mean number of motorists involved in an accident (variable `PERSON`) with i drunk drivers, where i takes the values from 0 through 6. - -[TIP] -==== -It is OK that there are no accidents involving just 5 drunk drivers. -==== - -[TIP] -==== -You can use either a `for` loop or a `tapply` function to accomplish this question. -==== - -Here is a video about the related problem with 3 fatalities (instead of considering drunk drivers). We calculate the mean number of fatalities for accidents with `i` drunk drivers, where `i` takes the values from 0 through 6. - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the problem/comments explaining what the code does. -- The output from running your code. -==== - -=== Question 5 - -Again using the `accidents` data frame: We have a theory that there are more accidents in cold weather months for Indiana and states around Indiana. For this question, only consider the data for which `STATE` is one of these: Indiana (18), Illinois (17), Ohio (39), or Michigan (26). Create a barplot that shows the number of accidents by `STATE` and by month (`MONTH`) simultanously. What months have the most accidents? Are you surprised by these results? Explain why or why not? - -We guide students through the methodology for Question 5 in this video. We also add a legend, in case students want to distinguish which stacked barplot goes with each of the four States. - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the problem/comments explaining what the code does. -- The output (plot) from running your code. -- 1-2 sentences explaining which month(s) have the most accidents and whether or not this surprises you. -==== - -=== OPTIONAL QUESTION - -Spruce up your plot from (5). Do any of the following: - -- Add vibrant (and preferably colorblind friendly) colors to your plot -- Add a title -- Add a legend -- Add month names or abbreviations instead of numbers - -[TIP] -==== -https://www.r-graph-gallery.com/209-the-options-of-barplot.html[Here] is a resource to get you started. -==== - -.Items to submit -==== -- R code used to solve the problem/comments explaining what the code does. -- The output (plot) from running your code. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project06.adoc deleted file mode 100644 index 149f198d1..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project06.adoc +++ /dev/null @@ -1,271 +0,0 @@ -= STAT 19000: Project 6 -- Fall 2020 - -The `tapply` function works like this: - -`tapply( somedata, thewaythedataisgrouped, myfunction)` - -[source,r] ----- -myDF <- read.csv("/class/datamine/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv") -head(myDF) ----- - -We could do four computations to compute the `mean` `SPEND` amount in each `STORE_R`... - -[source,r] ----- -mean(myDF$SPEND[myDF$STORE_R == "CENTRAL"]) -mean(myDF$SPEND[myDF$STORE_R == "EAST "]) -mean(myDF$SPEND[myDF$STORE_R == "SOUTH "]) -mean(myDF$SPEND[myDF$STORE_R == "WEST "]) ----- - -...but it is easier to do all four of these calculations with the `tapply` function. We take a `mean` of the `SPEND` values, broken into groups according to the `STORE_R`: - -[source,r] ----- -tapply( myDF$SPEND, myDF$STORE_R, mean) ----- - -We could find the total amount in the `SPEND` column in 2016 and then again in 2017... 
- -[source,r] ----- -sum(myDF$SPEND[myDF$YEAR == "2016"]) -sum(myDF$SPEND[myDF$YEAR == "2017"]) ----- - -...or we could do both of these calculations at once, using the `tapply` function. We take the `sum` of all `SPEND` amounts, broken into groups according to the `YEAR` - -[source,r] ----- -tapply(myDF$SPEND, myDF$YEAR, sum) ----- - -As a last example, we can calculate the amount spent on each day of purchases. -We take the `sum` of all `SPEND` amounts, broken into groups according to the `PURCHASE_` day: - -[source,r] ----- -tapply(myDF$SPEND, myDF$PURCHASE_, sum) ----- - -[source,r] ----- -tail(sort( tapply(myDF$SPEND, myDF$PURCHASE_, sum) ),n=20) ----- - -It makes sense to sort the results and then look at the 20 days on which the `sum` of the `SPEND` amounts were the highest. - -++++ - -++++ - -[source,r] ----- -tapply( mydata, mygroups, myfunction, na.rm=T ) ----- - -Some generic uses to explain how this would look, if we made the calculations in a naive/verbose/painful way: - -[source,r] ----- -myfunction(mydata[mygroups == 1], na.rm=T) -myfunction(mydata[mygroups == 2], na.rm=T) -myfunction(mydata[mygroups == 3], na.rm=T) .... -myfunction(mydata[mygroups == "IN"], na.rm=T) -myfunction(mydata[mygroups == "OH"], na.rm=T) -myfunction(mydata[mygroups == "IL"], na.rm=T) .... ----- - - -[source,r] ----- -myDF <- read.csv("/class/datamine/data/flights/subset/2005.csv") -head(myDF) ----- - -`sum` all flight `Distance`, split into groups according to the airline (`UniqueCarrier`). - -[source,r] ----- -sort(tapply(myDF$Distance, myDF$UniqueCarrier, sum)) ----- - -Find the `mean` flight `Distance`, grouped according to the city of `Origin`. - -[source,r] ----- -sort(tapply(myDF$Distance, myDF$Origin, mean)) ----- - -Calculate the `mean` departure delay (`DepDelay`), for each airplane (i.e., each `TailNum`), using `na.rm=T` because some of the values of the departure delays are `NA`. - -[source,r] ----- -tail(sort(tapply(myDF$DepDelay, myDF$TailNum, mean, na.rm=T)),n=20) ----- - -++++ - -++++ - - -[source,r] ----- -library(data.table) -myDF <- fread("/class/datamine/data/election/itcont2016.txt", sep="|") -head(myDF) ----- - -`sum` the amounts of all contributions made, grouped according to the `STATE` where the people lived. - -[source,r] ----- -sort(tapply(myDF$TRANSACTION_AMT, myDF$STATE, sum)) ----- - -`sum` the amounts of all contributions made, grouped according to the `CITY`/`STATE` where the people lived. - -[source,r] ----- -tail(sort(tapply(myDF$TRANSACTION_AMT, paste(myDF$CITY, myDF$STATE), sum)),n=20) -mylocations <- paste(myDF$CITY, myDF$STATE) -tail(sort(tapply(myDF$TRANSACTION_AMT, mylocations, sum)),n=20) ----- - -`sum` the amounts of all contributions made, grouped according to the `EMPLOYER` where the people worked. - -[source,r] ----- -tail(sort(tapply(myDF$TRANSACTION_AMT, myDF$EMPLOYER, sum)), n=30) ----- - -++++ - -++++ - -**Motivation:** `tapply` is a powerful function that allows us to group data, and perform calculations on that data in bulk. The "apply suite" of functions provide a fast way of performing operations that would normally require the use of loops. Typically, when writing R code, you will want to use an "apply suite" function rather than a for loop. - -**Context:** The past couple of projects have studied the use of loops and/or vectorized operations. In this project, we will introduce a function called `tapply` from the "apply suite" of functions in R. 
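
To make the contrast concrete, here is a small sketch using a tiny made-up data.frame (`fakeDF`, not one of the course datasets) that computes the same grouped means first with a loop and then with `tapply`:

[source,r]
----
# A tiny, made-up data.frame: spend amounts recorded in two regions
fakeDF <- data.frame(
  region = c("EAST", "EAST", "WEST", "WEST", "WEST"),
  spend  = c(10, 20, 5, 15, 40)
)

# The loop way: compute the mean spend for each region, one group at a time
results <- numeric(0)
for (reg in unique(fakeDF$region)) {
  results[reg] <- mean(fakeDF$spend[fakeDF$region == reg])
}
results

# The tapply way: the same grouped means in a single line
tapply(fakeDF$spend, fakeDF$region, mean)
----

Both approaches produce the same named vector of means; the `tapply` version simply states the data, the grouping, and the function, and lets R handle the bookkeeping.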
- -**Scope:** r, for, tapply - -.Learning objectives -**** -- Explain what "recycling" is in R and predict behavior of provided statements. -- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- List the differences between lists, vectors, factors, and data.frames, and when to use each. -- Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc. -- Demonstrate how apply functions are generally faster than using loops. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/fars/7581.csv` - - -== Questions - -[NOTE] -==== -Please make sure to **double check** that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. -==== - -[NOTE] -==== -Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to lose points. -==== - -=== Question 1 - -The dataset, `/class/datamine/data/fars/7581.csv` contains the combined accident records from year 1975 to 1981. Load up the dataset into a data.frame named `dat`. In the previous project's question 4, we asked you to calculate the mean number of motorists involved in an accident (variable `PERSON`) with i drunk drivers where i takes the values from 0 through 6. This time, solve this question using the `tapply` function instead. Which method did you prefer and why? - -Now that you've read the data into a dataframe named `dat`, run the following code: - -[source,r] ----- -# Read in data that maps state codes to state names -state_names <- read.csv("/class/datamine/data/fars/states.csv") -# Create a vector of state names called v -v <- state_names$state -# Set the names of the new vector to the codes -names(v) <- state_names$code -# Create a new column in the dat dataframe with the actual names of the states -dat$mystates <- v[as.character(dat$STATE)] ----- - -.Items to submit -==== -- R code used to solve the problem. -- The output/solution. -==== - -=== Question 2 - -Make a state-by-state classification of the average number of drunk drivers in an accident. Which state has the highest average number of drunk drivers per accident? - -.Items to submit -==== -- R code used to solve the problem. -- The entire output. -- Which state has the highest average number of drunk drivers per accident? -==== - -=== Question 3 - -Add up the total number of fatalities, according to the day of the week on which they occurred. Are the numbers surprising to you? What days of the week have a higher number of fatalities? If instead you calculate the proportion of fatalities over the total number of people in the accidents, what would you expect? Calculate it and see if your expectations match. - -[TIP] -==== -Sundays through Saturdays are days 1 through 7, respectively. Day 9 indicates that the day is unknown. 
-==== - -This video example uses the Amazon fine food reviews dataset to make a similar calculation, in which we have two tapply statements, and we divide the results to get a ton of similar ratios all at once. Powerful stuff! It may guide you in your thinking about this question. - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the problem. -- What days have the highest number of fatalities? -- What would you expect if you calculate the proportion of fatalities over the total number of people in the accidents? -==== - -=== Question 4 - -How many drunk drivers are involved, on average, in crashes that occur on straight roads? How many drunk drivers are involved, on average, in crashes that occur on curved roads? Solve the pair of questions in a single line of R code. - -[TIP] -==== -The `ALIGNMNT` variable is 1 for straight, 2 for curved, and 9 for unknown. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Results from running the R code. -==== - -=== Question 5 - -Break the day into portions, as follows: midnight to 6AM, 6AM to 12 noon, 12 noon to 6PM, 6PM to midnight, other. Find the total number of fatalities that occur during each of these time intervals. Also, find the average number of fatalities per crash that occurs during each of these time intervals. - -This example demonstrates a comparable calculation. In the video, I used the total number of people in the accident, and your question is (instead) about the number of fatalities, but this is essentially the only difference. I hope it helps to explain the way that the cut function works, along with the analogous breaks. - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the problem. -- Results from running the R code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project07.adoc deleted file mode 100644 index b3d80d09a..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project07.adoc +++ /dev/null @@ -1,166 +0,0 @@ -= STAT 19000: Project 7 -- Fall 2020 - -**Motivation:** Three bread-and-butter functions that are a part of the base R are: `subset`, `merge`, and `split`. `subset` provides a more natural way to filter and select data from a data.frame. `split` is a useful function that splits a dataset based on one or more factors. `merge` brings the principals of combining data that SQL uses, to R. - -**Context:** We've been getting comfortable working with data in within the R environment. Now we are going to expand our toolset with three useful functions, all the while gaining experience and practice wrangling data! - -**Scope:** r, subset, merge, split, tapply - -.Learning objectives -**** -- Gain proficiency using split, merge, and subset. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Demonstrate how to use tapply to solve data-driven problems. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/goodreads/csv` - -== Questions - -[IMPORTANT] -==== -Please make sure to **double check** that the your submission does indeed contain the files you think it does. 
You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. -==== - -[IMPORTANT] -==== -Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to lose points. -==== - -=== Question 1 - -Load up the following two datasets `goodreads_books.csv` and `goodreads_book_authors.csv` into the data.frames `books` and `authors`, respectively. How many columns and rows are in each of these two datasets? - -.Items to submit -==== -- R code used to solve the problem. -- The result of running the R code. -==== - -=== Question 2 - -We want to figure out how book size (`num_pages`) is associated with various metrics. First, let's create a vector called `book_size`, that categorizes books into 4 categories based on `num_pages`: `small` (up to 250 pages), `medium` (250-500 pages), `large` (500-1000 pages), `huge` (1000+ pages). - -[NOTE] -==== -This [video and code](#r-lapply-flight-example) might be helpful. -==== - -.Items to submit -==== -- R code used to solve the problem. -- The result of `table(book_size)`. -==== - -=== Question 3 - -Use `tapply` to calculate the mean `average_rating`, `text_reviews_count`, and `publication_year` by `book_size`. Did any of the result surprise you? Why or why not? - -.Items to submit -==== -- R code used to solve the problem. -- The output from running the R code. -==== - -=== Question 4 - -Notice in (3) how we used `tapply` 3 times. This would get burdensome if we decided to calculate 4 or 5 or 6 columns instead. Instead of using tapply, we can use `split`, `lapply`, and `colMeans` to perform the same calculations. - -Use `split` to partition the data containing only the following 3 columns: `average_rating`, `text_reviews_count`, and `publication_year`, by `book_size`. Save the result as `books_by_size`. What class is the result? `lapply` is a function that allows you to loop over each item in a list and apply a function. Use `lapply` and `colMeans` to perform the same calculation as in (3). - -[NOTE] -==== -This [video and code](#r-lapply-flight-example) and also this [video and code](#r-lapply-fars-example) might be helpful. -==== - -.Items to submit -==== -- R code used to solve the problem. -- The output from running the code. -==== - -=== Question 5 - -We are working with a lot more data than we really want right now. We've provided you with the following code to filter out non-English books and only keep columns of interest. This will create a data frame called `en_books`. - -[source,r] ----- -en_books <- books[books$language_code %in% c("en-US", "en-CA", "en-GB", "eng", "en", "en-IN") & books$publication_year > 2000, c("author_id", "book_id", "average_rating", "description", "title", "ratings_count", "language_code", "publication_year")] ----- - -Now create an equivalent data frame of your own, by using the `subset` function (instead of indexing). Use `res` as the name of the data frame that you create. -Do the dimensions (using `dim`) of `en_books` and `res` agree? Why or why not? (They should both have 8 columns, but a different number of rows.) - -[TIP] -==== -Since the dimensions don't match, take a look at NA values for the variables used to subset our data. 
-==== - -[NOTE] -==== -This [video and code](#r-subset-8451-example) and also this [video and code](#r-subset-election-example) might be helpful. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Do the dimensions match? -- 1-2 sentences explaining why or why not. -==== - -=== Question 6 - -We now have a nice and tidy subset of data, called `res`. It would be really nice to get some information on the authors. We can find that information in `authors` dataset loaded in question 1! In question 2 of the previous project, we had a similar issue with the states names. There is a *much* better and easier way to solve these types of problems. Use the `merge` function to combine `res` and `authors` in a way which appends all information from `author` when there is a match in `res`. Use the condition `by="author_id"` in the merge. This is all you need to do: - -[source,r] ----- -mymergedDF <- merge(res, authors, by="author_id") ----- - -[NOTE] -==== -The resulting data frame will have all of the columns that are found in either `res` or `authors`. When we perform the merge, we only insist that the `author_id` should match. We do not expect that the `ratings_count` or `average_rating` should agree in `res` versus `authors`. Why? In the `res` data frame, the `ratings_count` and `average_rating` refer to the specific book, but in the `authors` data frame, the `ratings_count` and `average_rating` refer to the total works by the author. Therefore, in `mymergedDF`, there are columns `ratings_count.x` and `average_rating.x` from `res`, and there are columns `ratings_count.y` and `average_rating.y` from `authors`. -==== - -[NOTE] -==== -Although we provided the necessary code for this example, you might want to know more about the merge function. This [video and code](#r-merge-fars-example) and also this [video and code](#r-merge-flights-example) might be helpful. -==== - -.Items to submit -==== -- the given R code used to solve the problem. -- The `dim` of the newly merged data.frame. -==== - -=== Question 7 - -For an author of your choice (that _is_ in the dataset), find the author's highest rated book. Do you agree? - -.Items to submit -==== -- R code used to solve the problem. -- The title of the highest rated book (from your author). -- 1-2 sentences explaining why or why not you agree with it being the highest rated book from that author. -==== - -=== OPTIONAL QUESTION - -Look at the column names of the new dataframe created in question 6. Notice that there are two values for `ratings_count` and two values for `average_rating`. The names that have an appended `x` are those values from the first argument to `merge`, and the names that have an appended `y`, are those values from the second argument to `merge`. Rename these columns to indicate if they refer to a book, or an author. - -[TIP] -==== -For example, `ratings_count.x` could be `ratings_count_book` or `ratings_count_author`. -==== - -.Items to submit -==== -- R code used to solve the problem. -- The `names` of the new data.frame. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project08.adoc deleted file mode 100644 index 56d4c38a3..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project08.adoc +++ /dev/null @@ -1,127 +0,0 @@ -= STAT 19000: Project 8 -- Fall 2020 - -**Motivation:** A key component to writing efficient code is writing functions. 
Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code! - -**Context:** We've been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon. - -**Scope:** r, functions - -.Learning objectives -**** -- Gain proficiency using split, merge, and subset. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Demonstrate how to use tapply to solve data-driven problems. -- Comprehend what a function is, and the components of a function in R. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/goodreads/csv` - -== Questions - -[IMPORTANT] -==== -Please make sure to **double check** that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. -==== - -[IMPORTANT] -==== -Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to lose points. -==== - -=== Question 1 - -Read in the same data, in the same way as the previous project (with the same names). We've provided you with the function below. How many arguments does the function have? Name all of the arguments. What is the name of the function? Replace the `description` column in our `books` data.frame with the same information, but with stripped punctuation using the function provided. - -[source,r] ----- -# A function that, given a string (myColumn), returns the string -# without any punctuation. -strip_punctuation <- function(myColumn) { - # Use regular expressions to identify punctuation. - # Replace identified punctuation with an empty string ''. - desc_no_punc <- gsub('[[:punct:]]+', '', myColumn) - - # Return the result - return(desc_no_punc) -} ----- - -[TIP] -==== -Since `gsub` accepts a vector of values, you can pass an entire vector to `strip_punctuation`. -==== - -.Items to submit -==== -- R code used to solve the problem. -- How many arguments does the function have? -- What are the name(s) of all of the arguments? -- What is the name of the function? -==== - -=== Question 2 - -Use the `strsplit` function to split a string by spaces. Some examples would be: - -[source,r] ----- -strsplit("This will split by space.", " ") -strsplit("This. Will. Split. By. A. Period.", "\\.") ----- - -An example string is: - -[source,r] ----- -test_string <- "This is a test string with no punctuation" ----- - -Test out `strsplit` using the provided `test_string`. Make sure to copy and paste the code that declares `test_string`. If you counted the words shown in your results, would it be an accurate count? Why or why not? - -**Relevant topics:** [strsplit](#r-strsplit), [functions](#r-writing-functions) - -.Items to submit -==== -- R code used to solve the problem. 
-- 1-2 sentences explaining why or why not your count would be accurate. -==== - -=== Question 3 - -Fix the issue in (2), using `which`. You may need to `unlist` the `strsplit` result first. After you've accomplished this, you can count the remaining words! - -.Items to submit -==== -- R code used to solve the problem (including counting the words). -==== - -=== Question 4 - -We are finally to the point where we have code from questions (2) and (3) that we think we may want to use many times. Write a function called `count_words` which, given a string, `description`, returns the number of words in `description`. Test out `count_words` on the `description` from the second row of `books`. How many words are in the description? - -.Items to submit -==== -- R code used to solve the problem. -- The result of using the function on the `description` from the second row of `books`. -==== - -=== Question 5 - -Practice makes perfect! Write a function of your own design that is intended on being used with one of our datasets. Test it out and share the results. - -[NOTE] -==== -You could even pass (as an argument) one of our datasets to your function and calculate a cool statistic or something like that! Maybe your function makes a plot? Who knows? -==== - -.Items to submit -==== -- R code used to solve the problem. -- An example (with output) of using your newly created function. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project09.adoc deleted file mode 100644 index 046f8d380..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project09.adoc +++ /dev/null @@ -1,167 +0,0 @@ -= STAT 19000: Project 9 -- Fall 2020 - - -**Motivation:** A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code! - -**Context:** We've been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon. - -**Scope:** r, functions - -.Learning objectives -**** -- Gain proficiency using split, merge, and subset. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Demonstrate how to use tapply to solve data-driven problems. -- Comprehend what a function is, and the components of a function in R. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/goodreads/csv` - -== Questions - -[IMPORTANT] -==== -Please make sure to **double check** that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. -==== - -[IMPORTANT] -==== -Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. 
Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to lose points. -==== - -=== Question 1 - -We've provided you with a function below. How many arguments does the function have, and what are their names? You can get a `book_id` from the URL of a goodreads book's webpage. - -Two examples: - -* If you search for the book `Words of Radiance` on goodreads, the `book_id` contained in the url https://www.goodreads.com/book/show/17332218-words-of-radiance#, is 17332218. -* https://www.goodreads.com/book/show/157993.The_Little_Prince?from_search=true&from_srp=true&qid=JJGqUK9Vp9&rank=1, (the little prince) with a `book_id` of 157993. - -Find 2 or 3 `book_id` and test out the function until you get two successes. Explain in words, what the function is doing, and what options you have. - -[source,r] ----- -library(imager) -books <- read.csv("/class/datamine/data/goodreads/csv/goodreads_books.csv") -authors <- read.csv("/class/datamine/data/goodreads/csv/goodreads_book_authors.csv") -get_author_name <- function(my_authors_dataset, my_author_id){ - return(my_authors_dataset[my_authors_dataset$author_id==my_author_id,'name']) -} -fun_plot <- function(my_authors_dataset, my_books_dataset, my_book_id, display_cover=T) { - book_info <- my_books_dataset[my_books_dataset$book_id==my_book_id,] - all_books_by_author <- my_books_dataset[my_books_dataset$author_id==book_info$author_id,] - author_name <- get_author_name(my_authors_dataset, book_info$author_id) - - img <- load.image(book_info$image_url) - - if(display_cover){ - par(mfrow=c(1,2)) - plot(img, axes=FALSE) - } - - plot(all_books_by_author$num_pages, all_books_by_author$average_rating, - ylim=c(0,5.1), pch=21, bg='grey80', - xlab='Number of pages', ylab='Average rating', - main=paste('Books by', author_name)) - - points(book_info$num_pages, book_info$average_rating,pch=21, bg='orange', cex=1.5) -} ----- - -.Items to submit -==== -- How many arguments does the function have, and what are their names? -- The result of using the function on 2-3 `book_id`s. -- 1-2 sentences explaining what the function does (generally), and what (if any) options the function provides you with. -==== - -=== Question 2 - -You may have encountered a situation where the `my_book_id` was not in our dataset, and hence, didn't get plotted. When writing functions, it is usually best to try and foresee issues like this and have the function fail gracefully, instead of showing some ugly (and sometimes unclear) warning. Add some code at the beginning of our function that checks to see if `my_book_id` is within our dataset, and if it does not exist, prints "Book ID not found.", and exits the function. Test it out on `book_id=123` and `book_id=19063`. - -[TIP] -==== -Run `?stop` to see if that is a function that may be useful. -==== - -.Items to submit -==== -- R code with your new and improved function. -- The results from `fun_plot(123)`. -- The results from `fun_plot(19063)`. -==== - -=== Question 3 - -We have this nice `get_author_name` function that accepts a dataset (in this case, our `authors` dataset), and a `book_id` and returns the name of the author. Write a new function called `get_author_id` that accepts an authors name and returns the `author_id` of the author. - -You can test your function using some of these examples: - -[source,r] ----- -get_author_id(authors, "Brandon Sanderson") # 38550 -get_author_id(authors, "J.K. 
Rowling") # 1077326 ----- - -.Items to submit -==== -- R code containing your new function. -- The results of using your new function on a few authors. -==== - -=== Question 4 - -See the function below. - -[source,r] ----- -search_books_for_word <- function(word) { - return(books[grepl(word, books$description, fixed=T),]$title) -} ----- - -Given a word, `search_books_for_word` returns the titles of books where the provided word is inside the book's description. `search_books_for_word` utilizes the `books` dataset internally. It requires that the `books` dataset has been loaded into the environment prior to running (and with the correct name). By including and referencing objects defined _outside_ of our function's scope _within_ our function (in this case the variable `books`), our `search_books_for_word` function will be more prone to errors, as any changes to those objects may break our function. For example: - -[source,r] ----- -our_function <- function(x) { - print(paste("Our argument is:", x)) - print(paste("Our variable is:", my_variable)) -} -# our variable outside the scope of our_function -my_variable <- "dog" -# run our_function -our_function("first") -# change the variable outside the scope of our function -my_variable <- "cat" -# run our_function again -our_function("second") -# imagine a scenario where "my_variable" doesn't exist, our_function would break! -rm(my_variable) -our_function("third") ----- - -Fix our `search_books_for_word` function to accept the `books` dataset as an argument called `my_books_dataset` and utilize `my_books_dataset` within the function instead of the global variable `books`. - -.Items to submit -==== -- R code with your new and improved function. -- An example using the updated function. -==== - -=== Question 5 - -Write your own custom function. Make sure your function includes at least 2 arguments. If you access one of our datasets from within your function (which you _definitely_ should do), use what you learned in (4), to avoid future errors dealing with scoping. Your function could output a cool plot, interesting tidbits of information, or anything else you can think of. Get creative and make a function that is fun to use! - -.Items to submit -==== -- R code used to solve the problem. -- Examples using your function with included output. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project10.adoc deleted file mode 100644 index de1aa08fa..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project10.adoc +++ /dev/null @@ -1,249 +0,0 @@ -= STAT 19000: Project 10 -- Fall 2020 - -**Motivation:** Functions are powerful. They are building blocks to more complex programs and behavior. In fact, there is an entire programming paradigm based on functions called https://en.wikipedia.org/wiki/Functional_programming[functional programming]. In this project, we will learn to _apply_ functions to entire vectors of data using `sapply`. - -**Context:** We've just taken some time to learn about and create functions. One of the more common "next steps" after creating a function is to use it on a series of data, like a vector. `sapply` is one of the best ways to do this in R. - -**Scope:** r, sapply, functions - -.Learning objectives -**** -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Utilize apply functions in order to solve a data-driven problem. 
-- Gain proficiency using split, merge, and subset. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/okcupid/filtered` - -== Questions - -[IMPORTANT] -==== -Please make sure to **double check** that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. -==== - -[IMPORTANT] -==== -Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs may lose points. -==== - - -=== Question 1 - -Load up the following datasets into data.frames named `users` and `questions`, respectively: `/class/datamine/data/okcupid/filtered/users.csv`, `/class/datamine/data/okcupid/filtered/questions.csv`. This is data from users on OkCupid, an online dating app. In your own words, explain what each file contains and how they are related -- it's _always_ a good idea to poke around the data to get a better understanding of how things are structured! - -[TIP] -==== -Be careful: just because a file ends in `.csv` does _not_ mean it is comma-separated. You can change what separator `read.csv` uses with the `sep` argument. You can use the `readLines` function on a file (say, with `n=10`, for instance), to see the first lines of a file, and determine the character to use with the `sep` argument. -==== - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the problem. -- 1-2 sentences describing what each file contains and how they are related. -==== - -=== Question 2 - -`grep` is an incredibly powerful tool available to us in R. We will learn more about `grep` in the future, but for now, know that a simple application of `grep` is to find a word in a string. In R, `grep` is vectorized and can be applied to an entire vector of strings. Use `grep` to find a question that references "google". What is the question? - -++++ - -++++ - -++++ - -++++ - -[TIP] -==== -If at first you don't succeed, run `?grep` and check out the `ignore.case` argument. -==== - -[TIP] -==== -To prepare for Question 3, look at the entire row of the `questions` data frame that has the question about google. The first entry on this row tells you the question that you need, in the `users` data frame, while working on Question 3. -==== - -.Items to submit -==== -- R code used to solve the problem. -- The `text` of the question that references Google. -==== - -=== Question 3 - -In (2) we found a pretty interesting question. What is the percentage of users that Google someone before the first date? Does the proportion change by gender (as defined by `gender2`)? How about by `gender_orientation`? - -[TIP] -==== -The two videos posted in Question 2 might help. -==== - -[TIP] -==== -If you look at the column of `users` corresponding to the question identified in (2), you will see that this column of `users` has two possible answers, namely: `"No. Why spoil the mystery?"` and `"Yes. Knowledge is power!"`. 
-==== - -[TIP] -==== -Use the `tapply` function with three inputs: -==== - -the correct column of `users`, - -breaking up the data according to `gender2` or according to `gender_orientation`, - -and use this as your function in the `tapply`: - -`function(x) {prop.table(table(x, useNA="always"))}` - -.Items to submit -==== -- R code used to solve this problem. -- The results of running the code. -- Written answers to the questions. -==== - -=== Question 4 - -In Project 8, we created a function called `count_words`. Use this function and `sapply` to create a vector which contains the number of words in each row of the column `text` from the `questions` dataframe. Call the new vector `question_length`, and add it as a column to the `questions` dataframe. - -[source,r] ----- -count_words <- function(my_text) { - my_split_text <- unlist(strsplit(my_text, " ")) - - return(length(my_split_text[my_split_text!=""])) -} ----- - -++++ - -++++ - -.Items to submit -==== -- R code used to solve this problem. -- The result of `str(questions)` (this shows how your `questions` data frame looks, after adding the new column called `question_length`). -==== - -=== Question 5 - -Consider this function called `number_of_options` that accepts a data frame (for instance, `questions`)... - -[source,r] ----- -number_of_options <- function(myDF) { - table(apply(as.matrix(myDF[ ,3:6]), 1, function(x) {sum(!(x==""))})) -} ----- - -...and counts the number of questions that have each possible number of responses. For instance, if we calculate `number_of_options(questions)` we get: - -```` - 0 2 3 4 -590 936 519 746 -```` - -which means that: -590 questions have 0 possible responses; -936 questions have 2 possible responses; -519 questions have 3 possible responses; and -746 questions have 4 possible responses. - -Now use the `split` function to break the data frame `questions` into 7 smaller data frames, according to the value in `questions$Keywords`. Then use the `sapply` function to determine, for each possible value of `questions$Keywords`, the analogous breakdown of questions with different numbers of responses, as we did above. - -[TIP] -==== -You can write: - -[source,r] ----- -mylist <- split(questions, questions$Keywords) -sapply(mylist, number_of_options) ----- -==== - -++++ - -++++ - -The way `sapply` works is the the first argument is by default the first argument to your function, the second argument is the function you want applied, and after that you can specify arguments by name. For example: - -[source,r] ----- -test1 <- c(1, 2, 3, 4, NA, 5) -test2 <- c(9, 8, 6, 5, 4, NA) -mylist <- list(first=test1, second=test2) -# for a single vector in the list -mean(mylist$first, na.rm=T) -# what if we want to do this for each vector in the list? -# how do we remove na's? -sapply(mylist, mean) -# we can specify the arguments that are for the mean function -# by naming them after the first two arguments, like this -sapply(mylist, mean, na.rm=T) -# in the code shown above, na.rm=T is passed to the mean function -# just like if you run the following -mean(mylist$first, na.rm=T) -mean(mylist$second, na.rm=T) -# you can include as many arguments to mean as you normally would -# and in any order. just make sure to name the arguments -sapply(mylist, mean, na.rm=T, trim=0.5) -# or sapply(mylist, mean, trim=0.5, na.rm=T) -# which is similar to -mean(mylist$first, na.rm=T, trim=0.5) -mean(mylist$second, na.rm=T, trim=0.5) ----- - -.Items to submit -==== -- R code used to solve this problem. 
-- The results of the running the code. -==== - -=== Question 6 - -_Lots_ of questions are asked in this `okcupid` dataset. Explore the dataset, and either calculate an interesting statistic/result using `sapply`, or generate a graphic (with good x-axis and/or y-axis labels, main labels, legends, etc.), or both! Write 1-2 sentences about your analysis and/or graphic, and explain what you thought you'd find, and what you actually discovered. - -++++ - -++++ - -.Items to submit -==== -- R code used to solve this problem. -- The results from running your code. -- 1-2 sentences about your analysis and/or graphic, and explain what you thought you'd find, and what you actually discovered. -==== - -=== OPTIONAL QUESTION - -Does it appear that there is an association between the length of the question and whether or not users answered the question? Assume NA means "unanswered". First create a function called `percent_answered` that, given a vector, returns the percentage of values that are not NA. Use `percent_answered` and `sapply` to calculate the percentage of users who answer each question. Plot this result, against the length of the questions. - -[TIP] -==== -`length_of_questions <- questions$question_length[grep("^q", questions$X)]` -==== - -[TIP] -==== -`grep("^q", questions$X)` returns the column index of every column that starts with "q". Use the same trick we used in the previous hint, to subset our `users` data.frame before using `sapply` to apply `percent_answered`. -==== - -.Items to submit -==== -- R code used to solve this problem. -- The plot. -- Whether or not you think there may or may not be an association between question length and whether or not the question is answered. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project11.adoc deleted file mode 100644 index 431ef67be..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project11.adoc +++ /dev/null @@ -1,145 +0,0 @@ -= STAT 19000: Project 11 -- Fall 2020 - -**Motivation:** The ability to understand a problem, know what tools are available to you, and select the right tools to get the job done, takes practice. In this project we will use what you've learned so far this semester to solve data-driven problems. In previous projects, we've directed you towards certain tools. In this project, there will be less direction, and you will have the freedom to choose the tools you'd like. - -**Context:** You've learned lots this semester about the R environment. You now have experience using a very balanced "portfolio" of R tools. We will practice using these tools on a set of economic data from Zillow. - -**Scope:** R - -.Learning objectives -**** -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Utilize apply functions in order to solve a data-driven problem. -- Gain proficiency using split, merge, and subset. -- Comprehend what a function is, and the components of a function in R. -- Demonstrate the ability to use nested apply functions to solve a data-driven problem. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/zillow` - -== Questions - -=== Question 1 - -Read `/class/datamine/data/zillow/Zip_time_series.csv` into a data.frame called `zipc`. Look at the `RegionName` column. It is supposed to be a 5-digit zip code. 
Either fix the column by writing a function and applying it to the column, or take the time to read the `read.csv` documentation by running `?read.csv` and use an argument to make sure that column is not read in as an integer (which is _why_ zip codes starting with `0` lose the leading `0` when being read in). - -[TIP] -==== -This video demonstrates how to read in data and respect the leading zeroes. -==== - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the problem. -- `head` of the `RegionName` column. -==== - -=== Question 2 - -One might assume that the owner of a house tends to value that house more than the buyer. If that was the case, perhaps the median listing price (the price which the seller puts the house on the market, or ask price) would be higher than the ZHVI (Zillow Home Value Index -- essentially an estimate of the home value). For those rows where both `MedianListingPrice_AllHomes` and `ZHVI_AllHomes` have non-NA values, on average how much higher or lower is the median listing price? Can you think of any other reasons why this may be? - -.Items to submit -==== -- R code used to solve the problem. -- The result itself and 1-2 sentences talking about whether or not you can think of any other reasons that may explain the result. -==== - -=== Question 3 - -Convert the `Date` column to a date using `as.Date`. How many years of data do we have in this dataset? Create a line plot with lines for the average `MedianListingPrice_AllHomes` and average `ZHVI_AllHomes` by year. The result should be a single plot with multiple lines on it. - -[TIP] -==== -Here we give two videos to help you with this question. The first video gives some examples about working with dates in R. -==== - -++++ - -++++ - -[TIP] -==== -This second video gives an example about how to plot two line graphs at the same time in R. -==== - -++++ - -++++ - -[TIP] -==== -For a nice addition, add a dotted vertical line on year 2008 near the housing crisis: -==== - -```{r, eval=F} -abline(v="2008", lty="dotted") -``` - -.Items to submit -==== -- R code used to solve the problem. -- The results of running the code. -==== - -=== Question 4 - -Read `/class/datamine/data/zillow/State_time_series.csv` into a data.frame called `states`. Calculate the average median listing price by state, and create a map using `plot_usmap` from the `usmap` package that shows the average median price by state. - -[TIP] -==== -We give a full example about how to plot values, by State, on a map. -==== - -++++ - -++++ - -[TIP] -==== -In order for `plot_usmap` to work, you must name the column containing states' names to "state". -==== - -[TIP] -==== -To split words like "OhSoCool" into "Oh So Cool", try this: `trimws(gsub('([[:upper:]])', ' \\1', "OhSoCool"))`. This will be useful as you'll need to correct the `RegionName` column at some point in time. Notice that this will not completely fix "DistrictofColumbia". You will need to fix that one manually. -==== - -.Items to submit -==== -- R code used to solve the problem. -- The resulting map. -==== - -=== Question 5 - -Read `/class/datamine/data/zillow/County_time_series.csv` into a data.frame named `counties`. Choose a state (or states) that you would like to "dig down" into county-level data for, and create a plot (or plots) like in (4) that show some interesting statistic by county. You can choose average median listing price if you so desire, however, you don't need to! There are other cool data! 
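[NOTE]
====
If it helps to see the moving parts before reading the tips below, here is a rough sketch of one possible approach (it is not the only one, and it is not required). It assumes the county file has a `MedianListingPrice_AllHomes` column like the state and zip files do, and it restricts the map to Indiana -- swap in whatever column and state(s) you actually choose, and check `names(counties)` first.

[source,r]
----
library(usmap)
library(ggplot2)

counties <- read.csv("/class/datamine/data/zillow/County_time_series.csv")

# average the chosen statistic by county
# (RegionName holds the county FIPS code in this file)
avg_by_county <- aggregate(MedianListingPrice_AllHomes ~ RegionName,
                           data = counties, FUN = mean)

# plot_usmap expects the county identifier column to be named "fips";
# as in question (1), watch out for FIPS codes that lost a leading zero
names(avg_by_county)[names(avg_by_county) == "RegionName"] <- "fips"

plot_usmap(regions = "counties", include = "IN",
           data = avg_by_county, values = "MedianListingPrice_AllHomes") +
  labs(title = "Average median listing price by county (Indiana)")
----
====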
- -[TIP] -==== -Make sure that you remember to aggregate your data by `RegionName` so the plot renders correctly. -==== - -[TIP] -==== -`plot_usmap` looks for a column named `fips`. Make sure to rename the `RegionName` column to `fips` prior to passing the data.frame to `plot_usmap`. -==== - -[TIP] -==== -If you get Question 4 working correctly, here are the main differences for Question 5. You need the `regions` to be `"counties"` instead of `"states"`, and you need the `data.frame` to have a column called `fips` instead of `state`. These are the main differences between Question 4 and Question 5. -==== - -.Items to submit -==== -- R code used to solve the problem. -- The resulting map. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project12.adoc deleted file mode 100644 index 91bdd03ce..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project12.adoc +++ /dev/null @@ -1,160 +0,0 @@ -= STAT 19000: Project 12 -- Fall 2020 - -**Motivation:** In the previous project you were forced to do a little bit of date manipulation. Dates _can_ be very difficult to work with, regardless of the language you are using. `lubridate` is a package within the famous https://www.tidyverse.org/[tidyverse] that greatly simplifies some of the most common tasks one needs to perform with date data. - -**Context:** We've been reviewing topics learned this semester. In this project we will continue solving data-driven problems, wrangling data, and creating graphics. We will introduce a https://www.tidyverse.org/[tidyverse] package that adds great stand-alone value when working with dates. - -**Scope:** r - -.Learning objectives -**** -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Utilize apply functions in order to solve a data-driven problem. -- Gain proficiency using split, merge, and subset. -- Demonstrate the ability to create basic graphs with default settings. -- Demonstrate the ability to modify axes labels and titles. -- Incorporate legends using legend(). -- Demonstrate the ability to customize a plot (color, shape/linetype). -- Convert strings to dates, and format dates using the lubridate package. -**** - -== Questions - -=== Question 1 - -Let's continue our exploration of the Zillow time series data. A useful package for dealing with dates is called `lubridate`. This is part of the famous https://www.tidyverse.org/[tidyverse] suite of packages. Run the code below to load it. Read the `/class/datamine/data/zillow/State_time_series.csv` dataset into a data.frame named `states`. What class and type is the column `Date`? - -[source,r] ----- -library(lubridate) ----- - -.Items to submit -==== -- R code used to solve the question. -- `class` and `typeof` of column `Date`. -==== - -=== Question 2 - -Convert column `Date` to a corresponding date format using `lubridate`. Check that you correctly transformed it by checking its class like we did in question (1). Compare and contrast this method of conversion with the solution you came up with for question (3) in the previous project. Which method do you prefer? - -[TIP] -==== -Take a look at the following functions from `lubridate`: `ymd`, `mdy`, `dym`. -==== - -[TIP] -==== -Here is a video about `ymd`, `mdy`, `dym`. -==== - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the question. 
-- `class` of modified column `Date`. -- 1-2 sentences stating which method you prefer (if any) and why. -==== - -=== Question 3 - -Create 3 new columns in `states` called `year`, `month`, `day_of_week` (Sun-Sat) using `lubridate`. Get the frequency table for your newly created columns. Do we have the same amount of data for all years, for all months, and for all days of the week? We did something similar in question (3) in the previous project -- specifically, we broke each date down by year. Which method do you prefer and why? - -[TIP] -==== -Take a look at functions `month`, `year`, `day`, `wday`. -==== - -[TIP] -==== -You may find the argument of `label` in `wday` useful. -==== - -[TIP] -==== -Here is a video about `month`, `year`, `day`, `wday` -==== - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the question. -- Frequency table for newly created columns. -- 1-2 sentences answering whether or not we have the same amount of data for all years, months, and days of the week. -- 1-2 sentences stating which method you prefer (if any) and why. -==== - -=== Question 4 - -Is there a better month or set of months to put your house on the market? Use `tapply` to compare the average `DaysOnZillow_AllHomes` for all months. Make a barplot showing our results. Make sure your barplot includes "all of the fixings" (title, labeled axes, legend if necessary, etc. Make it look good.). - -[TIP] -==== -If you want to have the month's abbreviation in your plot, you may find both the `month.abb` object and the argument `names.arg` in `barplot` useful. -==== - -[TIP] -==== -This video might help with Question 4. -==== - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the question. -- The barplot of the average `DaysOnZillow_AllHomes` for all months. -- 1-2 sentences answering the question "Is there a better time to put your house on the market?" based on your results. -==== - -=== Question 5 - -Filter the `states` data to contain only years from 2010+ and call it `states2010plus`. Make a lineplot showing the average `DaysOnZillow_AllHomes` by `Date` using `states2010plus` data. Can you spot any trends? Write 1-2 sentences explaining what (if any) trends you see. - -.Items to submit -==== -- R code used to solve the question. -- The time series lineplot for the average `DaysOnZillow_AllHomes` per date. -- 1-2 sentences commenting on the patterns found in the plot, and your impressions of it. -==== - -=== Question 6 - -Do homes sell faster in certain states? For the following states: 'California', 'Indiana', 'NewYork' and 'Florida', make a lineplot for `DaysOnZillow_AllHomes` by `Date` with one line per state. Use the `states2010plus` dataset for this question. Make sure to have each state line colored differently, and to add a legend to your plot. Examine the plot and write 1-2 sentences about any observations you have. - -[TIP] -==== -You may want to use the `lines` function to add the lines for different state. -==== - -[TIP] -==== -Make sure to fix the y-axis limits using the `ylim` argument in `plot` to properly show all four lines. -==== - -[TIP] -==== -You may find the argument `col` useful to change the color of your line. -==== - -[TIP] -==== -To make your legend fit, consider using the states abbreviation, and the arguments `ncol` and `cex` of the `legend` function. -==== - -.Items to submit -==== -- R code used to solve the question. -- The time series lineplot for `DaysOnZillow_AllHomes` per date for the 4 states. 
-- 1-2 sentences commenting on the patterns found in the plot, and your answer to the question "Do homes sell faster in certain states rather than others?". -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project13.adoc deleted file mode 100644 index 2730d53bf..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project13.adoc +++ /dev/null @@ -1,198 +0,0 @@ -= STAT 19000: Project 13 -- Fall 2020 - -**Motivation:** It's important to be able to lookup and understand the documentation of a new function. You may have looked up the documentation of functions like `paste0` or `sapply`, and noticed that in the "usage" section, one of the arguments is an ellipsis (`...`). Well, unless you understand what this does, it's hard to really _get_ it. In this project, we will experiment with ellipsis, and write our own function that utilizes one. - -**Context:** We've learned about, used, and written functions in many projects this semester. In this project, we will utilize some of the less-known features of functions. - -**Scope:** r, functions - -.Learning objectives -**** -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Utilize apply functions in order to solve a data-driven problem. -- Gain proficiency using split, merge, and subset. -- Demostrate the ability to create basic graphs with default settings. -- Demonstratre the ability to modify axes labels and titles. -- Incorporate legends using legend(). -- Demonstrate the ability to customize a plot (color, shape/linetype). -- Convert strings to dates, and format dates using the lubridate package. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/beer/` - -== Questions - -=== Question 1 - -Read `/class/datamine/data/beer/beers.csv` into a data.frame named `beers`. Read `/class/datamine/data/beer/breweries.csv` into a data.frame named `breweries`. Read `/class/datamine/data/beer/reviews.csv` into a data.frame named `reviews`. - -[TIP] -==== -Notice that `reviews.csv` is a _large_ file. Luckily, you can use a function from the famous `data.table` package called `fread`. The function `fread` is _much_ faster at reading large file compared to `read.csv`. It reads the data into a class called `data.table`. We will learn more about this later on. For now, use `fread` to read in the `reviews.csv` data then convert it from the `data.table` class into a `data.frame` by wrapping the result of `fread` in the `data.frame` function. -==== - -[TIP] -==== -Do not forget to load the `data.table` library before attempeting to use the `fread` function. -==== - -Below we show you an example of how fast the `fread` function is compared to`read.csv`. - -[source,r] ----- -microbenchmark(read.csv("/class/datamine/data/beer/reviews.csv", nrows=100000), data.frame(fread("/class/datamine/data/beer/reviews.csv", nrows=100000)), times=5) ----- - -```{txt} -Unit: milliseconds -expr -read.csv("/class/datamine/data/beer/reviews.csv", nrows = 1e+05) -data.frame(fread("/class/datamine/data/beer/reviews.csv", nrows = 1e+05)) - min lq mean median uq max neval - 5948.6289 6482.3395 6746.8976 7040.5881 7086.6728 7176.2589 5 - 120.7705 122.3812 127.9842 128.7794 133.7695 134.2205 5 -``` - -[TIP] -==== -This video demonstrates how to read the `reviews` data using `fread`. 
-==== - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the problem. -==== - -=== Question 2 - -Take some time to explore the datasets. Like many datasets, our data is broken into 3 "tables". What columns connect each table? How many breweries in `breweries` don't have an associated beer in `beers`? How many beers in `beers` don't have an associated brewery in `breweries`? - -[TIP] -==== -We compare lists of names using `sum` or `intersect`. Similar techniques can be used for Question 2. -==== - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the problem. -- A description of columns which connect each of the files. -- How many breweries don't have an associated beer in `beers`. -- How many beers don't have an associated brewery in `breweries`. -==== - -=== Question 3 - -Run `?sapply` and look at the usage section for `sapply`. If you look at the description for the `...` argument, you'll see it is "optional arguments to `FUN`". What this means is you can specify additional input for the function you are passing to `sapply`. One example would be passing `T` to `na.rm` in the mean function: `sapply(dat, mean, na.rm=T)`. Use `sapply` and the `strsplit` function to separate the types of breweries (`types`) by commas. Use another `sapply` to loop through your results and count the number of types for each brewery. Be sure to name your final results `n_types`. What is the average amount of services (`n_types`) breweries in IN and MI offer (we are looking for the average of IN and MI _combined_)? Does that surprise you? - -[NOTE] -==== -When you have one `sapply` inside of another, or one loop inside of another, or an if/else statement inside of another, this is commonly referred to as nesting. So when Googling, you can type "nested sapply" or "nested if statements", etc. -==== - -[TIP] -==== -We show, in this video, how to find the average number of parts in a midwesterner's name. Perhaps surprisingly, this same technique will be useful in solving Question 3. -==== - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the question. -- 1-2 sentences answering the average amount of services breweries in Indiana and Michigan offer, and commenting on this answer. -==== - -=== Question 4 - -Write a function called `compare_beers` that accepts a function that you will call `FUN`, and any number of vectors of beer ids. The function, `compare_beers`, should cycle through each vector/groups of `beer_id`s, compute the function, `FUN`, on the subset of `reviews`, and print "Group X: some_score" where X is the number 1+, and some_score is the result of applying `FUN` on the subset of the `reviews` data. - -In the example below the function `FUN` is the `median` function and we have two vectors/groups of `beer_id`s passed with c(271781) being group 1 and c(125646, 82352) group 2. Note that even though our example only passes two vectors to our `compare_beers` function, we want to write the function in a way that we could pass as many vectors as we want to. - -Example: - -[source,r] ----- -compare_beers(reviews, median, c(271781), c(125646, 82352)) ----- - -This example gives the output: ----- -Group 1: 4 -Group 2: 4.56 ----- - -For your solution to this question, find the behavior of `compare_beers` in this example: -[source,r] ----- -compare_beers(reviews, median, c(88,92,7971), c(74986,1904), c(34,102,104,355)) ----- - -[TIP] -==== -There are different approaches to this question. You can use for loops or `sapply`. 
It will probably help to start small and build slowly toward the solution. -==== - -[TIP] -==== -This first video shows how to use `...` in defining a function. -==== - -++++ - -++++ - -[TIP] -==== -This second video basically walks students through how to build this function. If you use this video to learn how to build this function, please be sure to acknowledge this in your project solutions. -==== - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the problem. -- The result from running the provided example. -==== - -=== Question 5 - -Beer wars! IN and MI against AZ and CO. Use the function you wrote in question (4) to compare beer_id from each group of states. Make a cool plot of some sort. Be sure to comment on your plot. - -[TIP] -==== -Create a vector of `beer_ids` per group before passing it to your function from (4). -==== - -[TIP] -==== -This video demonstrates an example of how to use the `compare_beers` function. -==== - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the problem. -- The result from running your function. -- The resulting plot. -- 1-2 sentence commenting on your plot. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project14.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project14.adoc deleted file mode 100644 index 91f5e5a79..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project14.adoc +++ /dev/null @@ -1,131 +0,0 @@ -= STAT 19000: Project 14 -- Fall 2020 - -**Motivation:** Functions are the building blocks of more complex programming. It's vital that you understand how to read and write functions. In this project we will incrementally build and improve upon a function designed to recommend a beer. Note that you will not be winning any awards for this recommendation system, it is just for fun! - -**Context:** One of the main focuses throughout the semester has been on functions, and for good reason. In this project we will continue to exercise our R skills and build up our recommender function. - -**Scope:** r, functions - -.Learning objectives -**** -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Utilize apply functions in order to solve a data-driven problem. -- Gain proficiency using split, merge, and subset. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/beer/` - -== Questions - -=== Question 1 - -Read `/class/datamine/data/beer/beers.csv` into a data.frame named `beers`. Read `/class/datamine/data/beer/breweries.csv` into a data.frame named `breweries`. Read `/class/datamine/data/beer/reviews.csv` into a data.frame named `reviews`. As in the previous project, make sure you used the `fread` function from the `data.table` package, and convert the `data.table` to a `data.frame`. We want to create a very basic beer recommender. We will start simple. Create a function called `recommend_a_beer` that takes as input `my_beer_id` (a single value) and returns a vector of `beer_ids` from the same `style`. Test your function on `2093`. - -[TIP] -==== -Make sure you do not include the given `my_beer_id` in the vector of `beer_ids` containing the `beer_ids`of your recommended beers. -==== - -[TIP] -==== -You may find the function `setdiff` useful. Run the example below to get an idea of what it does. -==== - -[NOTE] -==== -You will not win any awards for this recommendation system! 
-==== - -[source,r] ----- -x <- c('a','b','b','c') -y <- c('c','b','d','e','f') -setdiff(x,y) -setdiff(y,x) ----- - -.Items to submit -==== -- R code used to solve the problem. -- Length of result from `recommend_a_beer(2093)`. -- The result of `2093 %in% recommend_a_beer(2093)`. -==== - -=== Question 2 - -That is a lot of beer recommendations! Let's try to narrow it down. Include an argument in your function called `min_score` with default value of 4.5. Our recommender will only recommend `beer_ids` with a review score of at least `min_score`. Test your improved beer recommender with the same `beer_id` from question (1). - -[TIP] -==== -Note that now we need to look at both `beers` and `reviews` datasets. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Length of result from `recommend_a_beer(2093)`. -==== - -=== Question 3 - -There is still room for improvement (obviously) for our beer recommender. Include a new argument in your function called `same_brewery_only` with default value `FALSE`. This argument will determine whether or not our beer recommender will return only beers from the same brewery. Test our newly improved beer recommender with the same `beer_id` from question (1) with the argument `same_brewery_only` set to `TRUE`. - -[TIP] -==== -You may find the function `intersect` useful. Run the example below to get an idea of what it does. - -[source,r] ----- -x <- c('a','b','b','c') -y <- c('c','b','d','e','f') -intersect(x,y) -intersect(y,x) ----- -==== - -.Items to submit -==== -- R code used to solve the problem. -- Length of result from `recommend_a_beer(2093, same_brewery_only=TRUE)`. -==== - -=== Question 4 - -Oops! Bad idea! Maybe including only beers from the same brewery is not the best idea. Add an argument to our beer recommender named `type`. If `type=style` our recommender will recommend beers based on the `style` as we did in question (3). If `type=reviewers`, our recommender will recommend beers based on reviewers with "similar taste". Select reviewers that gave score equal to or greater than `min_score` for the given beer id (`my_beer_id`). For those reviewers, find the `beer_ids` for other beers that these reviewers have given a score of at least `min_score`. These `beer_ids` are the ones our recommender will return. Be sure to test our improved recommender on the same `beer_id` as in (1)-(3). - -.Items to submit -==== -- R code used to solve the problem. -- Length of result from `recommend_a_beer(2093, type="reviewers")`. -==== - -=== Question 5 - -Let's try to narrow down the recommendations. Include an argument called `abv_range` that indicates the abv range we would like the recommended beers to be at. Set `abv_range` default value to `NULL` so that if a user does not specify the `abv_range` our recommender does not consider it. Test our recommender for `beer_id` 2093, with `abv_range = c(8.9,9.1)` and `min_score=4.9`. - -[TIP] -==== -You may find the function `is.null` useful. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Length of result from `recommend_a_beer(2093, abv_range=c(8.9, 9.1), type="reviewers", min_score=4.9)`. -==== - -=== Question 6 - -Play with our `recommend_a_beer` function. Include another feature to it. Some ideas are: putting a limit on the number of `beer_id`s we will return, error catching (what if we don't have reviews for a given `beer_id`?), including a plot to the output, returning beer names instead of ids or new arguments to decide what `beer_id`s to recommend. 
Be creative and have fun! - -.Items to submit -==== -- R code used to solve the problem. -- The result from running the improved `recommend_a_beer` function showcasing your improvements to it. -- 1-2 sentences commenting on what you decided to include and why. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project15.adoc b/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project15.adoc deleted file mode 100644 index 01008df94..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/19000/19000-f2020-project15.adoc +++ /dev/null @@ -1,130 +0,0 @@ -= STAT 19000: Project 15 -- Fall 2020 - -**Motivation:** Some people say it takes 20 hours to learn a skill, some say 10,000 hours. What is certain is that it takes time. In this project we will explore an interesting dataset and exercise some of the skills learned this semester. - -**Context:** This is the final project of the semester. We sincerely hope that you've learned something, and that we've provided you with first-hand experience digging through data. - -**Scope:** r - -.Learning objectives -**** -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Utilize apply functions in order to solve a data-driven problem. -- Gain proficiency using split, merge, and subset. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/donorschoose/` - -== Questions - -=== Question 1 - -Read the data `/class/datamine/data/donorschoose/Projects.csv` into a data.frame called `projects`. Make sure you use the function you learned in Project 13 (`fread`) from the `data.table` package to read the data. Don't forget to then convert the `data.table` into a `data.frame`. Let's do an initial exploration of this data. What types of projects (`Project.Type`) are there? How many resource categories (`Project.Resource.Category`) are there? - -.Items to submit -==== -- R code used to solve the question. -- 1-2 sentences containing the project types and how many resource categories are in the dataset. -==== - -=== Question 2 - -Create two new variables in `projects`, the number of days a project lasted and the number of days until the project was fully funded. Name those variables `project_duration` and `time_until_funded`, respectively. To calculate them, use the project's posted date (`Project.Posted.Date`), expiration date (`Project.Expiration.Date`), and fully funded date (`Project.Fully.Funded.Date`). What are the shortest and longest times until a project is fully funded? As a consistency check, see if any projects have a negative duration. If so, how many? - -[TIP] -==== -You _may_ find the argument `units` in the function `difftime` useful. -==== - -[TIP] -==== -Be sure to pay attention to the order of operations of `difftime`. -==== - -[TIP] -==== -Note that if you used the `fread` function from `data.table` to read in the data, you will not need to convert the columns to dates. -==== - -[TIP] -==== -It is _not_ required that you use `difftime`. -==== - -.Items to submit -==== -- R code used to solve the question. -- Shortest and longest times until a project is fully funded. -- 1-2 sentences answering whether we have any projects with negative duration, and if so, how many. -==== - -=== Question 3 - -As you noted in question (2), there may be some projects with a negative duration. 
As we may have some concerns for the data regarding these projects, filter the `projects` data to exclude the projects with negative duration, and call this filtered data `selected_projects`. With that filtered data, make a `dotchart` for mean time until the project is fully funded (`time_until_funded`) for the various resource categories (`Project.Resource.Category`). Make sure to comment on your results. Are they surprising? Could there be another variable influencing this result? If so, name at least one. - -[TIP] -==== -You will first need to average time until fully funded for the different categories before making your plot. -==== - -[TIP] -==== -To make your `dotchart` look nicer, you may want to first order the average time until fully funded before passing it to the `dotchart` function. In addition, consider reducing the y-axis font size using the argument `cex`. -==== - -.Items to submit -==== -- R code used to solve the question. -- Resulting dotchart. -- 1-2 sentences commenting on your plot. Make sure to mention whether you are surprised or not by the results. Don't forget to add if you think there could be more factors influencing your answer, and if so, be sure to give examples. -==== - -=== Question 4 - -Read `/class/datamine/data/donorschoose/Schools.csv` into a data.frame called `schools`. Combine `selected_projects` and `schools` by `School.ID` keeping only `School.ID`s present in both datasets. Name the combined data.frame `selected_projects`. Use the newly combined data to determine the percentage of already fully funded projects (`Project.Current.Status`) for schools in West Lafayette, IN. In addition, determine the state (`School.State`) with the highest number of projects. Be sure to specify the number of projects this state has. - -[TIP] -==== -West Lafayette, IN zip codes are 47906 and 47907. -==== - -.Items to submit -==== -- R code used to solve the question. -- 1-2 sentences answering the percentage of already fully funded projects for schools in West Lafayette, IN, the state with the highest number of projects, and the number of projects this state has. -==== - -=== Question 5 - -Using the combined `selected_projects` data, get the school(s) (`School.Name`), city/cities (`School.City`) and state(s) (`School.State`) for the teacher with the highest percentage of fully funded projects (`Project.Current.Status`). - -[TIP] -==== -There are many ways to solve this problem. For example, one option to get the teacher's ID is to create a variable indicating whether or not the project is fully funded and use `tapply`. Another option is to create `prop.table` and select the corresponding column/row. -==== - -[TIP] -==== -Note that each row in the data corresponds to a unique project ID. -==== - -[TIP] -==== -Once you have the teacher's ID, consider filtering `projects` to contain only rows for which the corresponding teacher's ID is in, and only the columns we are interested in: `School.Name`, `School.City`, and `School.State`. Then, you can get the unique values in this shortened data. -==== - -[TIP] -==== -To get only certain columns when subetting, you may find the argument `select` from `subset` useful. -==== - -.Items to submit -==== -- R code used to solve the question. -- Output of your code containing school(s), city(s) and state(s) of the selected teacher. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project01.adoc deleted file mode 100644 index 85637e788..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project01.adoc +++ /dev/null @@ -1,167 +0,0 @@ -= STAT 29000: Project 1 -- Fall 2020 - -**Motivation:** In this project we will jump right into an R review. In this project we are going to break one larger data-wrangling problem into discrete parts. There is a slight emphasis on writing functions and dealing with strings. At the end of this project we will have greatly simplified a dataset, making it easy to dig into. - -**Context:** We just started the semester and are digging into a large dataset, and in doing so, reviewing R concepts we've previously learned. - -**Scope:** data wrangling in R, functions - -.Learning objectives -**** -- Comprehend what a function is, and the components of a function in R. -- Read and write basic (csv) data. -- Utilize apply functions in order to solve a data-driven problem. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -You can find useful examples that walk you through relevant material in The Examples Book: - -https://thedatamine.github.io/the-examples-book - -It is highly recommended to read through, search, and explore these examples to help solve problems in this project. - -[IMPORTANT] -==== -It is highly recommended that you use https://rstudio.scholar.rcac.purdue.edu/. Simply click on the link and login using your Purdue account credentials. -==== - -We decided to move away from ThinLinc and away from the version of RStudio used last year (https://desktop.scholar.rcac.purdue.edu). That version of RStudio is known to have some strange issues when running code chunks. - -Remember the very useful documentation shortcut `?`. To use, simply type `?` in the console, followed by the name of the function you are interested in. - -You can also look for package documentation by using `help(package=PACKAGENAME)`, so for example, to see the documentation for the package `ggplot2`, we could run: - -[source,r] ----- -help(package=ggplot2) ----- - -Sometimes it can be helpful to see the source code of a defined function. A https://www.tutorialspoint.com/r/r_functions.htm[function] is any chunk of organized code that is used to perform an operation. Source code is the underlying `R` or `c` or `c++` code that is used to create the function. To see the source code of a defined function, type the function's name without the `()`. For example, if we were curious about what the function `Reduce` does, we could run: - -[source,r] ----- -Reduce ----- - -Occasionally this will be less useful as the resulting code will be code that calls `c` code we can't see. Other times it will allow you to understand the function better. - -== Dataset(s) - -`/class/datamine/data/airbnb` - -Often times (maybe even the majority of the time) data doesn't come in one nice file or database. Explore the datasets in `/class/datamine/data/airbnb`. - -== Questions - -=== Question 1 - -You may have noted that, for each country, city, and date we can find 3 files: `calendar.csv.gz`, `listings.csv.gz`, and `reviews.csv.gz` (for now, we will ignore all files in the "visualisations" folders). 
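[NOTE]
====
If you want a quick look at one of these files before committing to anything, `read.csv` can read the gzip-compressed files directly, and its `nrows` argument limits how much gets loaded. For example (borrowing the Rio de Janeiro calendar path that also appears in the example further down):

[source,r]
----
# peek at the first 50 rows of one calendar file without loading the whole thing
cal <- read.csv("/class/datamine/data/airbnb/brazil/rj/rio-de-janeiro/2019-06-19/data/calendar.csv.gz",
                nrows = 50)
head(cal)
str(cal)
----
====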
- -Let's take a look at the data in each of the three types of files. Pick a country, city and date, and read the first 50 rows of each of the 3 datasets (`calendar.csv.gz`, `listings.csv.gz`, and `reviews.csv.gz`). Provide 1-2 sentences explaining the type of information found in each, and what variable(s) could be used to join them. - -[TIP] -==== -`read.csv` has an argument to select the number of rows we want to read. -==== - -[TIP] -==== -Depending on the country that you pick, the listings and/or the reviews might not display properly in RMarkdown. So you do not need to display the first 50 rows of the listings and/or reviews, in your RMarkdown document. It is OK to just display the first 50 rows of the calendar entries. -==== - -.Items to submit -==== -- Chunk of code used to read the first 50 rows of each dataset. -- 1-2 sentences briefly describing the information contained in each dataset. -- Name(s) of variable(s) that could be used to join them. -==== - -To read a compressed csv, simply use the `read.csv` function: - -[source,r] ----- -dat <- read.csv("/class/datamine/data/airbnb/brazil/rj/rio-de-janeiro/2019-06-19/data/calendar.csv.gz") -head(dat) ----- - -Let's work towards getting this data into an easier format to analyze. From now on, we will focus on the `listings.csv.gz` datasets. - -=== Question 2 - -Write a function called `get_paths_for_country`, that, given a string with the country name, returns a vector with the full paths for all `listings.csv.gz` files, starting with `/class/datamine/data/airbnb/...`. - -For example, the output from `get_paths_for_country("united-states")` should have 28 entries. Here are the first 5 entries in the output: - ----- - [1] "/class/datamine/data/airbnb/united-states/ca/los-angeles/2019-07-08/data/listings.csv.gz" - [2] "/class/datamine/data/airbnb/united-states/ca/oakland/2019-07-13/data/listings.csv.gz" - [3] "/class/datamine/data/airbnb/united-states/ca/pacific-grove/2019-07-01/data/listings.csv.gz" - [4] "/class/datamine/data/airbnb/united-states/ca/san-diego/2019-07-14/data/listings.csv.gz" - [5] "/class/datamine/data/airbnb/united-states/ca/san-francisco/2019-07-08/data/listings.csv.gz" ----- - -[TIP] -==== -`list.files` is useful with the `recursive=T` option. -==== - -[TIP] -==== -Use `grep` to search for the pattern `listings.csv.gz` (within the results from the first hint), and use the option `value=T` to display the values found by the `grep` function. -==== - -.Items to submit -==== -- Chunk of code for your `get_paths_for_country` function. -==== - -=== Question 3 - -Write a function called `get_data_for_country` that, given a string with the country name, returns a data.frame containing the all listings data for that country. Use your previously written function to help you. - -[TIP] -==== -Use `stringsAsFactors=F` in the `read.csv` function. -==== - -[TIP] -==== -Use `do.call(rbind, )` to combine a list of dataframes into a single dataframe. -==== - -.Items to submit -==== -- Chunk of code for your `get_data_for_country` function. -==== - -=== Question 4 - -Use your `get_data_for_country` to get the data for a country of your choice, and make sure to name the data.frame `listings`. Take a look at the following columns: `host_is_superhost`, `host_has_profile_pic`, `host_identity_verified`, and `is_location_exact`. What is the data type for each column? (You can use `class` or `typeof` or `str` to see the data type.) - -These columns would make more sense as logical values (TRUE/FALSE/NA). 
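[NOTE]
====
To make the target concrete, here is the kind of conversion we are after, shown on a tiny hand-made vector (illustration only -- the function described next should do this for a whole column):

[source,r]
----
x <- c("t", "f", "", "t")

# blanks should become NA rather than FALSE, so handle them first
x[x == ""] <- NA
x == "t"   # TRUE FALSE NA TRUE
----
====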
- -Write a function called `transform_column` that, given a column containing lowercase "t"s and "f"s, your function will transform it to logical (TRUE/FALSE/NA) values. Note that NA values for these columns appear as blank (`""`), and we need to be careful when transforming the data. Test your function on column `host_is_superhost`. - -.Items to submit -==== -- Chunk of code for your `transform_column` function. -- Type of `transform_column(listings$host_is_superhost)`. -==== - -=== Question 5 - -Apply your function `transform_column` to the columns `instant_bookable` and `is_location_exact` in your `listings` data. - -Based on your `listings` data, if you are looking at an instant bookable listing (where `instant_bookable` is `TRUE`), would you expect the location to be exact (where `is_location_exact` is `TRUE`)? Why or why not? - -[TIP] -==== -Make a frequency table, and see how many instant bookable listings have exact location. -==== - -.Items to submit -==== -- Chunk of code to get a frequency table. -- 1-2 sentences explaining whether or not we would expect the location to be exact if we were looking at a instant bookable listing. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project02.adoc deleted file mode 100644 index 8000feb06..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project02.adoc +++ /dev/null @@ -1,167 +0,0 @@ -= STAT 29000: Project 2 -- Fall 2020 - -**Motivation:** The ability to quickly reproduce an analysis is important. It is often necessary that other individuals will need to be able to understand and reproduce an analysis. This concept is so important there are classes solely on reproducible research! In fact, there are papers that investigate and highlight the lack of reproducibility in various fields. If you are interested in reading about this topic, a good place to start is the paper titled https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124["Why Most Published Research Findings Are False"] by John Ioannidis (2005). - -**Context:** Making your work reproducible is extremely important. We will focus on the computational part of reproducibility. We will learn RMarkdown to document your analyses so others can easily understand and reproduce the computations that led to your conclusions. Pay close attention as future project templates will be RMarkdown templates. - -**Scope:** Understand Markdown, RMarkdown, and how to use it to make your data analysis reproducible. - -.Learning objectives -**** -- Use Markdown syntax within an Rmarkdown document to achieve various text transformations. -- Use RMarkdown code chunks to display and/or run snippets of code. -**** - -== Questions - -++++ - -++++ - -=== Question 1 - -Make the following text (including the asterisks) bold: `This needs to be **very** bold`. Make the following text (including the underscores) italicized: `This needs to be _very_ italicized.` - -[IMPORTANT] -==== -Surround your answer in 4 backticks. This will allow you to display the markdown _without_ having the markdown "take effect". For example: -==== - -`````markdown -```` -Some *marked* **up** text. 
-```` -````` - -[TIP] -==== -Be sure to check out the https://rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf[Rmarkdown Cheatsheet] and our section on https://thedatamine.github.io/the-examples-book/r.html#r-rmarkdown[Rmarkdown in the book]. -==== - -[NOTE] -==== -Rmarkdown is essentially Markdown + the ability to run and display code chunks. In this question, we are actually using Markdown within Rmarkdown! -==== - -.Items to submit -==== -- 2 lines of markdown text, surrounded by 4 backticks. Note that when compiled, this text will be unmodified, regular text. -==== - -=== Question 2 - -Create an unordered list of your top 3 favorite academic interests (some examples could include: machine learning, operating systems, forensic accounting, etc.). Create another *ordered* list that ranks your academic interests in order of most interested to least interested. - -[TIP] -==== -You can learn what ordered and unordered lists are https://rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf[here]. -==== - -[NOTE] -==== -Similar to (1), in this question we are dealing with Markdown. If we were to copy and paste the solution to this problem in a Markdown editor, it would be the same result as when we Knit it here. -==== - -.Items to submit -==== -- Create the lists, this time don't surround your code in backticks. Note that when compiled, this text will appear as nice, formatted lists. -==== - -=== Question 3 - -Browse https://www.linkedin.com/ and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown. Include the following: - -- A header for this section (your choice of size) that says "About". -- The text of your personal "About" section that you would feel comfortable uploading to linkedin, including at least 1 link. - -.Items to submit -==== -- Create the described profile, don't surround your code in backticks. -==== - -=== Question 4 - -Your co-worker wrote a report, and has asked you to beautify it. Knowing Rmarkdown, you agreed. Make improvements to this section. At a minimum: - -- Make the title pronounced. -- Make all links appear as a word or words, rather than the long-form URL. -- Organize all code into code chunks where code and output are displayed. If the output is really long, just display the code. -- Make the calls to the `library` function be evaluated but not displayed. -- Make sure all warnings and errors that may eventually occur, do not appear in the final document. - -Feel free to make any other changes that make the report more visually pleasing. - -````markdown -`r ''````{r my-load-packages} -library(ggplot2) -``` - -`r ''````{r declare-variable-290, eval=FALSE} -my_variable <- c(1,2,3) -``` - -All About the Iris Dataset - -This paper goes into detail about the `iris` dataset that is built into r. You can find a list of built-in datasets by visiting https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html or by running the following code: - -data() - -The iris dataset has 5 columns. You can get the names of the columns by running the following code: - -names(iris) - -Alternatively, you could just run the following code: - -iris - -The second option provides more detail about the dataset. - -According to https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html[this R manual], there is another dataset built-in to R called `iris3`. This dataset is 3 dimensional instead of 2 dimensional. - -An iris is a really pretty flower. 
You can see a picture of one here: - -https://www.gardenia.net/storage/app/public/guides/detail/83847060_mOptimized.jpg - -In summary. I really like irises, and there is a dataset in R called `iris`. -```` - -.Items to submit -==== -- Make improvements to this section, and place it all under the Question 4 header in your template. -==== - -=== Question 5 - -Create a plot using a built-in dataset like `iris`, `mtcars`, or `Titanic`, and display the plot using a code chunk. Make sure the code used to generate the plot is hidden. Include a descriptive caption for the image. Make sure to use an RMarkdown chunk option to create the caption. - -.Items to submit -==== -- Code chunk under that creates and displays a plot using a built-in dataset like `iris`, `mtcars`, or `Titanic`. -==== - -=== Question 6 - -Insert the following code chunk under the Question 6 header in your template. Try knitting the document. Two things will go wrong. What is the first problem? What is the second problem? - -````markdown -```{r my-load-packages}`r ''` -plot(my_variable) -``` -```` - -[TIP] -==== -Take a close look at the name we give our code chunk. -==== - -[TIP] -==== -Take a look at the code chunk where `my_variable` is declared. -==== - -.Items to submit -==== -- The modified version of the inserted code that fixes both problems. -- A sentence explaining what the first problem was. -- A sentence explaining what the second problem was. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project03.adoc deleted file mode 100644 index f331ec5d4..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project03.adoc +++ /dev/null @@ -1,212 +0,0 @@ -= STAT 29000: Project 3 -- Fall 2020 - -**Motivation:** The ability to navigate a shell, like `bash`, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful `bash` tools, help you navigate a filesystem, and even run `bash` tools from within an RMarkdown file in RStudio. - -**Context:** At this point in time, you will each have varying levels of familiarity with Scholar. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within RStudio in an RMarkdown file. - -**Scope:** bash, RStudio - -.Learning objectives -**** -- Distinguish differences in /home, /scratch, and /class. -- Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc. -- Analyzing file in a UNIX filesystem: wc, du, cat, head, tail, etc. -- Creating and destroying files and folder in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc. -- Utilize other Scholar resources: rstudio.scholar.rcac.purdue.edu, notebook.scholar.rcac.purdue.edu, desktop.scholar.rcac.purdue.edu, etc. -- Use `man` to read and learn about UNIX utilities. -- Run `bash` commands from within and RMarkdown file in RStudio. -**** - -There are a variety of ways to connect to Scholar. In this class, we will _primarily_ connect to RStudio Server by opening a browser and navigating to https://rstudio.scholar.rcac.purdue.edu/, entering credentials, and using the excellent RStudio interface. 
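As a quick warm-up, here is a minimal, hedged sketch of the kinds of utilities listed in the learning objectives above. The folder and file names are hypothetical placeholders, not anything used in this project.

[source,bash]
----
# a minimal sketch; my_folder and notes.txt are hypothetical placeholders
pwd                                    # print the current (working) directory
ls                                     # list the files and folders it contains
mkdir my_folder                        # create a new folder
touch my_folder/notes.txt              # create an empty file inside it
cp my_folder/notes.txt notes_copy.txt  # copy the file
head notes_copy.txt                    # peek at its first lines
wc -l notes_copy.txt                   # count its lines
du -h notes_copy.txt                   # check how much space it uses
rm notes_copy.txt                      # remove the copy
rm -r my_folder                        # remove the folder and its contents
man ls                                 # read the manual page for any of these commands
----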
- -Here is a video to remind you about some of the basic tools you can use in UNIX/Linux: - -++++ - -++++ - -This is the easiest book for learning this stuff; it is short and gets right to the point: - -https://learning.oreilly.com/library/view/learning-the-unix/0596002610 - -You just log in and you can see it all; we suggest Chapters 1, 3, 4, 5, 7 (you can basically skip chapters 2 and 6 the first time through). - -It is a very short read (maybe, say, 2 or 3 hours altogether?), just a thin book that gets right to the details. - -== Questions - -=== Question 1 - -Navigate to https://rstudio.scholar.rcac.purdue.edu/ and login. Take some time to click around and explore this tool. We will be writing and running Python, R, SQL, and `bash` all from within this interface. Navigate to `Tools > Global Options ...`. Explore this interface and make at least 2 modifications. List what you changed. - -Here are some changes Kevin likes: - -- Uncheck "Restore .Rdata into workspace at startup". -- Change tab width 4. -- Check "Soft-wrap R source files". -- Check "Highlight selected line". -- Check "Strip trailing horizontal whitespace when saving". -- Uncheck "Show margin". - -(Dr Ward does not like to customize his own environment, but he does use the emacs key bindings: Tools > Global Options > Code > Keybindings, but this is only recommended if you already know emacs.) - -.Items to submit -==== -- List of modifications you made to your Global Options. -==== - -=== Question 2 - -There are four primary panes, each with various tabs. In one of the panes there will be a tab labeled "Terminal". Click on that tab. This terminal by default will run a `bash` shell right within Scholar, the same as if you connected to Scholar using ThinLinc, and opened a terminal. Very convenient! - -What is the default directory of your bash shell? - -[TIP] -==== -Start by reading the section on `man`. `man` stands for manual, and you can find the "official" documentation for the command by typing `man `. For example: -==== - -```{bash, eval=F} -# read the manual for the `man` command -# use "k" or the up arrow to scroll up, "j" or the down arrow to scroll down -man man -``` - -.Items to submit -==== -- The full filepath of default directory (home directory). Ex: Kevin's is: `/home/kamstut` -- The `bash` code used to show your home directory or current directory (also known as the working directory) when the `bash` shell is first launched. -==== - -=== Question 3 - -Learning to navigate away from our home directory to other folders, and back again, is vital. Perform the following actions, in order: - -- Write a single command to navigate to the folder containing our full datasets: `/class/datamine/data`. -- Write a command to confirm you are in the correct folder. -- Write a command to list the files and directories within the data directory. (You do not need to recursively list subdirectories and files contained therein.) What are the names of the files and directories? -- Write another command to return back to your home directory. -- Write a command to confirm you are in the correct folder. - -Note: `/` is commonly referred to as the root directory in a linux/unix filesystem. Think of it as a folder that contains _every_ other folder in the computer. `/home` is a folder within the root directory. `/home/kamstut` is the full filepath of Kevin's home directory. There is a folder `home` inside the root directory. Inside `home` is another folder named `kamstut` which is Kevin's home directory. 
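If it helps to see the general shape of such a session first, here is a sketch that uses a hypothetical folder (not the data directory itself), so the commands and output for the actual question are left for you to produce:

[source,bash]
----
# sketch only -- /depot/example is a hypothetical folder, not the data directory
cd /depot/example   # navigate there using an absolute path
pwd                 # confirm the current (working) directory
ls                  # list the files and folders it contains
cd                  # with no argument, cd returns to your home directory
pwd                 # confirm you are back in your home directory
----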
- -.Items to submit -==== -- Command used to navigate to the data directory. -- Command used to confirm you are in the data directory. -- Command used to list files and folders. -- List of files and folders in the data directory. -- Command used to navigate back to the home directory. -- Command used to confirm you are in the home directory. -==== - -=== Question 4 - -Let's learn about two more important concepts. `.` refers to the current working directory, or the directory displayed when you run `pwd`. Unlike `pwd` you can use this when navigating the filesystem! So, for example, if you wanted to see the contents of a file called `my_file.txt` that lives in `/home/kamstut` (so, a full path of `/home/kamstut/my_file.txt`), and you are currently in `/home/kamstut`, you could run: `cat ./my_file.txt`. - -`..` represents the parent folder or the folder in which your current folder is contained. So let's say I was in `/home/kamstut/projects/` and I wanted to get the contents of the file `/home/kamstut/my_file.txt`. You could do: `cat ../my_file.txt`. - -When you navigate a directory tree using `.` and `..` you create paths that are called _relative_ paths because they are _relative_ to your current directory. Alternatively, a _full_ path or (_absolute_ path) is the path starting from the root directory. So `/home/kamstut/my_file.txt` is the _absolute_ path for `my_file.txt` and `../my_file.txt` is a _relative_ path. Perform the following actions, in order: - -- Write a single command to navigate to the data directory. -- Write a single command to navigate back to your home directory using a _relative_ path. Do not use `~` or the `cd` command without a path argument. - -.Items to submit -==== -- Command used to navigate to the data directory. -- Command used to navigate back to your home directory that uses a _relative_ path. -==== - -=== Question 5 - -In Scholar, when you want to deal with _really_ large amounts of data, you want to access scratch (you can read more https://www.rcac.purdue.edu/policies/scholar/[here]). Your scratch directory on Scholar is located here: `/scratch/scholar/$USER`. `$USER` is an environment variable containing your username. Test it out: `echo /scratch/scholar/$USER`. Perform the following actions: - -- Navigate to your scratch directory. -- Confirm you are in the correct location. -- Execute `myquota`. -- Find the location of the `myquota` bash script. -- Output the first 5 and last 5 lines of the bash script. -- Count the number of lines in the bash script. -- How many kilobytes is the script? - -[TIP] -==== -You could use each of the commands in the relevant topics once. -==== - -[TIP] -==== -When you type `myquota` on Scholar there are sometimes two warnings about `xauth` but sometimes there are no warnings. If you get a warning that says `Warning: untrusted X11 forwarding setup failed: xauth key data not generated` it is safe to ignore this error. -==== - -[TIP] -==== -Commands often have _options_. _Options_ are features of the program that you can trigger specifically. You can see the _options_ of a command in the `DESCRIPTION` section of the `man` pages. For example: `man wc`. You can see `-m`, `-l`, and `-w` are all options for `wc`. To test this out: -==== - -```{bash, eval=F} -# using the default wc command. "/class/datamine/data/flights/1987.csv" is the first "argument" given to the command. 
-wc /class/datamine/data/flights/1987.csv -# to count the lines, use the -l option -wc -l /class/datamine/data/flights/1987.csv -# to count the words, use the -w option -wc -w /class/datamine/data/flights/1987.csv -# you can combine options as well -wc -w -l /class/datamine/data/flights/1987.csv -# some people like to use a single tack `-` -wc -wl /class/datamine/data/flights/1987.csv -# order doesn't matter -wc -lw /class/datamine/data/flights/1987.csv -``` - -[TIP] -==== -The `-h` option for the `du` command is useful. -==== - -.Items to submit -==== -- Command used to navigate to your scratch directory. -- Command used to confirm your location. -- Output of `myquota`. -- Command used to find the location of the `myquota` script. -- Absolute path of the `myquota` script. -- Command used to output the first 5 lines of the `myquota` script. -- Command used to output the last 5 lines of the `myquota` script. -- Command used to find the number of lines in the `myquota` script. -- Number of lines in the script. -- Command used to find out how many kilobytes the script is. -- Number of kilobytes that the script takes up. -==== - -=== Question 6 - -Perform the following operations: - -- Navigate to your scratch directory. -- Copy and paste the file: `/class/datamine/data/flights/1987.csv` to your current directory (scratch). -- Create a new directory called `my_test_dir` in your scratch folder. -- Move the file you copied to your scratch directory, into your new folder. -- Use `touch` to create an empty file named `im_empty.txt` in your scratch folder. -- Remove the directory `my_test_dir` _and_ the contents of the directory. -- Remove the `im_empty.txt` file. - -[TIP] -==== -`rmdir` may not be able to do what you think, instead, check out the options for `rm` using `man rm`. -==== - -.Items to submit -==== -- Command used to navigate to your scratch directory. -- Command used to copy the file, `/class/datamine/data/flights/1987.csv` to your current directory (scratch). -- Command used to create a new directory called `my_test_dir` in your scratch folder. -- Command used to move the file you copied earlier `1987.csv` into your new `my_test_dir` folder. -- Command used to create an empty file named `im_empty.txt` in your scratch folder. -- Command used to remove the directory _and_ the contents of the directory `my_test_dir`. -- Command used to remove the `im_empty.txt` file. -==== - -=== Question 7 - -Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." or if you disagree with this statement, please consult with us at datamine@purdue.edu for an alternative plan. diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project04.adoc deleted file mode 100644 index 5b163c8f7..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project04.adoc +++ /dev/null @@ -1,174 +0,0 @@ -= STAT 29000: Project 4 -- Fall 2020 - -**Motivation:** The need to search files and datasets based on the text held within is common during various parts of the data wrangling process. `grep` is an extremely powerful UNIX tool that allows you to do so using regular expressions. Regular expressions are a structured method for searching for specified patterns. 
Regular expressions can be very complicated, https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/[even professionals can make critical mistakes]. With that being said, learning some of the basics is an incredible tool that will come in handy regardless of the language you are working in. - -**Context:** We've just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, `grep`, and experiment with regular expressions using `grep`, R, and later on, Python. - -**Scope:** grep, regular expression basics, utilizing regular expression tools in R and Python - -.Learning objectives -**** -- Use `grep` to search for patterns within a dataset. -- Use `cut` to section off and slice up data from the command line. -- Use `wc` to count the number of lines of input. -**** - -You can find useful examples that walk you through relevant material in The Examples Book: - -https://thedatamine.github.io/the-examples-book - -It is highly recommended to read through, search, and explore these examples to help solve problems in this project. - -[IMPORTANT] -==== -I would highly recommend using single quotes `'` to surround your regular expressions. Double quotes can have unexpected behavior due to some shell's expansion rules. In addition, pay close attention to #faq-escape-characters[escaping] certain https://unix.stackexchange.com/questions/20804/in-a-regular-expression-which-characters-need-escaping[characters] in your regular expressions. -==== - -== Dataset - -The following questions will use the dataset `the_office_dialogue.csv` found in Scholar under the data directory `/class/datamine/data/`. A public sample of the data can be found here: https://www.datadepot.rcac.purdue.edu/datamine/data/movies-and-tv/the_office_dialogue.csv[the_office_dialogue.csv] - -Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset. - -`grep` stands for _**g**lobally_ search for a _**r**egular_ _**e**xpression_ and _**p**rint_ matching lines. As such, to best demonstrate `grep`, we will be using it with textual data. You can read about and see examples of `grep` https://thedatamine.github.io/the-examples-book/unix.html#grep[here]. - -== Questions - -=== Question 1 - -Login to Scholar and use `grep` to find the dataset we will use this project. The dataset we will use is the only dataset to have the text "Bears. Beets. Battlestar Galactica." Where is it located exactly? - -.Items to submit -==== -- The `grep` command used to find the dataset. -- The name and location in Scholar of the dataset. -==== - -=== Question 2 - -`grep` prints the line that the text you are searching for appears in. In project 3 we learned a UNIX command to quickly print the first _n_ lines from a file. Use this command to get the headers for the dataset. As you can see, each line in the tv show is a row in the dataset. You can count to see which column the various bits of data live in. - -Write a line of UNIX commands that searches for "bears. beets. battlestar galactica." and, rather than printing the entire line, prints only the character who speaks the line, as well as the line itself. - -[TIP] -==== -The result if you were to search for "bears. beets. battlestar galactica." should be: - ----- -"Jim","Fact. Bears eat beets. Bears. Beets. Battlestar Galactica." 
----- -==== - -[TIP] -==== -One method to solve this problem would be to https://thedatamine.github.io/the-examples-book/unix.html#piping-and-redirection[pipe] -the output from `grep` to https://thedatamine.github.io/the-examples-book/unix.html#cut[cut]. -==== - -.Items to submit -==== -- The line of UNIX commands used to find the character and original dialogue line that contains "bears. beets. battlestar galactica.". -==== - -=== Question 3 - -This particular dataset happens to be very small. You could imagine a scenario where the file is many gigabytes and not easy to load completely into R or Python. We are interested in learning what makes Jim and Pam tick as a couple. Use a line of UNIX commands to create a new dataset called `jim_and_pam.csv` (remember, a good place to store data temporarily is `/scratch/scholar/$USER`). Include only lines that are spoken by either Jim or Pam, or reference Jim or Pam in any way. How many rows of data are in the new file? How many megabytes is the new file (to the nearest 1/10th of a megabyte)? - -[TIP] -==== -https://thedatamine.github.io/the-examples-book/unix.html#piping-and-redirection[Redirection]. -==== - -[TIP] -==== -It is OK if you get an erroneous line where the word "jim" or "pam" appears as a part of another word. -==== - -.Items to submit -==== -- The line of UNIX commands used to create the new file. -- The number of rows of data in the new file, and the accompanying UNIX command used to find this out. -- The number of megabytes (to the nearest 1/10th of a megabyte) that the new file has, and the accompanying UNIX command used to find this out. -==== - -=== Question 4 - -Find all lines where either Jim/Pam/Michael/Dwight's name is followed by an exclamation mark. Use only 1 "!" within your regular expression. How many lines are there? Ignore case (whether or not parts of the names are capitalized or not). - -.Items to submit -==== -- The UNIX command(s) used to solve this problem. -- The number of lines where either Jim/Pam/Michael/Dwight's name is followed by an exclamation mark. -==== - -=== Question 5 - -Find all lines that contain the text "that's what" followed by any amount of any text and then "said". How many lines are there? - -.Items to submit -==== -- The UNIX command used to solve this problem. -- The number of lines that contain the text "that's what" followed by any amount of text and then "said". -==== - -Regular expressions are really a useful semi language-agnostic tool. What this means is regardless of the programming language your are using, there will be some package that allows you to use regular expressions. In fact, we can use them in both R and Python! This can be particularly useful when dealing with strings. Load up the dataset you discovered in (1) using `read.csv`. Name the resulting data.frame `dat`. - -=== Question 6 - -The `text_w_direction` column in `dat` contains the characters' lines with inserted direction that helps characters know what to do as they are reciting the lines. Direction is shown between square brackets "[" "]". In this two-part question, we are going to use regular expression to detect the directions. - -(a) Create a new column called `has_direction` that is set to `TRUE` if the `text_w_direction` column has direction, and `FALSE` otherwise. Use the `grepl` function in R to accomplish this. - -[TIP] -==== -Make sure all opening brackets "[" have a corresponding closing bracket "]". 
-==== - -[TIP] -==== -Think of the pattern as any line that has a [, followed by any amount of any text, followed by a ], followed by any amount of any text. -==== - -(b) Modify your regular expression to find lines with 2 or more sets of direction. How many lines have more than 2 directions? Modify your code again and find how many have more than 5. - -We count the sets of direction in each line by the pairs of square brackets. The following are two simple example sentences. - ----- -This is a line with [emphasize this] only 1 direction! -This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug]. ----- - -Your solution to part (a) should find both lines a match. However, in part (b) we want the regular expression pattern to find only lines with 2+ directions, so the first line would not be a match. - -In our actual dataset, for example, `dat$text_w_direction[2789]` is a line with 2 directions. - -.Items to submit -==== -- The R code and regular expression used to solve the first part of this problem. -- The R code and regular expression used to solve the second part of this problem. -- How many lines have >= 2 directions? -- How many lines have >= 5 directions? -==== - -=== OPTIONAL QUESTION - -Use the `str_extract_all` function from the `stringr` package to extract the direction(s) as well as the text between direction(s) from each line. Put the strings in a new column called `direction`. - ----- -This is a line with [emphasize this] only 1 direction! -This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug]. ----- - -In this question, your solution may have extracted: - ----- -[emphasize this] -[emphasize this] 2 sets of direction, do you see the difference [shrug] ----- - -(It is okay to keep the text between neighboring pairs of "[" and "]" for the second line.) - -.Items to submit -==== -- The R code used to solve this problem. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project05.adoc deleted file mode 100644 index adce59862..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project05.adoc +++ /dev/null @@ -1,159 +0,0 @@ -= STAT 29000: Project 5 -- Fall 2020 - -**Motivation:** Becoming comfortable stringing together commands and getting used to navigating files in a terminal is important for every data scientist to do. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc. - -**Context:** We've been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping. - -**Scope:** grep, regular expression basics, UNIX utilities, redirection, piping - -.Learning objectives -**** -- Use `cut` to section off and slice up data from the command line. -- Use piping to string UNIX commands together. -- Use `sort` and it's options to sort data in different ways. -- Use `head` to isolate _n_ lines of output. -- Use `wc` to summarize the number of lines in a file or in output. -- Use `uniq` to filter out non-unique lines. -- Use `grep` to search files effectively. 
-**** - -You can find useful examples that walk you through relevant material in The Examples Book: - -https://thedatamine.github.io/the-examples-book - -It is highly recommended to read through, search, and explore these examples to help solve problems in this project. - -Don't forget the very useful documentation shortcut `?` for R code. To use, simply type `?` in the console, followed by the name of the function you are interested in. In the Terminal, you can use the `man` command to check the documentation of `bash` code. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/amazon/amazon_fine_food_reviews.csv` - -A public sample of the data can be found here: https://www.datadepot.rcac.purdue.edu/datamine/data/amazon/amazon_fine_food_reviews.csv[amazon_fine_food_reviews.csv] - -Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset. - -Here are three videos that might also be useful, as you work on Project 5: - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -== Questions - -=== Question 1 - -What is the `Id` of the most helpful review, according to the highest `HelpfulnessNumerator`? - -[IMPORTANT] -==== -You can always pipe output to `head` in case you want the first few values of a lot of output. Note that if you used `sort` before `head`, you may see the following error messages: - ----- -sort: write failed: standard output: Broken pipe -sort: write error ----- - -This is because `head` would truncate the output from `sort`. This is okay. See https://stackoverflow.com/questions/46202653/bash-error-in-sort-sort-write-failed-standard-output-broken-pipe[this discussion] for more details. -==== - -.Items to submit -==== -- Line of UNIX commands used to solve the problem. -- The `Id` of the most helpful review. -==== - -=== Question 2 - -Some entries under the `Summary` column appear more than once. Calculate the proportion of unique summaries over the total number of summaries. Use two lines of UNIX commands to find the numerator and the denominator, and manually calculate the proportion. - -To further clarify what we mean by _unique_, if we had the following vector in R, `c("a", "b", "a", "c")`, its unique values are `c("a", "b", "c")`. - -.Items to submit -==== -- Two lines of UNIX commands used to solve the problem. -- The ratio of unique `Summary`'s. -==== - -=== Question 3 - -Use a chain of UNIX commands, piped in a sequence, to create a frequency table of `Score`. - -.Items to submit -==== -- The line of UNIX commands used to solve the problem. -- The frequency table. -==== - -=== Question 4 - -Who is the user with the highest number of reviews? There are two columns you could use to answer this question, but which column do you think would be most appropriate and why? - -[TIP] -==== -You may need to pipe the output to `sort` multiple times. -==== - -[TIP] -==== -To create the frequency table, read through the `man` pages for `uniq`. Man pages are the "manual" pages for UNIX commands. You can read through the man pages for uniq by running the following: - -[source,bash] ----- -man uniq ----- -==== - -.Items to submit -==== -- The line of UNIX commands used to solve the problem. -- The frequency table. -==== - -=== Question 5 - -Anecdotally, there seems to be a tendency to leave reviews when we feel strongly (either positive or negative) about a product. 
For the user with the highest number of reviews (i.e., the user identified in question 4), would you say that they follow this pattern of extremes? Let's consider 5 star reviews to be strongly positive and 1 star reviews to be strongly negative. Let's consider anything in between neither strongly positive nor negative. - -[TIP] -==== -You may find the solution to problem (3) useful. -==== - -.Items to submit -==== -- The line of UNIX commands used to solve the problem. -==== - -=== Question 6 - -Find the most helpful review with a `Score` of 5. Then (separately) find the most helpful review with a `Score` of 1. As before, we are considering the most helpful review to be the review with the highest `HelpfulnessNumerator`. - -[TIP] -==== -You can use multiple lines to solve this problem. -==== - -.Items to submit -==== -- The lines of UNIX commands used to solve the problem. -- `ProductId`'s of both requested reviews. -==== - -=== OPTIONAL QUESTION - -For **only** the two `ProductId`s from the previous question, create a new dataset called `scores.csv` that contains the `ProductId`s and `Score`s from all reviews for these two items. - -.Items to submit -==== -- The line of UNIX commands used to solve the problem. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project06.adoc deleted file mode 100644 index 2239a1f19..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project06.adoc +++ /dev/null @@ -1,211 +0,0 @@ -= STAT 29000: Project 6 -- Fall 2020 - -**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. `awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks. - -**Context:** This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently. - -**Scope:** awk, UNIX utilities, bash scripts - -.Learning objectives -**** -- Use `awk` to process and manipulate textual data. -- Use piping and redirection within the terminal to pass around data between utilities. -**** - -== Dataset - -The following questions will use the dataset found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/YYYY.csv[here] or in Scholar: - -`/class/datamine/data/flights/subset/YYYY.csv` - -An example from 1987 data can be found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/1987.csv[here] or in Scholar: - -`/class/datamine/data/flights/subset/1987.csv` - -== Questions - -[IMPORTANT] -==== -Please make sure to **double check** that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. -==== - -[IMPORTANT] -==== -Please make sure to look at your knit PDF *before* submitting. 
PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to lose points. -==== - - -=== Question 1 - -In previous projects we learned how to get a single column of data from a csv file. Write 1 line of UNIX commands to print the 17th column, the `Origin`, from `1987.csv`. Write another line, this time using `awk` to do the same thing. Which one do you prefer, and why? - -Here is an example, from a different data set, to illustrate some differences and similarities between cut and awk: - -++++ - -++++ - -.Items to submit -==== -- One line of UNIX commands to solve the problem *without* using `awk`. -- One line of UNIX commands to solve the problem using `awk`. -- 1-2 sentences describing which method you prefer and why. -==== - -=== Question 2 - -Write a bash script that accepts a year (1987, 1988, etc.) and a column *n* and returns the *nth* column of the associated year of data. - -Here are two examples to illustrate how to write a bash script: - -++++ - -++++ - -++++ - -++++ - -[TIP] -==== -In this example, you only need to turn in the content of your bash script (starting with `#!/bin/bash`) without evaluation in a code chunk. However, you should test your script before submission to make sure it works. To actually test out your bash script, take the following example. The script is simple and just prints out the first two arguments given to it: - -[source,bash] ----- -#!/bin/bash -echo "First argument: $1" -echo "Second argument: $2" ----- -==== - -If you simply drop that text into a file called `my_script.sh`, located here: `/home/$USER/my_script.sh`, and if you run the following: - -[source,bash] ----- -# Setup bash to run; this only needs to be run one time per session. -# It makes bash behave a little more naturally in RStudio. -exec bash -# Navigate to the location of my_script.sh -cd /home/$USER -# Make sure that the script is runable. -# This only needs to be done one time for each new script that you write. -chmod 755 my_script.sh -# Execute my_script.sh -./my_script.sh okay cool ----- - -then it will print: - ----- -First argument: okay -Second argument: cool ----- - -In this example, if we were to turn in the "content of your bash script (starting with `#!/bin/bash`) in a code chunk, our solution would look like this: - -[source,bash] ----- -#!/bin/bash -echo "First argument: $1" -echo "Second argument: $2" ----- - -And although we aren't running the code chunk above, we know that it works because we tested it in the terminal. - -[TIP] -==== -Using `awk` you could have a script with just two lines: 1 with the "hash-bang" (`#!/bin/bash`), and 1 with a single `awk` command. -==== - -.Items to submit -==== -- The content of your bash script (starting with `#!/bin/bash`) in a code chunk. -==== - -=== Question 3 - -How many flights arrived at Indianapolis (IND) in 2008? First solve this problem without using `awk`, then solve this problem using *only* `awk`. - -Here is a similar example, using the election data set: - -++++ - -++++ - -.Items to submit -==== -- One line of UNIX commands to solve the problem *without* using `awk`. -- One line of UNIX commands to solve the problem using `awk`. -- The number of flights that arrived at Indianapolis (IND) in 2008. -==== - -=== Question 4 - -Do you expect the number of unique origins and destinations to be the same based on flight data in the year 2008? 
Find out, using any command line tool you'd like. Are they indeed the same? How many unique values do we have per category (`Origin`, `Dest`)? - -Here is an example to help you with the last part of the question, about Origin-to-Destination pairs. We analyze the city-state pairs from the election data: - -++++ - -++++ - -.Items to submit -==== -- 1-2 sentences explaining whether or not you expect the number of unique origins and destinations to be the same. -- The UNIX command(s) used to figure out if the number of unique origins and destinations are the same. -- The number of unique values per category (`Origin`, `Dest`). -==== - -=== Question 5 - -In (4) we found that there are not the same number of unique `Origin` as `Dest`. Find the https://en.wikipedia.org/wiki/International_Air_Transport_Association_code#Airport_codes[IATA airport code] for all `Origin` that don't appear in a `Dest` and all `Dest` that don't appear in an `Origin` in the 2008 data. - -[TIP] -==== -The examples on https://www.tutorialspoint.com/unix_commands/comm.htm[this page] should help. Note that these examples are based on https://tldp.org/LDP/abs/html/process-sub.html[Process Substitution] , which basically allows you to specify commands whose output would be used as the input of `comm`. There should be no space between the open bracket and open parenthesis, otherwise your bash will not work as intended. -==== - -.Items to submit -==== -- The line(s) of UNIX command(s) used to answer the question. -- The list of all `Origin` that don't appear in `Dest`. -- The list of all `Dest` that don't appear in `Origin`. -==== - -=== Question 6 - -What was the percentage of flights in 2008 per unique `Origin` with the `Dest` of "IND"? What percentage of flights had "PHX" as `Origin` (among all flights with `Dest` of "IND")? - -Here is an example using the percentages of donations contributed from CEOs from various States: - -++++ - -++++ - -[TIP] -==== -You can do the mean calculation in awk by dividing the result from (3) by the number of unique `Origin` that have a `Dest` of "IND". -==== - -.Items to submit -==== -- The percentage of flights in 2008 per unique `Origin` with the `Dest` of "IND". -- 1-2 sentences explaining how "PHX" compares (as a unique `ORIGIN`) to the other `Origin` (all with the `Dest` of "IND")? -==== - -=== OPTIONAL QUESTION - -Write a bash script that takes a year and IATA airport code and returns the year, and the total number of flights to and from the given airport. Example rows may look like: - ----- -1987, 12345 -1988, 44 ----- - -Run the script with inputs: `1991` and `ORD`. Include the output in your submission. - -.Items to submit -==== -- The content of your bash script (starting with "#!/bin/bash") in a code chunk. -- The output of the script given `1991` and `ORD` as inputs. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project07.adoc deleted file mode 100644 index 889bf4a7e..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project07.adoc +++ /dev/null @@ -1,138 +0,0 @@ -= STAT 29000: Project 7 -- Fall 2020 - -**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. 
`awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks. - -**Context:** This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently. - -**Scope:** awk, UNIX utilities, bash scripts - -.Learning objectives -**** -- Use `awk` to process and manipulate textual data. -- Use piping and redirection within the terminal to pass around data between utilities. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: -`/class/datamine/data/flights/subset/YYYY.csv` - -An example of the data for the year 1987 can be found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/1987.csv[here]. - -Sometimes if you are about to dig into a dataset, it is good to quickly do some sanity checks early on to make sure the data is what you expect it to be. - -== Questions - -[IMPORTANT] -==== -Please make sure to **double check** that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. -==== - -[IMPORTANT] -==== -Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to lose points. -==== - - -=== Question 1 - -Write a line of code that prints a list of the unique values in the `DayOfWeek` column. Write a line of code that prints a list of the unique values in the `DayOfMonth` column. Write a line of code that prints a list of the unique values in the `Month` column. Use the `1987.csv` dataset. Are the results what you expected? - -.Items to submit -==== -- 3 lines of code used to get a list of unique values for the chosen columns. -- 1-2 sentences explaining whether or not the results are what you expected. -==== - -=== Question 2 - -Our files should have 29 columns. For a given file, write a line of code that prints any lines that do *not* have 29 columns. Test it on `1987.csv`, were there any rows without 29 columns? - -[TIP] -==== -See [here](#unix-awk-built-in-variables). `NF` looks like it may be useful! -==== - -.Items to submit -==== -- Line of code used to solve the problem. -- 1-2 sentences explaining whether or not there were any rows without 29 columns. -==== - -=== Question 3 - -Write a bash script that, given a "begin" year and "end" year, cycles through the associated files and prints any lines that do *not* have 29 columns. - -.Items to submit -==== -- The content of your bash script (starting with "#!/bin/bash") in a code chunk. -- The results of running your bash scripts from year 1987 to 2008. -==== - -=== Question 4 - -`awk` is a really good tool to quickly get some data and manipulate it a little bit. The column `Distance` contains the distances of the flights in miles. Use `awk` to calculate the total distance traveled by the flights in 1990, and show the results in both miles and kilometers. 
To convert from miles to kilometers, simply multiply by 1.609344. - -The following is an output example: - ----- -Miles: 12345 -Kilometers: 19867.35168 ----- - -.Items to submit -==== -- The code used to solve the problem. -- The results of running the code. -==== - -=== Question 5 - -Use `awk` to calculate the sum of the number of `DepDelay` minutes, grouped according to `DayOfWeek`. Use `2007.csv`. - -The following is an output example: - ----- -DayOfWeek: 0 -1: 1234567 -2: 1234567 -3: 1234567 -4: 1234567 -5: 1234567 -6: 1234567 -7: 1234567 ----- - -[NOTE] -==== -1 is Monday. -==== - -.Items to submit -==== -- The code used to solve the problem. -- The output from running the code. -==== - -=== Question 6 - -It wouldn't be fair to compare the total `DepDelay` minutes by `DayOfWeek` as the number of flights may vary. One way to take this into account is to instead calculate an average. Modify (5) to calculate the average number of `DepDelay` minutes by the number of flights per `DayOfWeek`. Use `2007.csv`. - -The following is an output example: - ----- -DayOfWeek: 0 -1: 1.234567 -2: 1.234567 -3: 1.234567 -4: 1.234567 -5: 1.234567 -6: 1.234567 -7: 1.234567 ----- - -.Items to submit -==== -- The code used to solve the problem. -- The output from running the code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project08.adoc deleted file mode 100644 index f3821db8a..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project08.adoc +++ /dev/null @@ -1,144 +0,0 @@ -= STAT 29000: Project 8 -- Fall 2020 - -**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. `awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks. - -**Context:** This is the last part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently. - -**Scope:** awk, UNIX utilities, bash scripts - -.Learning objectives -**** -- Use `awk` to process and manipulate textual data. -- Use piping and redirection within the terminal to pass around data between utilities. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: -`/class/datamine/data/flights/subset/YYYY.csv` - -An example of the data for the year 1987 can be found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/1987.csv[here]. - -== Questions - -[IMPORTANT] -==== -Please make sure to **double check** that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. -==== - -[IMPORTANT] -==== -Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. 
Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to lose points. -==== - -=== Question 1 - -Let's say we have a theory that there are more flights on the weekend days (Friday, Saturday, Sunday) than the rest of the days, on average. We can use awk to quickly check it out and see if maybe this looks like something that is true! - -Write a line of `awk` code that, prints the _total number of flights_ that occur on weekend days, followed by the _total number of flights_ that occur on the weekdays. Complete this calculation for 2008 using the `2008.csv` file. - -[NOTE] -==== -Under the column `DayOfWeek`, Monday through Sunday are represented by 1-7, respectively. -==== - -.Items to submit -==== -- Line of `awk` code that solves the problem. -- The result: the number of flights on the weekend days, followed by the number of flights on the weekdays for the flights during 2008. -==== - -=== Question 2 - -Note that in (1), we are comparing 3 days to 4! Write a line of `awk` code that, prints the average number of flights on a weekend day, followed by the average number of flights on the weekdays. Continue to use data for 2008. - -[TIP] -==== -You don't need a large if statement to do this, you can use the `~` comparison operator. -==== - -.Items to submit -==== -- Line of `awk` code that solves the problem. -- The result: the average number of flights on the weekend days, followed by the average number of flights on the weekdays for the flights during 2008. -==== - -=== Question 3 - -We want to look to see if there may be some truth to the whole "snow bird" concept where people will travel to warmer states like Florida and Arizona during the Winter. Let's use the tools we've learned to explore this a little bit. - -Take a look at `airports.csv`. In particular run the following: - -[source,bash] ----- -head airports.csv ----- - -Notice how all of the non-numeric text is surrounded by quotes. The surrounding quotes would need to be escaped for any comparison within `awk`. This is messy and we would prefer to create a new file called `new_airports.csv` without any quotes. Write a line of code to do this. - -[NOTE] -==== -You may be wondering *why* we are asking you to do this. This sort of situation (where you need to deal with quotes) happens a lot! It's important to practice and learn ways to fix these things. -==== - -[TIP] -==== -You could use `gsub` within `awk` to replace '"' with ''. You can find how to use `gsub` https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html[here]. -==== - -[TIP] -==== -If you leave out the column number argument to `gsub` it will apply the substitution to every field in every column. -==== - -[TIP] -==== -[source,bash] ----- -cat new_airports.csv | wc -l -# should be 159 without header ----- -==== - -.Items to submit -==== -- Line of `awk` code used to create the new dataset. -==== - -=== Question 4 - -Write a line of commands that creates a new dataset called `az_fl_airports.txt`. `az_fl_airports.txt` should _only_ contain a list of airport codes for all airports from both Arizona (AZ) and Florida (FL). Use the file we created in (3), `new_airports.csv` as a starting point. - -How many airports are there? Did you expect this? Use a line of bash code to count this. - -Create a new dataset called `az_fl_flights.txt` that contains all of the data for flights into or out of Florida and Arizona using the `2008.csv` file. 
Use the newly created dataset, `az_fl_airports.txt`, to accomplish this. - -[TIP] -==== -https://unix.stackexchange.com/questions/293684/basic-grep-awk-help-extracting-all-lines-containing-a-list-of-terms-from-one-f -==== - -[TIP] -==== -[source,bash] ----- -cat az_fl_flights.txt | wc -l # should be 484705 ----- -==== - -.Items to submit -==== -- All UNIX commands used to answer the questions. -- The number of airports. -- 1-2 sentences explaining whether you expected this number of airports. -==== - -=== Question 5 - -Write a bash script that accepts the year as an argument and performs the same operations as in question 4, returning the number of flights into and out of both AZ and FL for any given year. - -.Items to submit -==== -- The content of your bash script (starting with "#!/bin/bash") in a code chunk. -- The line of UNIX code you used to execute the script and create the new dataset. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project09.adoc deleted file mode 100644 index 62aa8ae4e..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project09.adoc +++ /dev/null @@ -1,235 +0,0 @@ -= STAT 29000: Project 9 -- Fall 2020 - -**Motivation:** Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. In fact, https://www.cloudflare.com/[cloudflare], a billion dollar company, had much of its starting infrastructure built on top of a Postgresql database (per https://news.ycombinator.com/item?id=22878136[this thread on hackernews]). Learning SQL is _well_ worth your time! - -**Context:** There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we've spent much of this semester in the terminal, we will start in the terminal using SQLite. - -**Scope:** SQL, sqlite - -.Learning objectives -**** -- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet. -- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause. -- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/lahman/lahman.db` - -This is the Lahman Baseball Database. You can find its documentation http://www.seanlahman.com/files/database/readme2017.txt[here], including the definitions of the tables and columns. - -== Questions - -[IMPORTANT] -==== -Please make sure to **double check** that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. -==== - -[IMPORTANT] -==== -Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to lose points. -==== - -[IMPORTANT] -==== -For this project all solutions should be done using SQL code chunks. 
To connect to the database, copy and paste the following before your solutions in your .Rmd -==== - -````markdown -```{r, include=F}`r ''` -library(RSQLite) -lahman <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/lahman/lahman.db") -``` -```` - -Each solution should then be placed in a code chunk like this: - -````markdown -```{sql, connection=lahman}`r ''` -SELECT * FROM batting LIMIT 1; -``` -```` - -If you want to use a SQLite-specific function like `.tables` (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command: - -[source,bash] ----- -sqlite3 /class/datamine/data/lahman/lahman.db ----- - -From there, the SQLite-specific commands will function properly. They will _not_ function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option `eval=F` like this: - -````markdown -```{sql, connection=lahman, eval=F}`r ''` -SELECT * FROM batting LIMIT 1; -``` -```` - -This will allow the code to be displayed without throwing an error. - -=== Question 1 - -Connect to RStudio Server https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal and access the Lahman database. How many tables are available? - -[TIP] -==== -To connect to the database, do the following: -==== - -```{bash, eval=F} -sqlite3 /class/datamine/data/lahman/lahman.db -``` - -[TIP] -==== -https://database.guide/2-ways-to-list-tables-in-sqlite-database/[This] is a good resource. -==== - -.Items to submit -==== -- How many tables are available in the Lahman database? -- The sqlite3 commands used to figure out how many tables are available. -==== - -=== Question 2 - -Some people like to try to https://www.washingtonpost.com/graphics/2017/sports/how-many-mlb-parks-have-you-visited/[visit all 30 MLB ballparks] in their lifetime. Use SQL commands to get a list of `parks` and the cities they're located in. For your final answer, limit the output to 10 records/rows. - -[NOTE] -==== -There may be more than 30 parks in your result, this is ok. For long results, you can limit the number of printed results using the `LIMIT` clause. -==== - -[TIP] -==== -Make sure you take a look at the column names and get familiar with the data tables. If working from the Terminal, to see the header row as a part of each query result, run the following: - -[source,SQL] ----- -.headers on ----- -==== - -.Items to submit -==== -- SQL code used to solve the problem. -- The first 10 results of the query. -==== - -=== Question 3 - -There is nothing more exciting to witness than a home run hit by a batter. It's impressive if a player hits more than 40 in a season. Find the hitters who have hit 60 or more home runs (`HR`) in a season. List their `playerID`, `yearID`, home run total, and the `teamID` they played for. - -[TIP] -==== -There are 8 occurrences of home runs greater than or equal to 60. -==== - -[TIP] -==== -The `batting` table is where you should look for this question. -==== - -.Items to submit -==== -- SQL code used to solve the problem. -- The first 10 results of the query. -==== - -=== Question 4 - -Make a list of players born on your birth day (don't worry about the year). Display their first names, last names, and birth year. Order the list descending by their birth year. 
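If you would like to prototype the shape of this query in the Terminal before moving it into an SQL chunk, a sketch along the following lines may help. The birth month and day below are arbitrary placeholders, and the column names are assumptions to verify against the Lahman documentation.

[source,bash]
----
# sketch only -- swap in your own birth month/day, and double check the
# column names (assumed here) against the Lahman documentation
sqlite3 /class/datamine/data/lahman/lahman.db <<'SQL'
SELECT nameFirst, nameLast, birthYear
FROM people
WHERE birthMonth = 1 AND birthDay = 1
ORDER BY birthYear DESC
LIMIT 10;
SQL
----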
- -[TIP] -==== -The `people` table is where you should look for this question. -==== - -[NOTE] -==== -Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here]. -==== - -.Items to submit -==== -- SQL code used to solve the problem. -- The first 10 results of the query. -==== - -=== Question 5 - -Get the Cleveland (CLE) Pitching Roster from the 2016 season (`playerID`, `W`, `L`, `SO`). Order the pitchers by number of Strikeouts (SO) in descending order. - -[TIP] -==== -The `pitching` table is where you should look for this question. -==== - -[NOTE] -==== -Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here]. -==== - -.Items to submit -==== -- SQL code used to solve the problem. -- The first 10 results of the query. -==== - -=== Question 6 - -Find the 10 team and year pairs that have the most number of Errors (`E`) between 1960 and 1970. Display their Win and Loss counts too. What is the name of the team that appears in 3rd place in the ranking of the team and year pairs? - -[TIP] -==== -The `teams` table is where you should look for this question. -==== - -[TIP] -==== -The `BETWEEN` clause is useful here. -==== - -[TIP] -==== -It is OK to use multiple queries to answer the question. -==== - -[NOTE] -==== -Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here]. -==== - -.Items to submit -==== -- SQL code used to solve the problem. -- The first 10 results of the query. -==== - -=== Question 7 - -Find the `playerID` for Bob Lemon. What year and team was he on when he got the most wins as a pitcher (use table `pitching`)? What year and team did he win the most games as a manager (use table `managers`)? - -[TIP] -==== -It is OK to use multiple queries to answer the question. -==== - -[NOTE] -==== -There was a tie among the two years in which Bob Lemon had the most wins as a pitcher. -==== - -[NOTE] -==== -Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here]. -==== - -.Items to submit -==== -- SQL code used to solve the problem. -- The first 10 results of the query. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project10.adoc deleted file mode 100644 index 45cb31d19..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project10.adoc +++ /dev/null @@ -1,210 +0,0 @@ -= STAT 29000: Project 10 -- Fall 2020 - -**Motivation:** Although SQL syntax may still feel unnatural and foreign, with more practice it _will_ start to make more sense. The ability to read and write SQL queries is a bread-and-butter skill for anyone working with data. - -**Context:** We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like `AVG`, `MIN`, and `MAX`. - -**Scope:** SQL, sqlite - -.Learning objectives -**** -- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet. -- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause. -- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc. 
-- Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems. -**** - -== Dataset - -The following questions will use the dataset similar to the one from Project 9, but this time we will use a MariaDB version of the database, which is also hosted on Scholar, at `scholar-db.rcac.purdue.edu`. -As in Project 9, this is the Lahman Baseball Database. You can find its documentation http://www.seanlahman.com/files/database/readme2017.txt[here], including the definitions of the tables and columns. - -== Questions - -[IMPORTANT] -==== -Please make sure to **double check** that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. -==== - -[IMPORTANT] -==== -Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to lose points. -==== - -[IMPORTANT] -==== -For this project all solutions should be done using R code chunks, and the `RMariaDB` package. Run the following code to load the library: - -[source,r] ----- -library(RMariaDB) ----- -==== - -=== Question 1 - -Connect to RStudio Server https://rstudio.scholar.rcac.purdue.edu, and, rather than navigating to the terminal like we did in the previous project, instead, create a connection to our MariaDB lahman database using the `RMariaDB` package in R, and the credentials below. Confirm the connection by running the following code chunk: - -[source,r] ----- -con <- dbConnect(RMariaDB::MariaDB(), - host="scholar-db.rcac.purdue.edu", - db="lahmandb", - user="lahman_user", - password="HitAH0merun") -head(dbGetQuery(con, "SHOW tables;")) ----- - -[TIP] -==== -In the example provided, the variable `con` from the `dbConnect` function is the connection. Each query that you make, using the `dbGetQuery`, needs to use this connection `con`. You can change the name `con` if you want to (it is user defined), but if you change the name `con`, you need to change it on all of your connections. If your connection to the database dies while you are working on the project, you can always re-run the `dbConnect` line again, to reset your connection to the database. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your (potentially modified) `head(dbGetQuery(con, "SHOW tables;"))`. -==== - -=== Question 2 - -How many players are members of the 40/40 club? These are players that have stolen at least 40 bases (`SB`) and hit at least 40 home runs (`HR`) in one year. - -[TIP] -==== -Use the `batting` table. -==== - -[IMPORTANT] -==== -You only need to run `library(RMariaDB)` and the `dbConnect` portion of the code a single time towards the top of your project. After that, you can simply reuse your connection `con` to run queries. -==== - -[IMPORTANT] -==== -In our xref:templates.adoc[project template], for this project, make all of the SQL queries using the `dbGetQuery` function, which returns the results directly in `R`. Therefore, your `RMarkdown` blocks for this project should all be `{r}` blocks (as opposed to the `{sql}` blocks used in Project 9). -==== - -[TIP] -==== -You can use `dbGetQuery` to run your queries from within R. 
Example: - -[source,r] ----- -dbGetQuery(con, "SELECT * FROM battings LIMIT 5;") ----- -==== - -[NOTE] -==== -We already demonstrated the correct SQL query to use for the 40/40 club in the video below, but now we want you to use `RMariaDB` to solve this query. -==== - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the problem. -- The result of running the R code. -==== - -=== Question 3 - -Find Corey Kluber's lifetime across his career (i.e., use `SUM` from `SQL` to summarize his achievements) in two categories: strikeouts (`SO`) and walks (`BB`). Also display his Strikeouts to Walks ratio. A Strikeout to Walks ratio is calculated by this equation: $\frac{Strikeouts}{Walks}$. - -++++ - -++++ - -[IMPORTANT] -==== -Questions in this project need to be solved using SQL when possible. You will not receive credit for a question if you use `sum` in R rather than `SUM` in SQL. -==== - -[TIP] -==== -Use the `people` table to find the `playerID` and use the `pitching` table to find the statistics. -==== - -.Items to submit -==== -- R code used to solve the problem. -- The result of running the R code. -==== - -=== Question 4 - -How many times in total has Giancarlo Stanton struck out in years in which he played for "MIA" or "FLO"? - -[TIP] -==== -Use the `people` table to find the `playerID` and use the `batting` table to find the statistics. -==== - -.Items to submit -==== -- R code used to solve the problem. -- The result of running the R code. -==== - -=== Question 5 - -The https://en.wikipedia.org/wiki/Batting_average_(baseball)[Batting Average] is a metric for a batter's performance. The Batting Average in a year is calculated by stem:[\frac{H}{AB}] (the number of hits divided by at-bats). Considering (only) the years between 2000 and 2010, calculate the (seasonal) Batting Average for each batter who had more than 300 at-bats in a season. List the top 5 batting averages next to `playerID`, `teamID`, and `yearID.` - -[TIP] -==== -Use the `batting` table. -==== - -.Items to submit -==== -- R code used to solve the problem. -- The result of running the R code. -==== - -=== Question 6 - -How many unique players have hit > 50 home runs (`HR`) in a season? - -[TIP] -==== -Use the `batting` table. -==== - -[TIP] -==== -If you view `DISTINCT` as being paired with `SELECT`, instead, think of it as being paired with one of the fields you are selecting. -==== - -.Items to submit -==== -- R code used to solve the problem. -- The result of running the R code. -==== - -=== Question 7 - -Find the number of unique players that attended Purdue University. Start by finding the `schoolID` for Purdue and then find the number of players who played there. Do the same for IU. Who had more? Purdue or IU? Use the information you have in the database, and the power of R to create a misleading graphic that makes Purdue look better than IU, even if just at first glance. Make sure you label the graphic. - -[TIP] -==== -Use the `schools` table to get the `schoolID`s, and the `collegeplaying` table to get the statistics. -==== - -[TIP] -==== -You can mess with the scale of the y-axis. You could (potentially) filter the data to start from a certain year or be between two dates. -==== - -[TIP] -==== -To find IU's id, try the following query: `SELECT schoolID FROM schools WHERE name_full LIKE '%indiana%';`. You can find more about the LIKE clause and `%` https://www.tutorialspoint.com/sql/sql-like-clause.htm[here]. -==== - -.Items to submit -==== -- R code used to solve the problem. 
-- The result of running the R code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project11.adoc deleted file mode 100644 index 94b4ee5d4..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project11.adoc +++ /dev/null @@ -1,215 +0,0 @@ -= STAT 29000: Project 11 -- Fall 2020 - -**Motivation:** Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like MIN, MAX, and AVG in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values. - -**Context:** We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values using a much larger dataset! - -**Scope:** SQL, SQL in R - -.Learning objectives -**** -- Demonstrate the ability to interact with popular database management systems within R. -- Solve data-driven problems using a combination of SQL and R. -- Basic clauses: SELECT, ORDER BY, LIMIT, DESC, ASC, COUNT, WHERE, FROM, etc. -- Showcase the ability to filter, alias, and write subqueries. -- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. Explain when to use having, and when to use where. -**** - -== Dataset - -The following questions will use the `elections` database. Similar to Project 10, this database is hosted on Scholar. Moreover, Question 1 also involves the following data files found in Scholar: - -`/class/datamine/data/election/itcontYYYY.txt` (for example, data for year 1980 would be `/class/datamine/data/election/itcont1980.txt`) - -A public sample of the data can be found here: - -https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcontYYYY.txt (for example, data for year 1980 would be https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcont1980.txt) - -== Questions - -[IMPORTANT] -==== -For this project you will need to connect to the database `elections` using the `RMariaDB` package in R. Include the following code chunk in the beginning of your RMarkdown file: - -````markdown -```{r setup-database-connection}`r ''` -library(RMariaDB) -con <- dbConnect(RMariaDB::MariaDB(), - host="scholar-db.rcac.purdue.edu", - db="elections", - user="elections_user", - password="Dataelect!98") -``` -```` -==== - -When a question involves SQL queries in this project, you may use a SQL code chunk (with `{sql}`), or an R code chunk (with `{r}`) and functions like `dbGetQuery` as you did in Project 10. Please refer to Question 5 in the xref:templates.adoc[project template] for examples. - -=== Question 1 - -Approximately how large was the lahman database (use the sqlite database in Scholar: `/class/datamine/data/lahman/lahman.db`)? Use UNIX utilities you've learned about this semester to write a line of code to return the size of that .db file (in MB). - -The data we consider in this project are much larger. Use UNIX utilities (bash and awk) to write another line of code that calculates the total amount of data in the elections folder `/class/datamine/data/election/`. How much data (in MB) is there? - -The data in that folder has been added to the `elections` database, all aggregated in the `elections` table. 
Write a SQL query that returns the number of rows of data are in the database. How many rows of data are in the table `elections`? - -[NOTE] -==== -These are some examples of how to get the sizes of collections of files in UNIX: -==== - -++++ - -++++ - -[TIP] -==== -The SQL query will take some time! Be patient. -==== - -[NOTE] -==== -You may use more than one code chunk in your RMarkdown file for the different tasks. -==== - -[NOTE] -==== -We will accept values that represent either apparent or allocated size, as well as estimated disk usage. To get the size from `ls` and `du` to match, use the `--apparent-size` option with `du`. -==== - -[NOTE] -==== -A Megabyte (MB) is actually 1000^2 bytes, not 1024^2. A Mebibyte (MiB) is 1024^2 bytes. See https://en.wikipedia.org/wiki/Gigabyte[here] for more information. For this question, either solution will be given full credit. https://thedatamine.github.io/the-examples-book/unix.html#why-is-the-result-of-du--b-.metadata.csv-divided-by-1024-not-the-result-of-du--k-.metadata.csv[This] is a potentially useful example. -==== - -.Items to submit -==== -- Line of code (bash/awk) to show the size (in MB) of the lahman database file. -- Approximate size of the lahman database in MB. -- Line of code (bash/awk) to calculate the size (in MB) of the entire elections dataset in `/class/datamine/data/election`. -- The size of the elections data in MB. -- SQL query used to find the number of rows of data in the `elections` table in the `elections` database. -- The number of rows in the `elections` table in the `elections` database. -==== - -=== Question 2 - -Write a SQL query using the `LIKE` command to find a unique list of `zip_code` that start with "479". - -Write another SQL query and answer: How many unique `zip_code` are there that begin with "479"? - -[NOTE] -==== -Here are some examples about SQL that might be relevant for Questions 2 and 3 in this project. -==== - -++++ - -++++ - -[TIP] -==== -The first query returns a list of zip codes, and the second returns a count. -==== - -[TIP] -==== -Make sure you only select `zip_code`. -==== - -.Items to submit -==== -- SQL queries used to answer the question. -- The first 5 results from running the query. -==== - -=== Question 3 - -Write a SQL query that counts the number of donations (rows) that are from Indiana. How many donations are from Indiana? Rewrite the query and create an _alias_ for our field so it doesn't read `COUNT(*)` but rather `Indiana Donations`. - -[TIP] -==== -You may enclose an alias's name in quotation marks (single or double) when the name contains space. -==== - -.Items to submit -==== -- SQL query used to answer the question. -- The result of the SQL query. -==== - -=== Question 4 - -Rewrite the query in (3) so the result is displayed like: `IN: 1234567`. Note, if instead of "IN" we wanted "OH", only the WHERE clause should be modified, and the display should automatically change to `OH: 1234567`. In other words, the state abbreviation should be dynamic, not static. - -[NOTE] -==== -This video demonstrates how to use CONCAT in a MySQL query: -==== - -++++ - -++++ - -[TIP] -==== -Use CONCAT and aliasing to accomplish this. -==== - -[TIP] -==== -Remember, `state` contains the state abbreviation. -==== - -.Items to submit -==== -- SQL query used to answer the question. -==== - -=== Question 5 - -In (2) we wrote a query that returns a unique list of zip codes that start with "479". In (3) we wrote a query that counts the number of donations that are from Indiana. 
Use our query from (2) as a sub-query to find how many donations come from areas with zip codes starting with "479". What percent of donations in Indiana come from said zip codes? - -[NOTE] -==== -This video gives two examples of sub-queries: -==== - -++++ - -++++ - -[TIP] -==== -You can simply manually calculate the percent using the count in (2) and (5). -==== - -.Items to submit -==== -- SQL queries used to answer the question. -- The percentage of donations from Indiana from `zip_code`s starting with "479". -==== - -=== Question 6 - -In (3) we wrote a query that counts the number of donations that are from Indiana. When running queries like this, a natural "next question" is to ask the same question about another state. SQL gives us the ability to calculate functions in aggregate when grouping by a certain column. Write a SQL query that returns the state, number of donations from each state, the sum of the donations (`transaction_amt`). Which 5 states gave the most donations (highest count)? Order you result from most to least. - -[NOTE] -==== -In this video we demonstrate `GROUP BY`, `ORDER BY`, `DESC`, and other aspects of MySQL that might help with this question: -==== - -++++ - -++++ - -[TIP] -==== -You may want to create an alias in order to sort. -==== - -.Items to submit -==== -- SQL query used to answer the question. -- Which 5 states gave the most donations? -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project12.adoc deleted file mode 100644 index 144c1e421..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project12.adoc +++ /dev/null @@ -1,172 +0,0 @@ -= STAT 29000: Project 12 -- Fall 2020 - -**Motivation:** Databases are comprised of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so we perform joins! In this project we will explore learn about and practice using joins on a database containing bike trip information from the Bay Area Bike Share. - -**Context:** We've introduced a variety of SQL commands that let you filter and extract information from a database in an systematic way. In this project we will introduce joins, a powerful method to combine data from different tables. - -**Scope:** SQL, sqlite, joins - -.Learning objectives -**** -- Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem. -- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. -- Showcase the ability to filter, alias, and write subqueries. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db` - -A public sample of the data can be found https://www.datadepot.rcac.purdue.edu/datamine/data/bay_area_bike_share/bay_area_bike_share.db[here]. - -[IMPORTANT] -==== -For this project all solutions should be done using SQL code chunks. 
To connect to the database, copy and paste the following before your solutions in your .Rmd: - -````markdown -```{r, include=F}`r ''` -library(RSQLite) -bikeshare <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db") -``` -```` - -Each solution should then be placed in a code chunk like this: - -````markdown -```{sql, connection=bikeshare}`r ''` -SELECT * FROM station LIMIT 5; -``` -```` -==== - -If you want to use a SQLite-specific function like `.tables` (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command: - -```{bash, eval=F} -sqlite3 /class/datamine/data/bay_area_bike_share/bay_area_bike_share.db -``` - -From there, the SQLite-specific commands will function properly. They will _not_ function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option `eval=F` like this: - -````markdown -```{sql, connection=bikeshare, eval=F}`r ''` -SELECT * FROM station LIMIT 5; -``` -```` - -This will allow the code to be displayed without throwing an error. - -There are a variety of ways to join data using SQL. With that being said, if you are able to understand and use a LEFT JOIN and INNER JOIN, you can perform *all* of the other types of joins (RIGHT JOIN, FULL OUTER JOIN). - -== Questions - -=== Question 1 - -Aliases can be created for tables, fields, and even results of aggregate functions (like MIN, MAX, COUNT, AVG, etc.). In addition, you can combine fields using the `sqlite` concatenate operator `||` see https://www.sqlitetutorial.net/sqlite-string-functions/sqlite-concat/[here]. Write a query that returns the first 5 records of information from the `station` table formatted in the following way: - -`(id) name @ (lat, long)` - -For example: - -`(84) Ryland Park @ (37.342725, -121.895617)` - -[TIP] -==== -Here is a video about how to concatenate strings in SQLite. -==== - -++++ - -++++ - -.Items to submit -==== -- SQL query used to solve this problem. -- The first 5 records of information from the `station` table. -==== - -=== Question 2 - -There is a variety of interesting weather information in the `weather` table. Write a query that finds the average `mean_temperature_f` by `zip_code`. Which is on average the warmest `zip_code`? - -Use aliases to format the result in the following way: - -```{txt} -Zip Code|Avg Temperature -94041|61.3808219178082 -``` -Note that this is the output if you use `sqlite` in the terminal. While the output in your knitted pdf file may look different, you should name the columns accordingly. - -[TIP] -==== -Here is a video about GROUP BY, ORDER BY, DISTINCT, and COUNT -==== - -++++ - -++++ - -.Items to submit -==== -- SQL query used to solve this problem. -- The results of the query copy and pasted. -==== - -=== Question 3 - -From (2) we can see that there are only 5 `zip_code`s with weather information. How many unique `zip_code`s do we have in the `trip` table? Write a query that finds the number of unique `zip_code`s in the `trip` table. Write another query that lists the `zip_code` and count of the number of times the `zip_code` appears. If we had originally assumed that the `zip_code` was related to the location of the trip itself, we were wrong. 
Can you think of a likely explanation for the unexpected `zip_code` values in the `trip` table? - -[TIP] -==== -There could be missing values in `zip_code`. We want to avoid them in SQL queries, for now. You can learn more about the missing values (or NULL) in SQL https://www.w3schools.com/sql/sql_null_values.asp[here]. -==== - -.Items to submit -==== -- SQL queries used to solve this problem. -- 1-2 sentences explainging what a possible explanation for the `zip_code`s could be. -==== - -=== Question 4 - -In (2) we wrote a query that finds the average `mean_temperature_f` by `zip_code`. What if we want to tack on our results in (2) to information from each row in the `station` table based on the `zip_code`? To do this, use an INNER JOIN. INNER JOIN combines tables based on specified fields, and returns only rows where there is a match in both the "left" and "right" tables. - -[TIP] -==== -Use the query from (2) as a sub query within your solution. -==== - -[TIP] -==== -Here is a video about JOIN and LEFT JOIN. -==== - -++++ - -++++ - -.Items to submit -==== -- SQL query used to solve this problem. -==== - -=== Question 5 - -In (3) we alluded to the fact that many `zip_code` in the `trip` table aren't very consistent. Users can enter a zip code when using the app. This means that `zip_code` can be from anywhere in the world! With that being said, if the `zip_code` is one of the 5 `zip_code` for which we have weather data (from question 2), we can add that weather information to matching rows of the `trip` table. In (4) we used an INNER JOIN to append some weather information to each row in the `station` table. For this question, write a query that performs an INNER JOIN and appends weather data from the `weather` table to the trip data from the `trip` table. Limit your output to 5 lines. - -[IMPORTANT] -==== -Notice that the weather data has about 1 row of weather information for each date and each zip code. This means you may have to join your data based on multiple constraints instead of just 1 like in (4). In the `trip` table, you can use `start_date` for for the date information. -==== - -[TIP] -==== -You will want to wrap your dates and datetimes in https://www.sqlitetutorial.net/sqlite-date-functions/sqlite-date-function/[sqlite's `date` function] prior to comparison. -==== - -.Items to submit -==== -- SQL query used to solve this problem. -- First 5 lines of output. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project13.adoc deleted file mode 100644 index 3ceb4cb04..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project13.adoc +++ /dev/null @@ -1,155 +0,0 @@ -= STAT 29000: Project 13 -- Fall 2020 - -**Motivation:** Databases you will work with won't necessarily come organized in the way that you like. Getting really comfortable writing longer queries where you have to perform many joins, alias fields and tables, and aggregate results, is important. In addition, gaining some familiarity with terms like _primary key_, and _foreign key_ will prove useful when you need to search for help online. In this project we will write some more complicated queries with a fun database. Proper preparation prevents poor performance, and that means practice! - -**Context:** We are towards the end of a series of projects that give you an opportunity to practice using SQL. 
In this project, we will reinforce topics you've already learned, with a focus on subqueries and joins. - -**Scope:** SQL, sqlite - -.Learning objectives -**** -- Write and run SQL queries in `sqlite` on real-world data. -- Identify primary and foreign keys in a SQL table. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/movies_and_tv/imdb.db` - -[IMPORTANT] -==== -For this project you will use SQLite to access the data. To connect to the database, copy and paste the following before your solutions in your .Rmd: - -````markdown -```{r, include=F}`r ''` -library(RSQLite) -imdb <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db") -``` -```` -==== - -If you want to use a SQLite-specific function like `.tables` (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command: - -```{bash, eval=F} -sqlite3 /class/datamine/data/movies_and_tv/imdb.db -``` - -From there, the SQLite-specific commands will function properly. They will _not_ function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option `eval=F` like this: - -````markdown -```{sql, connection=imdb, eval=F}`r ''` -SELECT * FROM titles LIMIT 5; -``` -```` - -This will allow the code to be displayed without throwing an error. - -== Questions - -=== Question 1 - -A primary key is a field in a table which uniquely identifies a row in the table. Primary keys _must_ be unique values, and this is enforced at the database level. A foreign key is a field whose value matches a primary key in a different table. A table can have 0-1 primary key, but it can have 0+ foreign keys. Examine the `titles` table. Do you think there are any primary keys? How about foreign keys? Now examine the `episodes` table. Based on observation and the column names, do you think there are any primary keys? How about foreign keys? - -[TIP] -==== -A primary key can also be a foreign key. -==== - -[TIP] -==== -Here are two videos. The first video will remind you how to find the names of all of the tables in the `imdb` database. The second video will introduce you to the `titles` and `episodes` tables in the `imdb` database. -==== - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- List any primary or foreign keys in the `titles` table. -- List any primary or foreign keys in the `episodes` table. -==== - -=== Question 2 - -If you paste a `title_id` to the end of the following url, it will pull up the page for the title. For example, https://www.imdb.com/title/tt0413573 leads to the page for the TV series _Grey's Anatomy_. Write a SQL query to confirm that the `title_id` tt0413573 does indeed belong to _Grey's Anatomy_. Then browse imdb.com and find your favorite TV show. Get the `title_id` from the url of your favorite TV show and run the following query, to confirm that the TV show is in our database: - -[source,SQL] ----- -SELECT * FROM titles WHERE title_id=''; ----- - -Make sure to replace "<title id here>" with the `title_id` of your favorite show. If your show does not appear, or has only a single season, pick another show until you find one we have in our database with multiple seasons. 
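If you would rather test this from the R console than from a `{sql}` chunk, the same check can be run through the `imdb` connection created in the setup chunk. This is just a convenience sketch; `my_title_id` is a placeholder that you would replace with the id taken from your favorite show's URL.

[source,r]
----
library(DBI)  # provides dbGetQuery; the `imdb` connection comes from the setup chunk

# confirm that tt0413573 really is Grey's Anatomy
dbGetQuery(imdb, "SELECT * FROM titles WHERE title_id = 'tt0413573';")

# repeat the check with your own show's id (placeholder value shown here)
my_title_id <- "tt0413573"
dbGetQuery(imdb, sprintf("SELECT * FROM titles WHERE title_id = '%s';", my_title_id))
----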
- -.Items to submit -==== -- SQL query used to confirm that `title_id` tt0413573 does indeed belong to _Grey's Anatomy_. -- The output of the query. -- The `title_id` of your favorite TV show. -- SQL query used to confirm the `title_id` for your favorite TV show. -- The output of the query. -==== - -=== Question 3 - -The `episode_title_id` column in the `episodes` table references titles of individual episodes of a TV series. The `show_title_id` references the titles of the show itself. With that in mind, write a query that gets a list of all `episodes_title_id` (found in the `episodes` table), with the associated `primary_title` (found in the `titles` table) for each episode of _Grey's Anatomy_. - -[TIP] -==== -This video shows how to extract titles of episodes in the `imdb` database. -==== - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_uhg3atol&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_wmo98brv"></iframe> -++++ - -.Items to submit -==== -- SQL query used to solve the problem in a code chunk. -==== - -=== Question 4 - -We want to write a query that returns the title and rating of the highest rated episode of your favorite TV show, which you chose in (2). In order to do so, we will break the task into two parts in (4) and (5). First, write a query that returns a list of `episode_title_id` (found in the `episodes` table), with the associated `primary_title` (found in the `titles` table) for each episode. - -[TIP] -==== -This part is just like question (3) but this time with your favorite TV show, which you chose in (2). -==== - -[TIP] -==== -This video shows how to use a subquery, to `JOIN` a total of three tables in the `imdb` database. -==== - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_jb8vd4nc&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_sc5yje1a"></iframe> -++++ - -.Items to submit -==== -- SQL query used to solve the problem in a code chunk. -- The first 5 results from your query. -==== - -=== Question 5 - -Write a query that adds the rating to the end of each episode. To do so, use the query you wrote in (4) as a subquery. Which episode has the highest rating? Is it also your favorite episode? - -[NOTE] -==== -Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here]. 
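As a rough sketch of the shape (not a drop-in solution), the query from (4) can be wrapped as a derived table and joined to the table that stores ratings. The `ratings` table name and its `title_id` and `rating` columns are assumptions here; confirm them with `.tables` and a quick `SELECT` before relying on them.

[source,r]
----
library(DBI)  # the `imdb` connection comes from the setup chunk

# sketch only: assumes a `ratings` table keyed by `title_id` with a `rating` column
dbGetQuery(imdb, "
  SELECT e.episode_title_id, e.primary_title, r.rating
  FROM (SELECT ep.episode_title_id, t.primary_title
          FROM episodes AS ep
          INNER JOIN titles AS t ON t.title_id = ep.episode_title_id
         WHERE ep.show_title_id = 'tt0413573') AS e
  INNER JOIN ratings AS r ON r.title_id = e.episode_title_id
  ORDER BY r.rating DESC
  LIMIT 5;
")
----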
-==== - -.Items to submit -==== -- SQL query used to solve the problem in a code chunk. -- The `episode_title_id`, `primary_title`, and `rating` of the top rated episode from your favorite TV series, in question (2). -- A statement saying whether the highest rated episode is also your favorite episode. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project14.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project14.adoc deleted file mode 100644 index 36c177b92..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project14.adoc +++ /dev/null @@ -1,133 +0,0 @@ -= STAT 29000: Project 14 -- Fall 2020 - -**Motivation:** As we learned earlier in the semester, bash scripts are a powerful tool when you need to perform repeated tasks in a UNIX-like system. In addition, sometimes preprocessing data using UNIX tools prior to analysis in R or Python is useful. Ample practice is integral to becoming proficient with these tools. As such, we will be reviewing topics learned earlier in the semester. - -**Context:** We've just ended a series of projects focused on SQL. In this project we will begin to review topics learned throughout the semester, starting writing bash scripts using the various UNIX tools we learned about in Projects 3 through 8. - -**Scope:** awk, UNIX utilities, bash scripts, fread - -.Learning objectives -**** -- Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc. -- Analyzing file in a UNIX filesystem: wc, du, cat, head, tail, etc. -- Creating and destroying files and folder in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc. -- Use grep to search files effectively. -- Use cut to section off data from the command line. -- Use piping to string UNIX commands together. -- Use awk for data extraction, and preprocessing. -- Create bash scripts to automate a process or processes. -**** - -== Dataset - -The following questions will use ENTIRE_PLOTSNAP.csv from the data folder found in Scholar: - -`/anvil/projects/tdm/data/forest/` - -To read more about ENTIRE_PLOTSNAP.csv that you will be working with: - -https://www.uvm.edu/femc/data/archive/project/federal-forest-inventory-analysis-data-for/dataset/plot-level-data-gathered-through-forest/metadata#fields - -== Questions - -=== Question 1 - -Take a look at at `ENTIRE_PLOTSNAP.csv`. Write a line of awk code that displays the `STATECD` followed by the number of rows with that `STATECD`. - -.Items to submit -==== -- Code used to solve the problem. -- Count of the following `STATECD`s: 1, 2, 4, 5, 6 -==== - -=== Question 2 - -Unfortunately, there isn't a very accessible list available that shows which state each `STATECD` represents. This is no problem for us though, the dataset has `LAT` and `LON`! Write some bash that prints just the `STATECD`, `LAT`, and `LON`. - -[NOTE] -==== -There are 92 columns in our dataset: `awk -F, 'NR==1{print NF}' ENTIRE_PLOTSNAP.csv`. To create a list of `STATECD` to state, we only really need `STATECD`, `LAT`, and `LON`. Keeping the other 89 variables will keep our data at 2.6gb. -==== - -.Items to submit -==== -- Code used to solve the problem. -- The output of your code piped to `head`. -==== - -=== Question 3 - -`fread` is a "Fast and Friendly File Finagler". It is part of the very popular `data.table` package in R. We will learn more about this package next semester. 
For now, read the documentation https://www.rdocumentation.org/packages/data.table/versions/1.12.8/topics/fread[here] and use the `cmd` argument in conjunction with your bash code from (2) to read the data of `STATECD`, `LAT`, and `LON` into a `data.table` in your R environment. - -.Items to submit -==== -- Code used to solve the problem. -- The `head` of the resulting `data.table`. -==== - -=== Question 4 - -We are going to further understand the data from question (3) by finding the actual locations based on the `LAT` and `LON` columns. We can use the library `revgeo` to get a location given a pair of longitude and latitude values. `revgeo` uses a free API hosted by https://github.com/komoot/photon[photon] in order to do so. - -For example: - -[source,r] ----- -library(revgeo) -revgeo(longitude=-86.926153, latitude=40.427055, output='frame') ----- - -The code above will give you the address information in six columns, from the most-granular `housenumber` to the least-granular `country`. Depending on the coordinates, `revgeo` may or may not give you results for each column. For this question, we are going to keep only the `state` column. - -There are over 4 million rows in our dataset -- we do _not_ want to hit https://github.com/komoot/photon[photon's] API that many times. Instead, we are going to do the following: - -* Unless you feel comfortable using `data.table`, convert your `data.table` to a `data.frame`: - -[source,r] ----- -my_dataframe <- data.frame(my_datatable) ----- - -* Calculate the average `LAT` and `LON` for each `STATECD`, and call the new `data.frame`, `dat`. This should result in 57 rows of lat/long pairs. - -* For each row in `dat`, run a reverse geocode and append the `state` to a new column called `STATE`. - -[TIP] -==== -To calculate the average `LAT` and `LON` for each `STATECD`, you could use the https://www.rdocumentation.org/packages/sqldf/versions/0.4-11[`sqldf`] package to run SQL queries on your `data.frame`. -==== - -[TIP] -==== -https://stackoverflow.com/questions/3505701/grouping-functions-tapply-by-aggregate-and-the-apply-family[`mapply`] is a useful apply function to use to solve this problem. -==== - -[TIP] -==== -Here is some extra help: - -[source,r] ----- -library(revgeo) -points <- data.frame(latitude=c(40.433663, 40.432104, 40.428486), longitude=c(-86.916584, -86.919610, -86.920866)) -# Note that the "output" argument gets passed to the "revgeo" function. -mapply(revgeo, points$longitude, points$latitude, output="frame") -# The output isn't in a great format, and we'd prefer to just get the "state" data. -# Let's wrap "revgeo" into another function that just gets "state" and try again. -get_state <- function(lon, lat) { - return(revgeo(lon, lat, output="frame")["state"]) -} -mapply(get_state, points$longitude, points$latitude) ----- -==== - -[IMPORTANT] -==== -It is okay to get "Not Found" for some of the addresses. -==== - -.Items to submit -==== -- Code used to solve the problem. -- The `head` of the resulting `data.frame`. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project15.adoc b/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project15.adoc deleted file mode 100644 index f0c2eb117..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/29000/29000-f2020-project15.adoc +++ /dev/null @@ -1,102 +0,0 @@ -== STAT 29000: Project 15 -- Fall 2020 - -**Motivation:** We've done a lot of work with SQL this semester. 
Let's review concepts in this project and mix and match R and SQL to solve data-driven problems. - -**Context:** In this project, we will reinforce topics you've already learned, with a focus on SQL. - -**Scope:** SQL, sqlite, R - -.Learning objectives -**** -- Write and run SQL queries in `sqlite` on real-world data. -- Use SQL from within R. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/movies_and_tv/imdb.db` - -== Questions - -=== Question 1 - -What is the first year where our database has > 1000 titles? Use the `premiered` column in the `titles` table as our year. What year has the most titles? - -[TIP] -==== -There could be missing values in `premiered`. We want to avoid them in SQL queries, for now. You can learn more about the missing values (or NULL) in SQL https://www.w3schools.com/sql/sql_null_values.asp[here]. -==== - -.Items to submit -==== -- SQL queries used to answer the questions. -- What year is the first year to have > 1000 titles? -- What year has the most titles? -==== - -=== Question 2 - -What, and how many, unique `type` are there from the `titles` table? For the year found in question (1) with the most `titles`, how many titles of each `type` are there? - -.Items to submit -==== -- SQL queries used to answer the questions. -- How many and what are the unique `types` from the `titles` table? -- A list of `type` and and count for the year (`premiered`) that had the most `titles`. -==== - -F.R.I.E.N.D.S is a popular tv show. They have an interesting naming convention for the names of their episodes. They all begin with the text "The One ...". There are 6 primary characters in the show: Chandler, Joey, Monica, Phoebe, Rachel, and Ross. Let's use SQL and R to take a look at how many times each characters' names appear in the title of the episodes. - -=== Question 3 - -Write a query that gets the `episode_title_id`, `primary_title`, `rating`, and `votes`, of all of the episodes of Friends (`title_id` is tt0108778). - -[TIP] -==== -You can slightly modify the solution to question (5) in project 13. -==== - -.Items to submit -==== -- SQL query used to answer the question. -- First 5 results of the query. -==== - -=== Question 4 - -Now that you have a working query, connect to the database and run the query to get the data into an R data frame. In previous projects, we learned how to used regular expressions to search for text. For each character, how many episodes `primary_title`s contained their name? - -.Items to submit -==== -- R code in a code chunk that was used to find the solution. -- The solution pasted below the code chunk. -==== - -=== Question 5 - -Create a graphic showing our results in (4) using your favorite package. Make sure the plot has a good title, x-label, y-label, and try to incorporate some of the following colors: #273c8b, #bd253a, #016f7c, #f56934, #016c5a, #9055b1, #eaab37. - -.Items to submit -==== -- The R code used to generate the graphic. -- The graphic in a png or jpg/jpeg format. -==== - -=== Question 6 - -Use a combination of SQL and R to find which of the following 3 genres has the highest average rating for movies (see `type` column from `titles` table): Romance, Comedy, Animation. In the `titles` table, you can find the genres in the `genres` column. There may be some overlap (i.e. a movie may have more than one genre), this is ok. 
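One way to avoid writing three nearly identical queries is to let R substitute each genre into a single query string and collect the three averages. The sketch below makes several assumptions you should verify first: that your connection object is named `imdb` (opened as in the previous projects), that ratings live in a `ratings` table keyed by `title_id`, and that movies are marked with `type = 'movie'` in the `titles` table. It also leans on the `LIKE` pattern shown just below.

[source,r]
----
library(DBI)  # assumes `imdb` is an open connection to imdb.db

# sketch only: the `ratings` table and the 'movie' type value are assumptions
genres <- c("Romance", "Comedy", "Animation")

avg_rating_for_genre <- function(g) {
  query <- sprintf(
    "SELECT AVG(r.rating) AS avg_rating
       FROM titles AS t
       INNER JOIN ratings AS r ON r.title_id = t.title_id
      WHERE t.type = 'movie' AND t.genres LIKE '%%%s%%';",
    tolower(g)
  )
  dbGetQuery(imdb, query)$avg_rating
}

sapply(genres, avg_rating_for_genre)
----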
- -To query rows which have the genre Action as one of its genres: - -[source,SQL] ----- -SELECT * FROM titles WHERE genres LIKE '%action%'; ----- - -.Items to submit -==== -- Any code you used to solve the problem in a code chunk. -- The average rating of each of the genres listed for movies. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project01.adoc deleted file mode 100644 index fd0ebf83a..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project01.adoc +++ /dev/null @@ -1,170 +0,0 @@ -= STAT 39000: Project 1 -- Fall 2020 - -**Motivation:** In this project we will jump right into an R review. In this project we are going to break one larger data-wrangling problem into discrete parts. There is a slight emphasis on writing functions and dealing with strings. At the end of this project we will have greatly simplified a dataset, making it easy to dig into. - -**Context:** We just started the semester and are digging into a large dataset, and in doing so, reviewing R concepts we've previously learned. - -**Scope:** data wrangling in R, functions - -.Learning objectives -**** -- Comprehend what a function is, and the components of a function in R. -- Read and write basic (csv) data. -- Utilize apply functions in order to solve a data-driven problem. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -You can find useful examples that walk you through relevant material in The Examples Book: - -https://thedatamine.github.io/the-examples-book - -It is highly recommended to read through, search, and explore these examples to help solve problems in this project. - -[IMPORTANT] -==== -It is highly recommended that you use https://rstudio.scholar.rcac.purdue.edu/. Simply click on the link and login using your Purdue account credentials. -==== - -We decided to move away from ThinLinc and away from the version of RStudio used last year (https://desktop.scholar.rcac.purdue.edu). The version of RStudio is known to have some strange issues when running code chunks. - -Remember the very useful documentation shortcut `?`. To use, simply type `?` in the console, followed by the name of the function you are interested in. - -You can also look for package documentation by using `help(package=PACKAGENAME)`, so for example, to see the documentation for the package `ggplot2`, we could run: - -[source,r] ----- -help(package=ggplot2) ----- - -Sometimes it can be helpful to see the source code of a defined function. A https://www.tutorialspoint.com/r/r_functions.htm[function] is any chunk of organized code that is used to perform an operation. Source code is the underlying `R` or `c` or `c++` code that is used to create the function. To see the source code of a defined function, type the function's name without the `()`. For example, if we were curious about what the function `Reduce` does, we could run: - -[source,r] ----- -Reduce ----- - -Occasionally this will be less useful as the resulting code will be code that calls `c` code we can't see. Other times it will allow you to understand the function better. - -== Dataset: - -`/class/datamine/data/airbnb` - -Often times (maybe even the majority of the time) data doesn't come in one nice file or database. Explore the datasets in `/class/datamine/data/airbnb`. 
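Before starting the questions, it can be worth a quick look at how the files are organized. Here is a minimal sketch using base R; the counts and folder names you see will depend on the snapshot of the data, but the paths generally follow a country/state/city/date/data layout.

[source,r]
----
# peek at the folder structure under the airbnb data directory
all_files <- list.files("/class/datamine/data/airbnb", recursive = TRUE, full.names = TRUE)
length(all_files)  # total number of files
head(all_files)    # sample of full paths
----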
- -== Questions - -[IMPORTANT] -==== -Please make sure to **double check** that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. -==== - -[IMPORTANT] -==== -Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to lose points. -==== - -=== Question 1 - -You may have noted that, for each country, city, and date we can find 3 files: `calendar.csv.gz`, `listings.csv.gz`, and `reviews.csv.gz` (for now, we will ignore all files in the "visualisations" folders). - -Let's take a look at the data in each of the three types of files. Pick a country, city and date, and read the first 50 rows of each of the 3 datasets (`calendar.csv.gz`, `listings.csv.gz`, and `reviews.csv.gz`). Provide 1-2 sentences explaining the type of information found in each, and what variable(s) could be used to join them. - -[TIP] -==== -`read.csv` has an argument to select the number of rows we want to read. -==== - -[TIP] -==== -Depending on the country that you pick, the listings and/or the reviews might not display properly in RMarkdown. So you do not need to display the first 50 rows of the listings and/or reviews, in your RMarkdown document. It is OK to just display the first 50 rows of the calendar entries. -==== - -To read a compressed csv, simply use the `read.csv` function: - -[source,r] ----- -dat <- read.csv("/class/datamine/data/airbnb/brazil/rj/rio-de-janeiro/2019-06-19/data/calendar.csv.gz") -head(dat) ----- - -Let's work towards getting this data into an easier format to analyze. From now on, we will focus on the `listings.csv.gz` datasets. - -.Items to submit -==== -- Chunk of code used to read the first 50 rows of each dataset. -- 1-2 sentences briefly describing the information contained in each dataset. -- Name(s) of variable(s) that could be used to join them. -==== - -=== Question 2 - -Write a function called `get_paths_for_country`, that, given a string with the country name, returns a vector with the full paths for all `listings.csv.gz` files, starting with `/class/datamine/data/airbnb/...`. - -For example, the output from `get_paths_for_country("united-states")` should have 28 entries. Here are the first 5 entries in the output: - ----- - [1] "/class/datamine/data/airbnb/united-states/ca/los-angeles/2019-07-08/data/listings.csv.gz" - [2] "/class/datamine/data/airbnb/united-states/ca/oakland/2019-07-13/data/listings.csv.gz" - [3] "/class/datamine/data/airbnb/united-states/ca/pacific-grove/2019-07-01/data/listings.csv.gz" - [4] "/class/datamine/data/airbnb/united-states/ca/san-diego/2019-07-14/data/listings.csv.gz" - [5] "/class/datamine/data/airbnb/united-states/ca/san-francisco/2019-07-08/data/listings.csv.gz" ----- - -[TIP] -==== -`list.files` is useful with the `recursive=T` option. -==== - -[TIP] -==== -Use `grep` to search for the pattern `listings.csv.gz` (within the results from the first hint), and use the option `value=T` to display the values found by the `grep` function. -==== - -.Items to submit -==== -- Chunk of code for your `get_paths_for_country` function. 
-==== - -=== Question 3 - -Write a function called `get_data_for_country` that, given a string with the country name, returns a data.frame containing the all listings data for that country. Use your previously written function to help you. - -[TIP] -==== -Use `stringsAsFactors=F` in the `read.csv` function. -==== - -[TIP] -==== -Use `do.call(rbind, <listofdataframes>)` to combine a list of dataframes into a single dataframe. -==== - -.Items to submit -==== -- Chunk of code for your `get_data_for_country` function. -==== - -=== Question 4 - -Use your `get_data_for_country` to get the data for a country of your choice, and make sure to name the data.frame `listings`. Take a look at the following columns: `host_is_superhost`, `host_has_profile_pic`, `host_identity_verified`, and `is_location_exact`. What is the data type for each column? (You can use `class` or `typeof` or `str` to see the data type.) - -These columns would make more sense as logical values (TRUE/FALSE/NA). - -Write a function called `transform_column` that, given a column containing lowercase "t"s and "f"s, your function will transform it to logical (TRUE/FALSE/NA) values. Note that NA values for these columns appear as blank (`""`), and we need to be careful when transforming the data. Test your function on column `host_is_superhost`. - -.Items to submit -==== -- Chunk of code for your `transform_column` function. -- Type of `transform_column(listings$host_is_superhost)`. -==== - -=== Question 5 - -Create a histogram for response rates (`host_response_rate`) for super hosts (where `host_is_superhost` is `TRUE`). If your listings do not contain any super hosts, load data from a different country. Note that we first need to convert `host_response_rate` from a character containing "%" signs to a numeric variable. - -.Items to submit -==== -- Chunk of code used to answer the question. -- Histogram of response rates for super hosts. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project02.adoc deleted file mode 100644 index 997cc587c..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project02.adoc +++ /dev/null @@ -1,198 +0,0 @@ -= STAT 39000: Project 2 -- Fall 2020 - -**Motivation:** The ability to quickly reproduce an analysis is important. It is often necessary that other individuals will need to be able to understand and reproduce an analysis. This concept is so important there are classes solely on reproducible research! In fact, there are papers that investigate and highlight the lack of reproducibility in various fields. If you are interested in reading about this topic, a good place to start is the paper titled https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124["Why Most Published Research Findings Are False"] by John Ioannidis (2005). - -**Context:** Making your work reproducible is extremely important. We will focus on the computational part of reproducibility. We will learn RMarkdown to document your analyses so others can easily understand and reproduce the computations that led to your conclusions. Pay close attention as future project templates will be RMarkdown templates. - -**Scope:** Understand Markdown, RMarkdown, and how to use it to make your data analysis reproducible. - -.Learning objectives -**** -- Use Markdown syntax within an Rmarkdown document to achieve various text transformations. 
-- Use RMarkdown code chunks to display and/or run snippets of code. -**** - -== Questions - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_8rsq5yrn&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_bjrv34ss"></iframe> -++++ - -=== Question 1 - -Make the following text (including the asterisks) bold: `This needs to be **very** bold`. Make the following text (including the underscores) italicized: `This needs to be _very_ italicized.` - -[IMPORTANT] -==== -Surround your answer in 4 backticks. This will allow you to display the markdown _without_ having the markdown "take effect". For example: - -`````markdown -```` -Some *marked* **up** text. -```` -````` -==== - -[TIP] -==== -Be sure to check out the https://rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf[Rmarkdown Cheatsheet] and our section on https://thedatamine.github.io/the-examples-book/r.html#r-rmarkdown[Rmarkdown in the book]. -==== - -[NOTE] -==== -Rmarkdown is essentially Markdown + the ability to run and display code chunks. In this question, we are actually using Markdown within Rmarkdown! -==== - - -.Items to submit -==== -- 2 lines of markdown text, surrounded by 4 backticks. Note that when compiled, this text will be unmodified, regular text. -==== - -=== Question 2 - -Create an unordered list of your top 3 favorite academic interests (some examples could include: machine learning, operating systems, forensic accounting, etc.). Create another *ordered* list that ranks your academic interests in order of most interested to least interested. - -[TIP] -==== -You can learn what ordered and unordered lists are [here](https://rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf). -==== - -[NOTE] -==== -Similar to (1), in this question we are dealing with Markdown. If we were to copy and paste the solution to this problem in a Markdown editor, it would be the same result as when we Knit it here. -==== - -.Items to submit -==== -- Create the lists, this time don't surround your code in backticks. Note that when compiled, this text will appear as nice, formatted lists. -==== - -=== Question 3 - -Browse https://www.linkedin.com/ and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown. Include the following: - -- A header for this section (your choice of size) that says "About". -- The text of your personal "About" section that you would feel comfortable uploading to linkedin, including at least 1 link. - -.Items to submit -==== -- Create the described profile, don't surround your code in backticks. -==== - -=== Question 4 - -LaTeX is a powerful editing tool where you can create beautifully formatted equations and formulas. Replicate the equation found https://wikimedia.org/api/rest_v1/media/math/render/svg/87c061fe1c7430a5201eef3fa50f9d00eac78810[here] as closely as possible. - -[TIP] -==== -Lookup "latex mid" and "latex frac". 
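As a quick illustration of those two commands (this is _not_ the equation you are asked to replicate, just the building blocks), a conditional probability typeset with `\mid` and `\frac` looks like this:

[source,latex]
----
% generic example of \mid and \frac, not the target equation
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
----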
-==== - -.Items to submit -==== -- Replicate the equation using LaTeX under the Question 4 header in your template. -==== - -=== Question 5 - -Your co-worker wrote a report, and has asked you to beautify it. Knowing Rmarkdown, you agreed. Make improvements to this section. At a minimum: - -- Make the title pronounced. -- Make all links appear as a word or words, rather than the long-form URL. -- Organize all code into code chunks where code and output are displayed. If the output is really long, just display the code. -- Make the calls to the `library` function be evaluated but not displayed. -- Make sure all warnings and errors that may eventually occur, do not appear in the final document. - -Feel free to make any other changes that make the report more visually pleasing. - -````markdown -`r ''````{r my-load-packages} -library(ggplot2) -``` - -`r ''````{r declare-variable-390, eval=FALSE} -my_variable <- c(1,2,3) -``` - -All About the Iris Dataset - -This paper goes into detail about the `iris` dataset that is built into r. You can find a list of built-in datasets by visiting https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html or by running the following code: - -data() - -The iris dataset has 5 columns. You can get the names of the columns by running the following code: - -names(iris) - -Alternatively, you could just run the following code: - -iris - -The second option provides more detail about the dataset. - -According to https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html there is another dataset built-in to r called `iris3`. This dataset is 3 dimensional instead of 2 dimensional. - -An iris is a really pretty flower. You can see a picture of one here: - -https://www.gardenia.net/storage/app/public/guides/detail/83847060_mOptimized.jpg - -In summary. I really like irises, and there is a dataset in r called `iris`. -```` - -.Items to submit -==== -- Make improvements to this section, and place it all under the Question 5 header in your template. -==== - -=== Question 6 - -Create a plot using a built-in dataset like `iris`, `mtcars`, or `Titanic`, and display the plot using a code chunk. Make sure the code used to generate the plot is hidden. Include a descriptive caption for the image. Make sure to use an RMarkdown chunk option to create the caption. - -.Items to submit -==== -- Code chunk under that creates and displays a plot using a built-in dataset like `iris`, `mtcars`, or `Titanic`. -==== - -=== Question 7 - -Insert the following code chunk under the Question 7 header in your template. Try knitting the document. Two things will go wrong. What is the first problem? What is the second problem? - -````markdown -```{r my-load-packages}`r ''` -plot(my_variable) -``` -```` - -[TIP] -==== -Take a close look at the name we give our code chunk. -==== - -[TIP] -==== -Take a look at the code chunk where `my_variable` is declared. -==== - -.Items to submit -==== -- The modified version of the inserted code that fixes both problems. -- A sentence explaining what the first problem was. -- A sentence explaining what the second problem was. -==== - -=== For Project 2, please submit your .Rmd file and the resulting .pdf file. (For this project, you do not need to submit a .R file.) - -=== OPTIONAL QUESTION - -RMarkdown is also an excellent tool to create a slide deck. 
Use the information https://rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf[here] or https://thedatamine.github.io/the-examples-book/r.html#how-do-i-create-a-set-of-slides-using-rmarkdown[here] to convert your solutions into a slide deck rather than the regular PDF. You may experiment with `slidy`, `ioslides` or `beamer`, however, make your final set of solutions use `beamer` as the output is a PDF. Make any needed modifications to make the solutions knit into a well-organized slide deck (For example, include slide breaks and make sure the contents are shown completely.). Modify (2) so the bullets are incrementally presented as the slides progress. - -[IMPORTANT] -==== -You do _not_ need to submit the original PDF for this project, just the `beamer` slide version of the PDF. -==== - -.Items to submit -==== -- The modified version of the solutions in `beamer` slide form. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project03.adoc deleted file mode 100644 index 6ae98fe85..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project03.adoc +++ /dev/null @@ -1,214 +0,0 @@ -= STAT 39000: Project 3 -- Fall 2020 - -**Motivation:** The ability to navigate a shell, like `bash`, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful `bash` tools, help you navigate a filesystem, and even run `bash` tools from within an RMarkdown file in RStudio. - -**Context:** At this point in time, you will each have varying levels of familiarity with Scholar. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within RStudio in an RMarkdown file. - -**Scope:** bash, RStudio - -.Learning objectives -**** -- Distinguish differences in /home, /scratch, and /class. -- Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc. -- Analyzing file in a UNIX filesystem: wc, du, cat, head, tail, etc. -- Creating and destroying files and folder in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc. -- Utilize other Scholar resources: rstudio.scholar.rcac.purdue.edu, notebook.scholar.rcac.purdue.edu, desktop.scholar.rcac.purdue.edu, etc. -- Use `man` to read and learn about UNIX utilities. -- Run `bash` commands from within and RMarkdown file in RStudio. -**** - -There are a variety of ways to connect to Scholar. In this class, we will _primarily_ connect to RStudio Server by opening a browser and navigating to https://rstudio.scholar.rcac.purdue.edu/, entering credentials, and using the excellent RStudio interface. 
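One of the learning objectives above is running `bash` from within an RMarkdown file. As a minimal sketch (the commands shown are just placeholders), a `bash` code chunk in an .Rmd looks like this:

```{bash, eval=F}
# any bash commands can go inside a bash chunk in your .Rmd
pwd
ls
```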
- -Here is a video to remind you about some of the basic tools you can use in UNIX/Linux: - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_9mz5s0wd&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_0y4x1feo"></iframe> -++++ - -This is the easiest book for learning this stuff; it is short and gets right to the point: - -https://learning.oreilly.com/library/view/learning-the-unix/0596002610 - -You just log in and you can see it all; we suggest Chapters 1, 3, 4, 5, 7 (you can basically skip chapters 2 and 6 the first time through). - -It is a very short read (maybe, say, 2 or 3 hours altogether?), just a thin book that gets right to the details. - -== Questions - -=== Question 1 - -Navigate to https://rstudio.scholar.rcac.purdue.edu/ and login. Take some time to click around and explore this tool. We will be writing and running Python, R, SQL, and `bash` all from within this interface. Navigate to `Tools > Global Options ...`. Explore this interface and make at least 2 modifications. List what you changed. - -Here are some changes Kevin likes: - -- Uncheck "Restore .Rdata into workspace at startup". -- Change tab width 4. -- Check "Soft-wrap R source files". -- Check "Highlight selected line". -- Check "Strip trailing horizontal whitespace when saving". -- Uncheck "Show margin". - -(Dr. Ward does not like to customize his own environment, but he does use the emacs key bindings: Tools > Global Options > Code > Keybindings, but this is only recommended if you already know emacs.) - -.Items to submit -==== -- List of modifications you made to your Global Options. -==== - -=== Question 2 - -There are four primary panes, each with various tabs. In one of the panes there will be a tab labeled "Terminal". Click on that tab. This terminal by default will run a `bash` shell right within Scholar, the same as if you connected to Scholar using ThinLinc, and opened a terminal. Very convenient! - -What is the default directory of your bash shell? - -[TIP] -==== -Start by reading the section on `man`. `man` stands for manual, and you can find the "official" documentation for the command by typing `man <command_of_interest>`. For example: -==== - -[source,bash] ----- -# read the manual for the `man` command -# use "k" or the up arrow to scroll up, "j" or the down arrow to scroll down -man man ----- - -.Items to submit -==== -- The full filepath of default directory (home directory). Ex: Kevin's is: `/home/kamstut` -- The `bash` code used to show your home directory or current directory (also known as the working directory) when the `bash` shell is first launched. -==== - -=== Question 3 - -Learning to navigate away from our home directory to other folders, and back again, is vital. Perform the following actions, in order: - -- Write a single command to navigate to the folder containing our full datasets: `/class/datamine/data`. -- Write a command to confirm you are in the correct folder. 
-- Write a command to list the files and directories within the data directory. (You do not need to recursively list subdirectories and files contained therein.) What are the names of the files and directories? -- Write another command to return back to your home directory. -- Write a command to confirm you are in the correct folder. - -Note: `/` is commonly referred to as the root directory in a linux/unix filesystem. Think of it as a folder that contains _every_ other folder in the computer. `/home` is a folder within the root directory. `/home/kamstut` is the full filepath of Kevin's home directory. There is a folder `home` inside the root directory. Inside `home` is another folder named `kamstut` which is Kevin's home directory. - -.Items to submit -==== -- Command used to navigate to the data directory. -- Command used to confirm you are in the data directory. -- Command used to list files and folders. -- List of files and folders in the data directory. -- Command used to navigate back to the home directory. -- Commnad used to confirm you are in the home directory. -==== - -=== Question 4 - -Let's learn about two more important concepts. `.` refers to the current working directory, or the directory displayed when you run `pwd`. Unlike `pwd` you can use this when navigating the filesystem! So, for example, if you wanted to see the contents of a file called `my_file.txt` that lives in `/home/kamstut` (so, a full path of `/home/kamstut/my_file.txt`), and you are currently in `/home/kamstut`, you could run: `cat ./my_file.txt`. - -`..` represents the parent folder or the folder in which your current folder is contained. So let's say I was in `/home/kamstut/projects/` and I wanted to get the contents of the file `/home/kamstut/my_file.txt`. You could do: `cat ../my_file.txt`. - -When you navigate a directory tree using `.` and `..` you create paths that are called _relative_ paths because they are _relative_ to your current directory. Alternatively, a _full_ path or (_absolute_ path) is the path starting from the root directory. So `/home/kamstut/my_file.txt` is the _absolute_ path for `my_file.txt` and `../my_file.txt` is a _relative_ path. Perform the following actions, in order: - -- Write a single command to navigate to the data directory. -- Write a single command to navigate back to your home directory using a _relative_ path. Do not use `~` or the `cd` command without a path argument. - -.Items to submit -==== -- Command used to navigate to the data directory. -- Command used to navigate back to your home directory that uses a _relative_ path. -==== - -=== Question 5 - -In Scholar, when you want to deal with _really_ large amounts of data, you want to access scratch (you can read more https://www.rcac.purdue.edu/policies/scholar/[here]). Your scratch directory on Scholar is located here: `/scratch/scholar/$USER`. `$USER` is an environment variable containing your username. Test it out: `echo /scratch/scholar/$USER`. Perform the following actions: - -- Navigate to your scratch directory. -- Confirm you are in the correct location. -- Execute `myquota`. -- Find the location of the `myquota` bash script. -- Output the first 5 and last 5 lines of the bash script. -- Count the number of lines in the bash script. -- How many kilobytes is the script? - -[TIP] -==== -You could use each of the commands in the relevant topics once. -==== - -[TIP] -==== -When you type `myquota` on Scholar there are sometimes two warnings about `xauth` but sometimes there are no warnings. 
If you get a warning that says `Warning: untrusted X11 forwarding setup failed: xauth key data not generated` it is safe to ignore this error. -==== - -[TIP] -==== -Commands often have _options_. _Options_ are features of the program that you can trigger specifically. You can see the _options_ of a command in the `DESCRIPTION` section of the `man` pages. For example: `man wc`. You can see `-m`, `-l`, and `-w` are all options for `wc`. To test this out: - -[source,bash] ----- -# using the default wc command. "/class/datamine/data/flights/1987.csv" is the first "argument" given to the command. -wc /class/datamine/data/flights/1987.csv -# to count the lines, use the -l option -wc -l /class/datamine/data/flights/1987.csv -# to count the words, use the -w option -wc -w /class/datamine/data/flights/1987.csv -# you can combine options as well -wc -w -l /class/datamine/data/flights/1987.csv -# some people like to use a single tack `-` -wc -wl /class/datamine/data/flights/1987.csv -# order doesn't matter -wc -lw /class/datamine/data/flights/1987.csv ----- -==== - -[TIP] -==== -The `-h` option for the `du` command is useful. -==== - -.Items to submit -==== -- Command used to navigate to your scratch directory. -- Command used to confirm your location. -- Output of `myquota`. -- Command used to find the location of the `myquota` script. -- Absolute path of the `myquota` script. -- Command used to output the first 5 lines of the `myquota` script. -- Command used to output the last 5 lines of the `myquota` script. -- Command used to find the number of lines in the `myquota` script. -- Number of lines in the script. -- Command used to find out how many kilobytes the script is. -- Number of kilobytes that the script takes up. -==== - -=== Question 6 - -Perform the following operations: - -- Navigate to your scratch directory. -- Copy and paste the file: `/class/datamine/data/flights/1987.csv` to your current directory (scratch). -- Create a new directory called `my_test_dir` in your scratch folder. -- Move the file you copied to your scratch directory, into your new folder. -- Use `touch` to create an empty file named `im_empty.txt` in your scratch folder. -- Remove the directory `my_test_dir` _and_ the contents of the directory. -- Remove the `im_empty.txt` file. - -[TIP] -==== -`rmdir` may not be able to do what you think, instead, check out the options for `rm` using `man rm`. -==== - -.Items to submit -==== -- Command used to navigate to your scratch directory. -- Command used to copy the file, `/class/datamine/data/flights/1987.csv` to your current directory (scratch). -- Command used to create a new directory called `my_test_dir` in your scratch folder. -- Command used to move the file you copied earlier `1987.csv` into your new `my_test_dir` folder. -- Command used to create an empty file named `im_empty.txt` in your scratch folder. -- Command used to remove the directory _and_ the contents of the directory `my_test_dir`. -- Command used to remove the `im_empty.txt` file. -==== - -=== Question 7 - -Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." or if you disagree with this statement, please consult with us at datamine@purdue.edu for an alternative plan. 
\ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project04.adoc deleted file mode 100644 index 468cab60d..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project04.adoc +++ /dev/null @@ -1,193 +0,0 @@ -= STAT 39000: Project 4 -- Fall 2020 - -**Motivation:** The need to search files and datasets based on the text held within is common during various parts of the data wrangling process. `grep` is an extremely powerful UNIX tool that allows you to do so using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated, https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/[even professionals can make critical mistakes]. With that being said, learning some of the basics is an incredible tool that will come in handy regardless of the language you are working in. - -**Context:** We've just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, `grep`, and experiment with regular expressions using `grep`, R, and later on, Python. - -**Scope:** grep, regular expression basics, utilizing regular expression tools in R and Python - -.Learning objectives -**** -- Use `grep` to search for patterns within a dataset. -- Use `cut` to section off and slice up data from the command line. -- Use `wc` to count the number of lines of input. -**** - -You can find useful examples that walk you through relevant material in The Examples Book: - -https://the-examples-book.com/book/ - -It is highly recommended to read through, search, and explore these examples to help solve problems in this project. - -[IMPORTANT] -==== -I would highly recommend using single quotes `'` to surround your regular expressions. Double quotes can have unexpected behavior due to some shell's expansion rules. In addition, pay close attention to escaping certain https://unix.stackexchange.com/questions/20804/in-a-regular-expression-which-characters-need-escaping[characters] in your regular expressions. -==== - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/movies_and_tv/the_office_dialogue.csv` - -A public sample of the data can be found here: https://www.datadepot.rcac.purdue.edu/datamine/data/movies-and-tv/the_office_dialogue.csv[the_office_dialogue.csv] - -Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset. - -`grep` stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate `grep`, we will be using it with textual data. You can read about and see examples of `grep` https://thedatamine.github.io/the-examples-book/unix.html#grep[here]. - -== Question - -=== Question 1 - -Login to Scholar and use `grep` to find the dataset we will use this project. The dataset we will use is the only dataset to have the text "Bears. Beets. Battlestar Galactica.". What is the name of the dataset and where is it located? - -.Items to submit -==== -- The `grep` command used to find the dataset. -- The name and location in Scholar of the dataset. 
-- Use `grep` and `grepl` within R to solve a data-driven problem. -==== - -=== Question 2 - -`grep` prints the line that the text you are searching for appears in. In project 3 we learned a UNIX command to quickly print the first _n_ lines from a file. Use this command to get the headers for the dataset. As you can see, each line in the tv show is a row in the dataset. You can count to see which column the various bits of data live in. - -Write a line of UNIX commands that searches for "bears. beets. battlestar galactica." and, rather than printing the entire line, prints only the character who speaks the line, as well as the line itself. - -[TIP] -==== -The result if you were to search for "bears. beets. battlestar galactica." should be: - ----- -"Jim","Fact. Bears eat beets. Bears. Beets. Battlestar Galactica." ----- -==== - -[TIP] -==== -One method to solve this problem would be to pipe the output from `grep` to `cut`. -==== - -.Items to submit -==== -- The line of UNIX commands used to find the character and original dialogue line that contains "bears. beets. battlestar galactica.". -==== - -=== Question 3 - -Find all of the lines where Pam is called "Beesley" instead of "Pam" or "Pam Beesley". - -[TIP] -==== -A negative lookbehind would be one way to solve this, in order to use a negative lookbehind with `grep` make sure to add the -P option. In addition, make sure to use single quotes to make sure your regular expression is taken literally. If you use double quotes, variables are expanded. -==== - -Regular expressions are really a useful semi-language-agnostic tool. What this means is regardless of the programming language you are using, there will be some package that allows you to use regular expressions. In fact, we can use them in both R and Python! This can be particularly useful when dealing with strings. Load up the dataset you discovered in (1) using `read.csv`. Name the resulting data.frame `dat`. - -.Items to submit -==== -- The UNIX command used to solve this problem. -==== - -=== Question 4 - -The `text_w_direction` column in `dat` contains the characters' lines with inserted direction that helps characters know what to do as they are reciting the lines. Direction is shown between square brackets "[" "]". In this two-part question, we are going to use regular expression to detect the directions. - -(a) Create a new column called `has_direction` that is set to `TRUE` if the `text_w_direction` column has direction, and `FALSE` otherwise. Use the `grepl` function in R to accomplish this. - -[TIP] -==== -Make sure all opening brackets "[" have a corresponding closing bracket "]". -==== - -[TIP] -==== -Think of the pattern as any line that has a [, followed by any amount of any text, followed by a ], followed by any amount of any text. -==== - -(b) Modify your regular expression to find lines with 2 or more sets of direction. How many lines have more than 2 directions? Modify your code again and find how many have more than 5. - -We count the sets of direction in each line by the pairs of square brackets. The following are two simple example sentences. - ----- -This is a line with [emphasize this] only 1 direction! -This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug]. ----- - -Your solution to part (a) should find both lines a match. However, in part (b) we want the regular expression pattern to find only lines with 2+ directions, so the first line would not be a match. 
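If `grepl` is new to you, here is a tiny, self-contained illustration (the vector below is made up, not taken from the dataset) of how it returns one `TRUE`/`FALSE` per element:

```{r, eval=F}
# made-up vector, just to show that grepl() returns one TRUE/FALSE per element
x <- c("no direction here", "a line [with direction] in it")
grepl("\\[", x)   # FALSE TRUE -- only the second element contains a literal "["
```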
- -In our actual dataset, for example, `dat$text_w_direction[2789]` is a line with 2 directions. - -.Items to submit -==== -- The R code and regular expression used to solve the first part of this problem. -- The R code and regular expression used to solve the second part of this problem. -- How many lines have >= 2 directions? -- How many lines have >= 5 directions? -==== - -=== Question 5 - -Use the `str_extract_all` function from the `stringr` package to extract the direction(s) as well as the text between direction(s) from each line. Put the strings in a new column called `direction`. - ----- -This is a line with [emphasize this] only 1 direction! -This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug]. ----- - -In this question, your solution may have extracted: - ----- -[emphasize this] -[emphasize this] 2 sets of direction, do you see the difference [shrug] ----- - -It is okay to keep the text between neighboring pairs of "[" and "]" for the second line. - -.Items to submit -==== -- The R code used to solve this problem. -==== - -=== OPTIONAL QUESTION - -Repeat (5) but this time make sure you only capture the brackets and text within the brackets. Save the results in a new column called `direction_correct`. You can test to see if it is working by running the following code: - -```{r, eval=F} -dat$direction_correct[747] -``` - ----- -This is a line with [emphasize this] only 1 direction! -This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug]. ----- - -In (5), your solution may have extracted: - ----- -[emphasize this] -[emphasize this] 2 sets of direction, do you see the difference [shrug] ----- - -This is ok for (5). In this question, however, we want to fix this to only extract: - ----- -[emphasize this] -[emphasize this] [shrug] ----- - -[TIP] -==== -This regular expression will be hard to read. -==== - -[TIP] -==== -The pattern we want is: literal opening bracket, followed by 0+ of any character other than the literal [ or literal ], followed by a literal closing bracket. -==== - -.Items to submit -==== -- The R code used to solve this problem. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project05.adoc deleted file mode 100644 index e53794127..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project05.adoc +++ /dev/null @@ -1,171 +0,0 @@ -= STAT 39000: Project 5 -- Fall 2020 - -**Motivation:** Becoming comfortable stringing together commands and getting used to navigating files in a terminal is important for every data scientist to do. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc. - -**Context:** We've been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping. - -**Scope:** grep, regular expression basics, UNIX utilities, redirection, piping - -.Learning objectives -**** -- Use `cut` to section off and slice up data from the command line. -- Use piping to string UNIX commands together. -- Use `sort` and it's options to sort data in different ways. -- Use `head` to isolate _n_ lines of output. -- Use `wc` to summarize the number of lines in a file or in output. 
-- Use `uniq` to filter out non-unique lines. -- Use `grep` to search files effectively. -**** - -You can find useful examples that walk you through relevant material in The Examples Book: - -https://the-examples-book.com/book/ - -It is highly recommended to read through, search, and explore these examples to help solve problems in this project. - -Don't forget the very useful documentation shortcut `?` for R code. To use, simply type `?` in the console, followed by the name of the function you are interested in. In the Terminal, you can use the `man` command to check the documentation of `bash` code. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/amazon/amazon_fine_food_reviews.csv` - -A public sample of the data can be found here: https://www.datadepot.rcac.purdue.edu/datamine/data/amazon/amazon_fine_food_reviews.csv[amazon_fine_food_reviews.csv] - -Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset. - -Here are three videos that might also be useful, as you work on Project 5: - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_033gzti4&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_s3x23xpl"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_b3pvmwfh&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_b01m3m83"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_wf3zmtmy&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_v55nhwhp"></iframe> -++++ - - -== Questions - -=== Question 1 - -What is the `Id` of the most helpful review, according to the highest `HelpfulnessNumerator`? 
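If you need a reminder of the general pattern (shown here on a hypothetical file, not the review data), numerically sorting on one comma-separated column and keeping only the top line looks roughly like this:

[source,bash]
----
# hypothetical file: sort numerically (-n), in reverse (-r), on the 3rd comma-separated field, keep the top line
sort -t, -k3,3 -nr some_file.csv | head -n 1
----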
- -[IMPORTANT] -==== -You can always pipe output to `head` in case you want the first few values of a lot of output. Note that if you used `sort` before `head`, you may see the following error messages: - ----- -sort: write failed: standard output: Broken pipe -sort: write error ----- - -This is because `head` would truncate the output from `sort`. This is okay. See https://stackoverflow.com/questions/46202653/bash-error-in-sort-sort-write-failed-standard-output-broken-pipe[this discussion] for more details. -==== - -.Items to submit -==== -- Line of UNIX commands used to solve the problem. -- The `Id` of the most helpful review. -==== - -=== Question 2 - -Some entries under the `Summary` column appear more than once. Calculate the proportion of unique summaries over the total number of summaries. Use two lines of UNIX commands to find the numerator and the denominator, and manually calculate the proportion. - -To further clarify what we mean by _unique_, if we had the following vector in R, `c("a", "b", "a", "c")`, its unique values are `c("a", "b", "c")`. - -.Items to submit -==== -- Two lines of UNIX commands used to solve the problem. -- The ratio of unique `Summary`'s. -==== - -=== Question 3 - -Use a chain of UNIX commands, piped in a sequence, to create a frequency table of `Score`. - -.Items to submit -==== -- The line of UNIX commands used to solve the problem. -- The frequency table. -==== - -=== Question 4 - -Who is the user with the highest number of reviews? There are two columns you could use to answer this question, but which column do you think would be most appropriate and why? - -[TIP] -==== -You may need to pipe the output to `sort` multiple times. -==== - -[TIP] -==== -To create the frequency table, read through the `man` pages for `uniq`. Man pages are the "manual" pages for UNIX commands. You can read through the man pages for uniq by running the following: - -[source,bash] ----- -man uniq ----- -==== - -.Items to submit -==== -- The line of UNIX commands used to solve the problem. -- The frequency table. -==== - -=== Question 5 - -Anecdotally, there seems to be a tendency to leave reviews when we feel strongly (either positive or negative) about a product. For the user with the highest number of reviews (i.e., the user identified in question 4), would you say that they follow this pattern of extremes? Let's consider 5 star reviews to be strongly positive and 1 star reviews to be strongly negative. Let's consider anything in between neither strongly positive nor negative. - -[TIP] -==== -You may find the solution to problem (3) useful. -==== - -.Items to submit -==== -- The line of UNIX commands used to solve the problem. -==== - -=== Question 6 - -Find the most helpful review with a `Score` of 5. Then (separately) find the most helpful review with a `Score` of 1. As before, we are considering the most helpful review to be the review with the highest `HelpfulnessNumerator`. - -[TIP] -==== -You can use multiple lines to solve this problem. -==== - -.Items to submit -==== -- The lines of UNIX commands used to solve the problem. -- `ProductId`'s of both requested reviews. -==== - -=== Question 7 - -For *only* the two `ProductId` from the previous question, create a new dataset called `scores.csv` that contains all `ProductId` and `Score` from all reviews for these two items. - -.Items to submit -==== -- The line of UNIX commands used to solve the problem. -==== - -=== OPTIONAL QUESTION - -Use R to load up `scores.csv` into a new data.frame called `dat`. 
Create a histogram for each products' `Score`. Compare the most helpful review `Score` with those given in the histogram. Based on this comparison, point out some curiosities about the product that may be worth exploring. For example, if a product receives many high scores, but has a super helpful review that gives the product 1 star, I may tend to wonder if the product is not as great as it seems to be. - -.Items to submit -==== -- R code used to create the histograms. -- 3 histograms, 1 for each `ProductId`. -- 1-2 sentences describing the curious pattern that you would like to further explore. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project06.adoc deleted file mode 100644 index 4e5c64910..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project06.adoc +++ /dev/null @@ -1,215 +0,0 @@ -= STAT 39000: Project 6 -- Fall 2020 - -**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. `awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks. - -**Context:** This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently. - -**Scope:** awk, UNIX utilities, bash scripts - -.Learning objectives -**** -- Use `awk` to process and manipulate textual data. -- Use piping and redirection within the terminal to pass around data between utilities. -- Use output created from the terminal to create a plot using R. -**** - -== Dataset - -The following questions will use the dataset found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/YYYY.csv[here] or in Scholar: - -`/class/datamine/data/flights/subset/YYYY.csv` - -An example from 1987 data can be found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/1987.csv[here] or in Scholar: - -`/class/datamine/data/flights/subset/1987.csv` - -== Questions - -=== Question 1 - -In previous projects we learned how to get a single column of data from a csv file. Write 1 line of UNIX commands to print the 17th column, the `Origin`, from `1987.csv`. Write another line, this time using `awk` to do the same thing. Which one do you prefer, and why? 
- -Here is an example, from a different data set, to illustrate some differences and similarities between cut and awk: - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_hmf7lr7b&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_6tkg6zzx"></iframe> -++++ - -.Items to submit -==== -- One line of UNIX commands to solve the problem *without* using `awk`. -- One line of UNIX commands to solve the problem using `awk`. -- 1-2 sentences describing which method you prefer and why. -==== - -=== Question 2 - -Write a bash script that accepts a year (1987, 1988, etc.) and a column *n* and returns the *nth* column of the associated year of data. - -Here are two examples to illustrate how to write a bash script: - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_gkirnxfb&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_0qtbjjlt"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_e14gbfiq&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_8tyncb6q"></iframe> -++++ - -In this example, you only need to turn in the content of your bash script (starting with `#!/bin/bash`) without evaluation in a code chunk. However, you should test your script before submission to make sure it works. To actually test out your bash script, take the following example. The script is simple and just prints out the first two arguments given to it: - -```{bash, eval=F} -#!/bin/bash -echo "First argument: $1" -echo "Second argument: $2" -``` - -If you simply drop that text into a file called `my_script.sh`, located here: `/home/$USER/my_script.sh`, and if you run the following: - -```{bash, eval=F} -# Setup bash to run; this only needs to be run one time per session. -# It makes bash behave a little more naturally in RStudio. 
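# (exec replaces the current shell process with a fresh bash shell)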
-exec bash -# Navigate to the location of my_script.sh -cd /home/$USER -# Make sure that the script is runable. -# This only needs to be done one time for each new script that you write. -chmod 755 my_script.sh -# Execute my_script.sh -./my_script.sh okay cool -``` - -then it will print: - ----- -First argument: okay -Second argument: cool ----- - -In this example, if we were to turn in the "content of your bash script (starting with `#!/bin/bash`) in a code chunk, our solution would look like this: - -```{bash, eval=F} -#!/bin/bash -echo "First argument: $1" -echo "Second argument: $2" -``` - -And although we aren't running the code chunk above, we know that it works because we tested it in the terminal. - -[TIP] -==== -Using `awk` you could have a script with just two lines: 1 with the "hash-bang" (`#!/bin/bash`), and 1 with a single `awk` command. -==== - -.Items to submit -==== -- The content of your bash script (starting with `#!/bin/bash`) in a code chunk. -==== - -=== Question 3 - -How many flights arrived at Indianapolis (IND) in 2008? First solve this problem without using `awk`, then solve this problem using *only* `awk`. - -Here is a similar example, using the election data set: - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_mzv1gtb1&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_mv06yedm"></iframe> -++++ - -.Items to submit -==== -- One line of UNIX commands to solve the problem *without* using `awk`. -- One line of UNIX commands to solve the problem using `awk`. -- The number of flights that arrived at Indianapolis (IND) in 2008. -==== - -=== Question 4 - -Do you expect the number of unique origins and destinations to be the same based on flight data in the year 2008? Find out, using any command line tool you'd like. Are they indeed the same? How many unique values do we have per category (`Origin`, `Dest`)? - -Here is an example to help you with the last part of the question, about Origin-to-Destination pairs. We analyze the city-state pairs from the election data: - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_7vly78sw&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_tba174p2"></iframe> -++++ - -.Items to submit -==== -- 1-2 sentences explaining whether or not you expect the number of unique origins and destinations to be the same. 
-- The UNIX command(s) used to figure out if the number of unique origins and destinations are the same. -- The number of unique values per category (`Origin`, `Dest`). -==== - -=== Question 5 - -In (4) we found that there are not the same number of unique `Origin` as `Dest`. Find the https://en.wikipedia.org/wiki/International_Air_Transport_Association_code#Airport_codes[IATA airport code] for all `Origin` that don't appear in a `Dest` and all `Dest` that don't appear in an `Origin` in the 2008 data. - -[TIP] -==== -The examples on https://www.tutorialspoint.com/unix_commands/comm.htm[this] page should help. Note that these examples are based on https://tldp.org/LDP/abs/html/process-sub.html[Process Substitution], which basically allows you to specify commands whose output would be used as the input of `comm`. There should be no space between the open bracket and open parenthesis, otherwise your bash will not work as intended. -==== - -.Items to submit -==== -- The line(s) of UNIX command(s) used to answer the question. -- The list of all `Origin` that don't appear in `Dest`. -- The list of all `Dest` that don't appear in `Origin`. -==== - -=== Question 6 - -What was the percentage of flights in 2008 per unique `Origin` with the `Dest` of "IND"? What percentage of flights had "PHX" as `Origin` (among all flights with `Dest` of "IND")? - -Here is an example using the percentages of donations contributed from CEOs from various States: - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_4r4bx3by&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_43qkeojx"></iframe> -++++ - -[TIP] -==== -You can do the mean calculation in awk by dividing the result from (3) by the number of unique `Origin` that have a `Dest` of "IND". -==== - -.Items to submit -==== -- The percentage of flights in 2008 per unique `Origin` with the `Dest` of "IND". -- 1-2 sentences explaining how "PHX" compares (as a unique `ORIGIN`) to the other `Origin`s (all with the `Dest` of "IND")? -==== - -=== Question 7 - -Write a bash script that takes a year and IATA airport code and returns the year, and the total number of flights to and from the given airport. Example rows may look like: - ----- -1987, 12345 -1988, 44 ----- - -Run the script with inputs: `1991` and `ORD`. Include the output in your submission. - -.Items to submit -==== -- The content of your bash script (starting with "#!/bin/bash") in a code chunk. -- The output of the script given `1991` and `ORD` as inputs. -==== - -=== OPTIONAL QUESTION 1 - -Pick your favorite airport and get its IATA airport code. Write a bash script that, given the first year, last year, and airport code, runs the bash script from (7) for all years in the provided range for your given airport, or loops through all of the files for the given airport, appending all of the data to a new file called `my_airport.csv`. 
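If looping over a range of years in `bash` is unfamiliar, here is a generic, hypothetical skeleton (the argument handling and the inner command are placeholders, not the solution):

```{bash, eval=F}
#!/bin/bash
# placeholders only: $1 is the first year, $2 is the last year
for year in $(seq "$1" "$2"); do
    echo "would process year $year here"
done
```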
- -.Items to submit -==== -- The content of your bash script (starting with "#!/bin/bash") in a code chunk. -==== - -=== OPTIONAL QUESTION 2 - -In R, load `my_airport.csv` and create a line plot showing the year-by-year change. Label your x-axis "Year", your y-axis "Num Flights", and your title the name of the IATA airport code. Write 1-2 sentences with your observations. - -.Items to submit -==== -- Line chart showing year-by-year change in flights into and out of the chosen airport. -- R code used to create the chart. -- 1-2 sentences with your observations. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project07.adoc deleted file mode 100644 index 1d0623492..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project07.adoc +++ /dev/null @@ -1,153 +0,0 @@ -= STAT 39000: Project 7 -- Fall 2020 - -**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. `awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks. - -**Context:** This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently. - -**Scope:** awk, UNIX utilities, bash scripts - -.Learning objectives -**** -- Use `awk` to process and manipulate textual data. -- Use piping and redirection within the terminal to pass around data between utilities. -**** - -== Dataset: - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/flights/subset/YYYY.csv` - -An example of the data for the year 1987 can be found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/1987.csv[here]. - -Sometimes if you are about to dig into a dataset, it is good to quickly do some sanity checks early on to make sure the data is what you expect it to be. - -== Questions - -[IMPORTANT] -==== -Please make sure to **double check** that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. -==== - -[IMPORTANT] -==== -Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to lose points. -==== - -=== Question 1 - -Write a line of code that prints a list of the unique values in the `DayOfWeek` column. Write a line of code that prints a list of the unique values in the `DayOfMonth` column. Write a line of code that prints a list of the unique values in the `Month` column. Use the `1987.csv` dataset. Are the results what you expected? - -.Items to submit -==== -- 3 lines of code used to get a list of unique values for the chosen columns. 
-- 1-2 sentences explaining whether or not the results are what you expected. -==== - -=== Question 2 - -Our files should have 29 columns. For a given file, write a line of code that prints any lines that do *not* have 29 columns. Test it on `1987.csv`, were there any rows without 29 columns? - -[TIP] -==== -Checking built-in variables for `awk`, we see that `NF` may be useful! -==== - -.Items to submit -==== -- Line of code used to solve the problem. -- 1-2 sentences explaining whether or not there were any rows without 29 columns. -==== - -=== Question 3 - -Write a bash script that, given a "begin" year and "end" year, cycles through the associated files and prints any lines that do *not* have 29 columns. - -.Items to submit -==== -- The content of your bash script (starting with "#!/bin/bash") in a code chunk. -- The results of running your bash scripts from year 1987 to 2008. -==== - -=== Question 4 - -`awk` is a really good tool to quickly get some data and manipulate it a little bit. The column `Distance` contains the distances of the flights in miles. Use `awk` to calculate the total distance traveled by the flights in 1990, and show the results in both miles and kilometers. To convert from miles to kilometers, simply multiply by 1.609344. - -Below is some example output: - ----- -Miles: 12345 -Kilometers: 19867.35168 ----- - -.Items to submit -==== -- The code used to solve the problem. -- The results of running the code. -==== - -=== Question 5 - -Use `awk` to calculate the sum of the number of `DepDelay` minutes, grouped according to `DayOfWeek`. Use `2007.csv`. - -Below is some example output: - -```txt -DayOfWeek: 0 -1: 1234567 -2: 1234567 -3: 1234567 -4: 1234567 -5: 1234567 -6: 1234567 -7: 1234567 -``` - -[NOTE] -==== -1 is Monday. -==== - -.Items to submit -==== -- The code used to solve the problem. -- The output from running the code. -==== - -=== Question 6 - -It wouldn't be fair to compare the total `DepDelay` minutes by `DayOfWeek` as the number of flights may vary. One way to take this into account is to instead calculate an average. Modify (5) to calculate the average number of `DepDelay` minutes by the number of flights per `DayOfWeek`. Use `2007.csv`. - -Below is some example output: - -```txt -DayOfWeek: 0 -1: 1.234567 -2: 1.234567 -3: 1.234567 -4: 1.234567 -5: 1.234567 -6: 1.234567 -7: 1.234567 -``` - -.Items to submit -==== -- The code used to solve the problem. -- The output from running the code. -==== - -=== Question 7 - -Anyone who has flown knows how frustrating it can be waiting for takeoff, or deboarding the aircraft. These roughly translate to `TaxiOut` and `TaxiIn` respectively. If you were to fly into or out of IND what is your expected total taxi time? Use `2007.csv`. - -[NOTE] -==== -Taxi times are in minutes. -==== - -.Items to submit -==== -- The code used to solve the problem. -- The output from running the code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project08.adoc deleted file mode 100644 index 8bbcb1036..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project08.adoc +++ /dev/null @@ -1,148 +0,0 @@ -= STAT 39000: Project 8 -- Fall 2020 - -**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. 
In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. `awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks. - -**Context:** This is the last part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently. - -**Scope:** awk, UNIX utilities, bash scripts - -.Learning objectives -**** -- Use `awk` to process and manipulate textual data. -- Use piping and redirection within the terminal to pass around data between utilities. -**** - -== Dataset: - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/flights/subset/YYYY.csv` - -An example of the data for the year 1987 can be found https://www.datadepot.rcac.purdue.edu/datamine/data/flights/subset/1987.csv[here]. - -== Questions - -[IMPORTANT] -==== -Please make sure to **double check** that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. -==== - -[IMPORTANT] -==== -Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to lose points. -==== - -=== Question 1 - -Let's say we have a theory that there are more flights on the weekend days (Friday, Saturday, Sunday) than the rest of the days, on average. We can use awk to quickly check it out and see if maybe this looks like something that is true! - -Write a line of `awk` code that, prints the _total_ number of flights that occur on weekend days, followed by the _total_ number of flights that occur on the weekdays. Complete this calculation for 2008 using the `2008.csv` file. - -Modify your code to instead print the average number of flights that occur on weekend days, followed by the average number of flights that occur on the weekdays. - -[TIP] -==== -You don't need a large if statement to do this, you can use the `~` comparison operator. -==== - -.Items to submit -==== -- Lines of `awk` code that solves the problem. -- The result: the number of flights on the weekend days, followed by the number of flights on the weekdays for the flights during 2008. -- The result: the average number of flights on the weekend days, followed by the average number of flights on the weekdays for the flights during 2008. -==== - -=== Question 2 - -We want to look to see if there may be some truth to the whole "snow bird" concept where people will travel to warmer states like Florida and Arizona during the Winter. Let's use the tools we've learned to explore this a little bit. - -Take a look at `airports.csv`. In particular run the following: - -```{bash, eval=F} -head airports.csv -``` - -Notice how all of the non-numeric text is surrounded by quotes. The surrounding quotes would need to be escaped for any comparison within `awk`. 
This is messy and we would prefer to create a new file called `new_airports.csv` without any quotes. Write a line of code to do this. - -[NOTE] -==== -You may be wondering *why* we are asking you to do this. This sort of situation (where you need to deal with quotes) happens a lot! It's important to practice and learn ways to fix these things. -==== - -[TIP] -==== -You could use `gsub` within `awk` to replace '"' with ''. You can find how to use `gsub` https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html[here]. -==== - -[TIP] -==== -If you leave out the column number argument to `gsub` it will apply the substitution to every field in every column. -==== - -[TIP] -==== -```{bash, eval=F} -cat new_airports.csv | wc -l # should be 159 without header -``` -==== - -.Items to submit -==== -- Line of `awk` code used to create the new dataset. -==== - -=== Question 3 - -Write a line of commands that creates a new dataset called `az_fl_airports.txt`. `az_fl_airports.txt` should _only_ contain a list of airport codes for all airports from both Arizona (AZ) and Florida (FL). Use the file we created in (3),`new_airports.csv` as a starting point. - -How many airports are there? Did you expect this? Use a line of bash code to count this. - -Create a new dataset (called `az_fl_flights.txt`) that contains all of the data for flights into or out of Florida and Arizona (using the `2008.csv` file). Use the newly created dataset, `az_fl_airports.txt` to accomplish this. - -[TIP] -==== -https://unix.stackexchange.com/questions/293684/basic-grep-awk-help-extracting-all-lines-containing-a-list-of-terms-from-one-f -==== - -[TIP] -==== -```{bash, eval=F} -cat az_fl_flights.txt | wc -l # should be 484705 -``` -==== - -.Items to submit -==== -- All UNIX commands used to answer the questions. -- The number of airports. -- 1-2 sentences explaining whether you expected this number of airports. -==== - -=== Question 4 - -Write a bash script that accepts the start year, end year, and filename containing airport codes (`az_fl_airports.txt`), and outputs the data for flights into or out of any of the airports listed in the provided filename (`az_fl_airports.txt`). The script should output data for flights using _all_ of the years of data in the provided range. Run the bash script to create a new file called `az_fl_flights_total.csv`. - -.Items to submit -==== -- The content of your bash script (starting with "#!/bin/bash") in a code chunk. -- The line of UNIX code you used to execute the script and create the new dataset. -==== - -=== Question 5 - -Use the newly created dataset, `az_fl_flights_total.csv`, from question 4 to calculate the total number of flights into and out of both states by month, and by year, for a total of 3 columns (year, month, flights). Export this information to a new file called `snowbirds.csv`. - -Load up your newly created dataset and use either R or Python (or some other tool) to create a graphic that illustrates whether or not we believe the "snowbird effect" effects flights. Include a description of your graph, as well as your (anecdotal) conclusion. - -[TIP] -==== -You can use 1 dimensional arrays to accomplish this if the key is the combination of, for example, the year and month. -==== - -.Items to submit -==== -- The line of `awk` code used to create the new dataset, `snowbirds.csv`. -- Code used to create the visualization in a code chunk. -- The generated plot as either a png or jpg/jpeg. -- 1-2 sentences describing your plot and your conclusion. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project09.adoc deleted file mode 100644 index 36057d199..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project09.adoc +++ /dev/null @@ -1,290 +0,0 @@ -= STAT 39000: Project 9 -- Fall 2020 - -**Motivation:** Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. In fact, https://www.cloudflare.com/[cloudflare], a billion dollar company, had much of its starting infrastructure built on top of a Postgresql database (per https://news.ycombinator.com/item?id=22878136[this thread on hackernews]). Learning SQL is _well_ worth your time! - -**Context:** There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we've spent much of this semester in the terminal, we will start in the terminal using SQLite. - -**Scope:** SQL, sqlite - -.Learning objectives -**** -- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet. -- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause. -- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/lahman/lahman.db` - -This is the Lahman Baseball Database. You can find its documentation http://www.seanlahman.com/files/database/readme2017.txt[here], including the definitions of the tables and columns. - -== Questions - -[IMPORTANT] -==== -Please make sure to **double check** that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. -==== - -[IMPORTANT] -==== -Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to lose points. -==== - -[IMPORTANT] -==== -For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd -==== - -````markdown -```{r, include=F}`r ''` -library(RSQLite) -lahman <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/lahman/lahman.db") -``` -```` - -Each solution should then be placed in a code chunk like this: - -````markdown -```{sql, connection=lahman}`r ''` -SELECT * FROM batting LIMIT 1; -``` -```` - -If you want to use a SQLite-specific function like `.tables` (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command: - -[source,bash] ----- -sqlite3 /class/datamine/data/lahman/lahman.db ----- - -From there, the SQLite-specific commands will function properly. They will _not_ function properly in an SQL code chunk. 
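[NOTE]
====
For example, dot-commands such as the following only work at the `sqlite3` prompt (a short, illustrative sample): `.tables` lists the tables in the database, `.headers on` shows column names in query results, and `.schema` prints the table definitions.

[source,sql]
----
.tables
.headers on
.schema
----
====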
To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option `eval=F` like this: - -````markdown -```{sql, connection=lahman, eval=F}`r ''` -SELECT * FROM batting LIMIT 1; -``` -```` - -This will allow the code to be displayed without throwing an error. - -=== Question 1 - -Connect to RStudio Server https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal and access the Lahman database. How many tables are available? - -[TIP] -==== -To connect to the database, do the following: -==== - -```{bash, eval=F} -sqlite3 /class/datamine/data/lahman/lahman.db -``` - -[TIP] -==== -https://database.guide/2-ways-to-list-tables-in-sqlite-database/[This] is a good resource. -==== - -.Items to submit -==== -- How many tables are available in the Lahman database? -- The sqlite3 commands used to figure out how many tables are available. -==== - -=== Question 2 - -Some people like to try to https://www.washingtonpost.com/graphics/2017/sports/how-many-mlb-parks-have-you-visited/[visit all 30 MLB ballparks] in their lifetime. Use SQL commands to get a list of `parks` and the cities they're located in. For your final answer, limit the output to 10 records/rows. - -[NOTE] -==== -There may be more than 30 parks in your result, this is ok. For long results, you can limit the number of printed results using the `LIMIT` clause. -==== - -[TIP] -==== -Make sure you take a look at the column names and get familiar with the data tables. If working from the Terminal, to see the header row as a part of each query result, run the following: - -[source,SQL] ----- -.headers on ----- -==== - -.Items to submit -==== -- SQL code used to solve the problem. -- The first 10 results of the query. -==== - -=== Question 3 - -There is nothing more exciting to witness than a home run hit by a batter. It's impressive if a player hits more than 40 in a season. Find the hitters who have hit 60 or more home runs (`HR`) in a season. List their `playerID`, `yearID`, home run total, and the `teamID` they played for. - -[TIP] -==== -There are 8 occurrences of home runs greater than or equal to 60. -==== - -[TIP] -==== -The `batting` table is where you should look for this question. -==== - -.Items to submit -==== -- SQL code used to solve the problem. -- The first 10 results of the query. -==== - -=== Question 4 - -Make a list of players born on your birth day (don't worry about the year). Display their first names, last names, and birth year. Order the list descending by their birth year. - -[TIP] -==== -The `people` table is where you should look for this question. -==== - -[NOTE] -==== -Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here]. -==== - -.Items to submit -==== -- SQL code used to solve the problem. -- The first 10 results of the query. -==== - -=== Question 5 - -Get the Cleveland (CLE) Pitching Roster from the 2016 season (`playerID`, `W`, `L`, `SO`). Order the pitchers by number of Strikeouts (SO) in descending order. - -[TIP] -==== -The `pitching` table is where you should look for this question. -==== - -[NOTE] -==== -Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here]. -==== - -.Items to submit -==== -- SQL code used to solve the problem. -- The first 10 results of the query. -==== - -=== Question 6 - -Find the 10 team and year pairs that have the most number of Errors (`E`) between 1960 and 1970. 
Display their Win and Loss counts too. What is the name of the team that appears in 3rd place in the ranking of the team and year pairs? - -[TIP] -==== -The `teams` table is where you should look for this question. -==== - -[TIP] -==== -The `BETWEEN` clause is useful here. -==== - -[TIP] -==== -It is OK to use multiple queries to answer the question. -==== - -[NOTE] -==== -Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here]. -==== - -.Items to submit -==== -- SQL code used to solve the problem. -- The first 10 results of the query. -==== - -=== Question 7 - -Find the `playerID` for Bob Lemon. What year and team was he on when he got the most wins as a pitcher (use table `pitching`)? What year and team did he win the most games as a manager (use table `managers`)? - -[TIP] -==== -It is OK to use multiple queries to answer the question. -==== - -[NOTE] -==== -There was a tie among the two years in which Bob Lemon had the most wins as a pitcher. -==== - -[NOTE] -==== -Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here]. -==== - -.Items to submit -==== -- SQL code used to solve the problem. -- The first 10 results of the query. -==== - -=== Question 8 - -For the https://en.wikipedia.org/wiki/American_League_West[AL West] (use `lgID` and `divID` to specify this), find the home run (`HR`), walk (`BB`), and stolen base (`SB`) totals by team between 2000 and 2010. Which team and year combo led in each category in the decade? - -[TIP] -==== -The `teams` table is where you should look for this question. -==== - -[TIP] -==== -It is OK to use multiple queries to answer the question. -==== - -[TIP] -==== -Use `divID == 'W'` as one of the conditions. Please note using double quotes: `divID == "W"` will not work. -==== - -[NOTE] -==== -Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here]. -==== - -.Items to submit -==== -- SQL code used to solve the problem. -- The team-year combination that ranked top in each category. -==== - -=== Question 9 - -Get a list of the following by year: wins (`W`), losses (`L`), Home Runs Hit (`HR`), homeruns allowed (`HRA`), and total home game attendance (`attendance`) for the Detroit Tigers when winning a World Series (`WSWin` is `Y`) or when winning league champion (`LgWin` is `Y`). - -[TIP] -==== -The `teams` table is where you should look for this question. -==== - -[TIP] -==== -Be careful with the order of operations for `AND` and `OR`. Remember you can force order of operations using parentheses. -==== - -[NOTE] -==== -Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here]. -==== - -.Items to submit -==== -- SQL code used to solve the problem. -- The first 10 results of the query. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project10.adoc deleted file mode 100644 index 6aac2e842..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project10.adoc +++ /dev/null @@ -1,200 +0,0 @@ -= STAT 39000: Project 10 -- Fall 2020 - -**Motivation:** Although SQL syntax may still feel unnatural and foreign, with more practice it _will_ start to make more sense. 
The ability to read and write SQL queries is a bread-and-butter skill for anyone working with data. - -**Context:** We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like `AVG`, `MIN`, and `MAX`. - -**Scope:** SQL, sqlite - -.Learning objectives -**** -- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet. -- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause. -- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc. -- Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems. -**** - -== Dataset - -The following questions will use the dataset similar to the one from Project 9, but this time we will use a MariaDB version of the database, which is also hosted on Scholar, at `scholar-db.rcac.purdue.edu`. -As in Project 9, this is the Lahman Baseball Database. You can find its documentation http://www.seanlahman.com/files/database/readme2017.txt[here], including the definitions of the tables and columns. - -== Questions - -[IMPORTANT] -==== -Please make sure to **double check** that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. -==== - -[IMPORTANT] -==== -Please make sure to look at your knit PDF *before* submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to lose points. -==== - -[IMPORTANT] -==== -For this project all solutions should be done using R code chunks, and the `RMariaDB` package. Run the following code to load the library: - -[source,r] ----- -library(RMariaDB) ----- -==== - -=== Question 1 - -Connect to RStudio Server https://rstudio.scholar.rcac.purdue.edu, and, rather than navigating to the terminal like we did in the previous project, instead, create a connection to our MariaDB lahman database using the `RMariaDB` package in R, and the credentials below. Confirm the connection by running the following code chunk: - -[source,r] ----- -con <- dbConnect(RMariaDB::MariaDB(), - host="scholar-db.rcac.purdue.edu", - db="lahmandb", - user="lahman_user", - password="HitAH0merun") -head(dbGetQuery(con, "SHOW tables;")) ----- - -[TIP] -==== -In the example provided, the variable `con` from the `dbConnect` function is the connection. Each query that you make, using the `dbGetQuery`, needs to use this connection `con`. You can change the name `con` if you want to (it is user defined), but if you change the name `con`, you need to change it on all of your connections. If your connection to the database dies while you are working on the project, you can always re-run the `dbConnect` line again, to reset your connection to the database. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your (potentially modified) `head(dbGetQuery(con, "SHOW tables;"))`. -==== - -=== Question 2 - -How many players are members of the 40/40 club? These are players that have stolen at least 40 bases (`SB`) and hit at least 40 home runs (`HR`) in one year. - -[TIP] -==== -Use the `batting` table. 
-==== - -[IMPORTANT] -==== -You only need to run `library(RMariaDB)` and the `dbConnect` portion of the code a single time towards the top of your project. After that, you can simply reuse your connection `con` to run queries. -==== - -[IMPORTANT] -==== -In our xref:templates.adoc[project template], for this project, make all of the SQL queries using the `dbGetQuery` function, which returns the results directly in `R`. Therefore, your `RMarkdown` blocks for this project should all be `{r}` blocks (as opposed to the `{sql}` blocks used in Project 9). -==== - -[TIP] -==== -You can use `dbGetQuery` to run your queries from within R. Example: - -[source,r] ----- -dbGetQuery(con, "SELECT * FROM battings LIMIT 5;") ----- -==== - -[NOTE] -==== -We already demonstrated the correct SQL query to use for the 40/40 club in the video below, but now we want you to use `RMariaDB` to solve this query. -==== - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_os59oucz&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_gy4f8y6j"></iframe> -++++ - -.Items to submit -==== -- R code used to solve the problem. -- The result of running the R code. -==== - -=== Question 3 - -How many times in total has Giancarlo Stanton struck out in years in which he played for "MIA" or "FLO"? - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_whzcdsrc&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_ruufmsf4"></iframe> -++++ - -[IMPORTANT] -==== -Questions in this project need to be solved using SQL when possible. You will not receive credit for a question if you use `sum` in R rather than `SUM` in SQL. -==== - -[TIP] -==== -Use the `people` table to find the `playerID` and use the `batting` table to find the statistics. -==== - -.Items to submit -==== -- R code used to solve the problem. -- The result of running the R code. -==== - -=== Question 4 - -The https://en.wikipedia.org/wiki/Batting_average_(baseball)[Batting Average] is a metric for a batter's performance. The Batting Average in a year is calculated by stem:[\frac{H}{AB}] (the number of hits divided by at-bats). Considering (only) the years between 2000 and 2010, calculate the (seasonal) Batting Average for each batter who had more than 300 at-bats in a season. List the top 5 batting averages next to `playerID`, `teamID`, and `yearID.` - -[TIP] -==== -Use the `batting` table. 
-==== - -.Items to submit -==== -- R code used to solve the problem. -- The result of running the R code. -==== - -=== Question 5 - -How many unique players have hit > 50 home runs (`HR`) in a season? - -[TIP] -==== -If you view `DISTINCT` as being paired with `SELECT`, instead, think of it as being paired with one of the fields you are selecting. -==== - -.Items to submit -==== -- R code used to solve the problem. -- The result of running the R code. -==== - -=== Question 6 - -Find the number of unique players that attended Purdue University. Start by finding the `schoolID` for Purdue and then find the number of players who played there. Do the same for IU. Who had more? Purdue or IU? Use the information you have in the database, and the power of R to create a misleading graphic that makes Purdue look better than IU, even if just at first glance. Make sure you label the graphic. - -[TIP] -==== -Use the `schools` table to get all `schoolID` and the `collegeplaying` table to get the statistics. -==== - -[TIP] -==== -You can mess with the scale of the y-axis. You could (potentially) filter the data to start from a certain year or be between two dates. -==== - -[TIP] -==== -To find IU's id, try the following query: `SELECT schoolID FROM schools WHERE name_full LIKE '%indiana%';`. You can find more about the LIKE clause and `%` https://www.tutorialspoint.com/sql/sql-like-clause.htm[here]. -==== - -.Items to submit -==== -- R code used to solve the problem. -- The result of running the R code. -==== - -=== Question 7 - -Use R, SQL and the lahman database to create an interesting infographic. For those of you who are not baseball fans, try doing a Google image search for "baseball plots" for inspiration. Make sure the plot is polished, has appropriate labels, color, etc. - -.Items to submit -==== -- R code used to solve the problem. -- The result of running the R code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project11.adoc deleted file mode 100644 index 439fe0043..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project11.adoc +++ /dev/null @@ -1,227 +0,0 @@ -= STAT 39000: Project 11 -- Fall 2020 - -**Motivation:** Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like MIN, MAX, and AVG in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values. - -**Context:** We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values using a much larger dataset! - -**Scope:** SQL, SQL in R - -.Learning objectives -**** -- Demonstrate the ability to interact with popular database management systems within R. -- Solve data-driven problems using a combination of SQL and R. -- Basic clauses: SELECT, ORDER BY, LIMIT, DESC, ASC, COUNT, WHERE, FROM, etc. -- Showcase the ability to filter, alias, and write subqueries. -- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. Explain when to use having, and when to use where. -**** - -== Dataset - -The following questions will use the `elections` database. Similar to Project 10, this database is hosted on Scholar. 
Moreover, Question 1 also involves the following data files found in Scholar: - -`/class/datamine/data/election/itcontYYYY.txt` (for example, data for year 1980 would be `/class/datamine/data/election/itcont1980.txt`) - -A public sample of the data can be found here: - -https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcontYYYY.txt (for example, data for year 1980 would be https://www.datadepot.rcac.purdue.edu/datamine/data/election/itcont1980.txt) - -== Questions - -[IMPORTANT] -==== -For this project you will need to connect to the database `elections` using the `RMariaDB` package in R. Include the following code chunk in the beginning of your RMarkdown file: - -````markdown -```{r setup-database-connection}`r ''` -library(RMariaDB) -con <- dbConnect(RMariaDB::MariaDB(), - host="scholar-db.rcac.purdue.edu", - db="elections", - user="elections_user", - password="Dataelect!98") -``` -```` -==== - -When a question involves SQL queries in this project, you may use a SQL code chunk (with `{sql}`), or an R code chunk (with `{r}`) and functions like `dbGetQuery` as you did in Project 10. Please refer to Question 5 in the xref:åtemplates.adoc[project template] for examples. - -=== Question 1 - -Approximately how large was the lahman database (use the sqlite database in Scholar: `/class/datamine/data/lahman/lahman.db`)? Use UNIX utilities you've learned about this semester to write a line of code to return the size of that .db file (in MB). - -The data we consider in this project are much larger. Use UNIX utilities (bash and awk) to write another line of code that calculates the total amount of data in the elections folder `/class/datamine/data/election/`. How much data (in MB) is there? - -The data in that folder has been added to the `elections` database, all aggregated in the `elections` table. Write a SQL query that returns the number of rows of data are in the database. How many rows of data are in the table `elections`? - -[NOTE] -==== -These are some examples of how to get the sizes of collections of files in UNIX: -==== - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_edernjri&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_7g6c4dt2"></iframe> -++++ - -[TIP] -==== -The SQL query will take some time! Be patient. -==== - -[NOTE] -==== -You may use more than one code chunk in your RMarkdown file for the different tasks. -==== - -[NOTE] -==== -We will accept values that represent either apparent or allocated size, as well as estimated disk usage. To get the size from `ls` and `du` to match, use the `--apparent-size` option with `du`. -==== - -[NOTE] -==== -A Megabyte (MB) is actually 1000^2 bytes, not 1024^2. A Mebibyte (MiB) is 1024^2 bytes. See https://en.wikipedia.org/wiki/Gigabyte[here] for more information. For this question, either solution will be given full credit. 
https://thedatamine.github.io/the-examples-book/unix.html#why-is-the-result-of-du--b-.metadata.csv-divided-by-1024-not-the-result-of-du--k-.metadata.csv[This] is a potentially useful example. -==== - -.Items to submit -==== -- Line of code (bash/awk) to show the size (in MB) of the lahman database file. -- Approximate size of the lahman database in MB. -- Line of code (bash/awk) to calculate the size (in MB) of the entire elections dataset in `/class/datamine/data/election`. -- The size of the elections data in MB. -- SQL query used to find the number of rows of data in the `elections` table in the `elections` database. -- The number of rows in the `elections` table in the `elections` database. -==== - -=== Question 2 - -Write a SQL query using the `LIKE` command to find a unique list of `zip_code` that start with "479". - -Write another SQL query and answer: How many unique `zip_code` are there that begin with "479"? - -[NOTE] -==== -Here are some examples about SQL that might be relevant for Questions 2 and 3 in this project. -==== - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_gplhe4dj&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_o71dngd6"></iframe> -++++ - -[TIP] -==== -The first query returns a list of zip codes, and the second returns a count. -==== - -[TIP] -==== -Make sure you only select `zip_code`. -==== - -.Items to submit -==== -- SQL queries used to answer the question. -- The first 5 results from running the query. -==== - -=== Question 3 - -Write a SQL query that counts the number of donations (rows) that are from Indiana. How many donations are from Indiana? Rewrite the query and create an _alias_ for our field so it doesn't read `COUNT(*)` but rather `Indiana Donations`. - -[TIP] -==== -You may enclose an alias's name in quotation marks (single or double) when the name contains space. -==== - -.Items to submit -==== -- SQL query used to answer the question. -- The result of the SQL query. -==== - -=== Question 4 - -Rewrite the query in (3) so the result is displayed like: `IN: 1234567`. Note, if instead of "IN" we wanted "OH", only the WHERE clause should be modified, and the display should automatically change to `OH: 1234567`. In other words, the state abbreviation should be dynamic, not static. 
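[NOTE]
====
As a generic illustration of the technique (the table `t` below is hypothetical, not part of the `elections` database), `CONCAT` can combine string literals with column values or aggregates, and an alias names the resulting column:

[source,sql]
----
SELECT CONCAT('total rows: ', COUNT(*)) AS summary FROM t;
----
====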
- -[NOTE] -==== -This video demonstrates how to use CONCAT in a MySQL query: -==== - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_nu7iovqo&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_31dt64kx"></iframe> -++++ - -[TIP] -==== -Use CONCAT and aliasing to accomplish this. -==== - -[TIP] -==== -Remember, `state` contains the state abbreviation. -==== - -.Items to submit -==== -- SQL query used to answer the question. -==== - -=== Question 5 - -In (2) we wrote a query that returns a unique list of zip codes that start with "479". In (3) we wrote a query that counts the number of donations that are from Indiana. Use our query from (2) as a sub-query to find how many donations come from areas with zip codes starting with "479". What percent of donations in Indiana come from said zip codes? - -[NOTE] -==== -This video gives two examples of sub-queries: -==== - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_d2zr7cmo&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_4us9nsy9"></iframe> -++++ - -[TIP] -==== -You can simply manually calculate the percent using the count in (2) and (5). -==== - -.Items to submit -==== -- SQL queries used to answer the question. -- The percentage of donations from Indiana from `zip_code`s starting with "479". -==== - -=== Question 6 - -In (3) we wrote a query that counts the number of donations that are from Indiana. When running queries like this, a natural "next question" is to ask the same question about another state. SQL gives us the ability to calculate functions in aggregate when grouping by a certain column. Write a SQL query that returns the state, number of donations from each state, the sum of the donations (`transaction_amt`). Which 5 states gave the most donations (highest count)? Order you result from most to least. 
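[NOTE]
====
The general shape of a "group, aggregate, and order" query is sketched below on a hypothetical table `t` with columns `grp` and `amt`; it is not the answer for the `elections` table, just the pattern:

[source,sql]
----
SELECT grp, COUNT(*) AS n, SUM(amt) AS total
FROM t
GROUP BY grp
ORDER BY n DESC
LIMIT 5;
----
====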
- -[NOTE] -==== -In this video we demonstrate `GROUP BY`, `ORDER BY`, `DESC`, and other aspects of MySQL that might help with this question: -==== - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_530klfwl&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_iej3zqtf"></iframe> -++++ - -[TIP] -==== -You may want to create an alias in order to sort. -==== - -.Items to submit -==== -- SQL query used to answer the question. -- Which 5 states gave the most donations? -==== - -=== Question 7 - -Write a query that gets the number of donations, and sum of donations, by year, for Indiana. Create one or more graphics that highlights the year-by-year changes. Write a short 1-2 sentences explaining your graphic(s). - -.Items to submit -==== -- SQL query used to answer the question. -- R code used to create your graphic(s). -- 1 or more graphics in png/jpeg format. -- 1-2 sentences summarizing your graphic(s). -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project12.adoc deleted file mode 100644 index 9764724b7..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project12.adoc +++ /dev/null @@ -1,186 +0,0 @@ -= STAT 39000: Project 12 -- Fall 2020 - -**Motivation:** Databases are comprised of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so we perform joins! In this project we will explore learn about and practice using joins on a database containing bike trip information from the Bay Area Bike Share. - -**Context:** We've introduced a variety of SQL commands that let you filter and extract information from a database in an systematic way. In this project we will introduce joins, a powerful method to combine data from different tables. - -**Scope:** SQL, sqlite, joins - -.Learning objectives -**** -- Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem. -- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. -- Showcase the ability to filter, alias, and write subqueries. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db` - -A public sample of the data can be found https://www.datadepot.rcac.purdue.edu/datamine/data/bay_area_bike_share/bay_area_bike_share.db[here]. - -[IMPORTANT] -==== -For this project all solutions should be done using SQL code chunks. 
To connect to the database, copy and paste the following before your solutions in your .Rmd: - -````markdown -```{r, include=F}`r ''` -library(RSQLite) -bikeshare <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/bay_area_bike_share/bay_area_bike_share.db") -``` -```` - -Each solution should then be placed in a code chunk like this: - -````markdown -```{sql, connection=bikeshare}`r ''` -SELECT * FROM station LIMIT 5; -``` -```` -==== - -If you want to use a SQLite-specific function like `.tables` (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command: - -```{bash, eval=F} -sqlite3 /class/datamine/data/bay_area_bike_share/bay_area_bike_share.db -``` - -From there, the SQLite-specific commands will function properly. They will _not_ function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option `eval=F` like this: - -````markdown -```{sql, connection=bikeshare, eval=F}`r ''` -SELECT * FROM station LIMIT 5; -``` -```` - -This will allow the code to be displayed without throwing an error. - -There are a variety of ways to join data using SQL. With that being said, if you are able to understand and use a LEFT JOIN and INNER JOIN, you can perform *all* of the other types of joins (RIGHT JOIN, FULL OUTER JOIN). - -== Questions - -=== Question 1 - -Aliases can be created for tables, fields, and even results of aggregate functions (like MIN, MAX, COUNT, AVG, etc.). In addition, you can combine fields using the `sqlite` concatenate operator `||` see https://www.sqlitetutorial.net/sqlite-string-functions/sqlite-concat/[here]. Write a query that returns the first 5 records of information from the `station` table formatted in the following way: - -`(id) name @ (lat, long)` - -For example: - -`(84) Ryland Park @ (37.342725, -121.895617)` - -[TIP] -==== -Here is a video about how to concatenate strings in SQLite. -==== - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_40z55oz9&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_a4f4x2k9"></iframe> -++++ - -.Items to submit -==== -- SQL query used to solve this problem. -- The first 5 records of information from the `station` table. -==== - -=== Question 2 - -There is a variety of interesting weather information in the `weather` table. Write a query that finds the average `mean_temperature_f` by `zip_code`. Which is on average the warmest `zip_code`? - -Use aliases to format the result in the following way: - -```{txt} -Zip Code|Avg Temperature -94041|61.3808219178082 -``` -Note that this is the output if you use `sqlite` in the terminal. While the output in your knitted pdf file may look different, you should name the columns accordingly. 
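[NOTE]
====
Column aliases are what control the displayed column names. A minimal sketch on a hypothetical table `t` (quoting the alias is what allows a space in the name):

[source,sql]
----
SELECT grp AS "Group", AVG(val) AS "Avg Value" FROM t GROUP BY grp;
----
====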
- -[TIP] -==== -Here is a video about GROUP BY, ORDER BY, DISTINCT, and COUNT -==== - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_soilqf5i&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_hwehjhwd"></iframe> -++++ - -.Items to submit -==== -- SQL query used to solve this problem. -- The results of the query copy and pasted. -==== - -=== Question 3 - -From (2) we can see that there are only 5 `zip_code`s with weather information. How many unique `zip_code`s do we have in the `trip` table? Write a query that finds the number of unique `zip_code`s in the `trip` table. Write another query that lists the `zip_code` and count of the number of times the `zip_code` appears. If we had originally assumed that the `zip_code` was related to the location of the trip itself, we were wrong. Can you think of a likely explanation for the unexpected `zip_code` values in the `trip` table? - -[TIP] -==== -There could be missing values in `zip_code`. We want to avoid them in SQL queries, for now. You can learn more about the missing values (or NULL) in SQL https://www.w3schools.com/sql/sql_null_values.asp[here]. -==== - -.Items to submit -==== -- SQL queries used to solve this problem. -- 1-2 sentences explainging what a possible explanation for the `zip_code`s could be. -==== - -=== Question 4 - -In (2) we wrote a query that finds the average `mean_temperature_f` by `zip_code`. What if we want to tack on our results in (2) to information from each row in the `station` table based on the `zip_code`? To do this, use an INNER JOIN. INNER JOIN combines tables based on specified fields, and returns only rows where there is a match in both the "left" and "right" tables. - -[TIP] -==== -Use the query from (2) as a sub query within your solution. -==== - -[TIP] -==== -Here is a video about JOIN and LEFT JOIN. -==== - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_5ugjqrhk&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_i4qp5bam"></iframe> -++++ - -.Items to submit -==== -- SQL query used to solve this problem. -==== - -=== Question 5 - -In (3) we alluded to the fact that many `zip_code` in the `trip` table aren't very consistent. Users can enter a zip code when using the app. This means that `zip_code` can be from anywhere in the world! 
With that being said, if the `zip_code` is one of the 5 `zip_code`s for which we have weather data (from question 2), we can add that weather information to matching rows of the `trip` table. In (4) we used an INNER JOIN to append some weather information to each row in the `station` table. For this question, write a query that performs an INNER JOIN and appends weather data from the `weather` table to the trip data from the `trip` table. Limit your output to 5 lines. - -[IMPORTANT] -==== -Notice that the weather data has about 1 row of weather information for each date and each zip code. This means you may have to join your data based on multiple constraints instead of just 1 like in (4). In the `trip` table, you can use `start_date` for the date information. -==== - -[TIP] -==== -You will want to wrap your dates and datetimes in https://www.sqlitetutorial.net/sqlite-date-functions/sqlite-date-function/[sqlite's `date` function] prior to comparison. -==== - -.Items to submit -==== -- SQL query used to solve this problem. -- First 5 lines of output. -==== - -=== Question 6 - -How many rows are in the result from (5) (when not limiting to 5 lines)? How many rows are in the `trip` table? As you can see, a large proportion of the data from the `trip` table did not match the data from the `weather` table, and therefore was removed from the result. What if we want to keep all of the data from the `trip` table and add on data from the `weather` table if we have a match? Write a query to accomplish this. How many rows are in the result? - -.Items to submit -==== -- SQL query used to find how many rows are in the result from (5). -- The number of rows in the result of (5). -- SQL query to find how many rows are in the `trip` table. -- The number of rows in the `trip` table. -- SQL query to keep all of the data from the `trip` table and add on matching data from the `weather` table when available. -- The number of rows in the result. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project13.adoc deleted file mode 100644 index 5124e4c19..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project13.adoc +++ /dev/null @@ -1,180 +0,0 @@ -= STAT 39000: Project 13 -- Fall 2020 - -**Motivation:** Databases you will work with won't necessarily come organized in the way that you like. Getting really comfortable writing longer queries where you have to perform many joins, alias fields and tables, and aggregate results, is important. In addition, gaining some familiarity with terms like _primary key_ and _foreign key_ will prove useful when you need to search for help online. In this project we will write some more complicated queries with a fun database. Proper preparation prevents poor performance, and that means practice! - -**Context:** We are towards the end of a series of projects that give you an opportunity to practice using SQL. In this project, we will reinforce topics you've already learned, with a focus on subqueries and joins. - -**Scope:** SQL, sqlite - -.Learning objectives -**** -- Write and run SQL queries in `sqlite` on real-world data. -- Identify primary and foreign keys in a SQL table. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/movies_and_tv/imdb.db` - -[IMPORTANT] -==== -For this project you will use SQLite to access the data.
To connect to the database, copy and paste the following before your solutions in your .Rmd: - -````markdown -```{r, include=F}`r ''` -library(RSQLite) -imdb <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db") -``` -```` -==== - -If you want to use a SQLite-specific function like `.tables` (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at https://rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command: - -```{bash, eval=F} -sqlite3 /class/datamine/data/movies_and_tv/imdb.db -``` - -From there, the SQLite-specific commands will function properly. They will _not_ function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option `eval=F` like this: - -````markdown -```{sql, connection=imdb, eval=F}`r ''` -SELECT * FROM titles LIMIT 5; -``` -```` - -This will allow the code to be displayed without throwing an error. - -== Questions - -=== Question 1 - -A primary key is a field in a table which uniquely identifies a row in the table. Primary keys _must_ be unique values, and this is enforced at the database level. A foreign key is a field whose value matches a primary key in a different table. A table can have 0-1 primary key, but it can have 0+ foreign keys. Examine the `titles` table. Do you think there are any primary keys? How about foreign keys? Now examine the `episodes` table. Based on observation and the column names, do you think there are any primary keys? How about foreign keys? - -[TIP] -==== -A primary key can also be a foreign key. -==== - -[TIP] -==== -Here are two videos. The first video will remind you how to find the names of all of the tables in the `imdb` database. The second video will introduce you to the `titles` and `episodes` tables in the `imdb` database. -==== - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_7ktvbhc9&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_ae112udc"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_wc2hl3xm&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_y7qo7jpu"></iframe> -++++ - -.Items to submit -==== -- List any primary or foreign keys in the `titles` table. 
-- List any primary or foreign keys in the `episodes` table. -==== - -=== Question 2 - -If you paste a `title_id` to the end of the following url, it will pull up the page for the title. For example, https://www.imdb.com/title/tt0413573 leads to the page for the TV series _Grey's Anatomy_. Write a SQL query to confirm that the `title_id` tt0413573 does indeed belong to _Grey's Anatomy_. Then browse imdb.com and find your favorite TV show. Get the `title_id` from the url of your favorite TV show and run the following query, to confirm that the TV show is in our database: - -[source,SQL] ----- -SELECT * FROM titles WHERE title_id='<title id here>'; ----- - -Make sure to replace "<title id here>" with the `title_id` of your favorite show. If your show does not appear, or has only a single season, pick another show until you find one we have in our database with multiple seasons. - -.Items to submit -==== -- SQL query used to confirm that `title_id` tt0413573 does indeed belong to _Grey's Anatomy_. -- The output of the query. -- The `title_id` of your favorite TV show. -- SQL query used to confirm the `title_id` for your favorite TV show. -- The output of the query. -==== - -=== Question 3 - -The `episode_title_id` column in the `episodes` table references titles of individual episodes of a TV series. The `show_title_id` references the titles of the show itself. With that in mind, write a query that gets a list of all `episodes_title_id` (found in the `episodes` table), with the associated `primary_title` (found in the `titles` table) for each episode of _Grey's Anatomy_. - -[TIP] -==== -This video shows how to extract titles of episodes in the `imdb` database. -==== - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_uhg3atol&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_wmo98brv"></iframe> -++++ - -.Items to submit -==== -- SQL query used to solve the problem in a code chunk. -==== - -=== Question 4 - -We want to write a query that returns the title and rating of the highest rated episode of your favorite TV show, which you chose in (2). In order to do so, we will break the task into two parts in (4) and (5). First, write a query that returns a list of `episode_title_id` (found in the `episodes` table), with the associated `primary_title` (found in the `titles` table) for each episode. - -[TIP] -==== -This part is just like question (3) but this time with your favorite TV show, which you chose in (2). -==== - -[TIP] -==== -This video shows how to use a subquery, to `JOIN` a total of three tables in the `imdb` database. 
-==== - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_jb8vd4nc&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_sc5yje1a"></iframe> -++++ - -.Items to submit -==== -- SQL query used to solve the problem in a code chunk. -- The first 5 results from your query. -==== - -=== Question 5 - -Write a query that adds the rating to the end of each episode. To do so, use the query you wrote in (4) as a subquery. Which episode has the highest rating? Is it also your favorite episode? - -[NOTE] -==== -Examples that utilize the relevant topics in this problem can be found xref:programming-languages:SQL:queries.adoc[here]. -==== - -.Items to submit -==== -- SQL query used to solve the problem in a code chunk. -- The `episode_title_id`, `primary_title`, and `rating` of the top rated episode from your favorite TV series, in question (2). -- A statement saying whether the highest rated episode is also your favorite episode. -==== - - -=== Question 6 - -Write a query that returns the `season_number` (from the `episodes` table), and average `rating` (from the `ratings` table) for each season of your favorite TV show from (2). Write another query that only returns the season number and `rating` for the highest rated season. Consider the highest rated season the season with the highest average. - -.Items to submit -==== -- The 2 SQL queries used to solve the problems in two code chunks. -==== - -=== Question 7 - -Write a query that returns the `primary_title` and `rating` of the highest rated episode per season for your favorite TV show from question (2). - -[NOTE] -==== -You can show one highest rated episode for each season, without the need to worry about ties. -==== - -.Items to submit -==== -- The SQL query used to solve the problem. -- The output from your query. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project14.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project14.adoc deleted file mode 100644 index 7a685957f..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project14.adoc +++ /dev/null @@ -1,169 +0,0 @@ -= STAT 39000: Project 14 -- Fall 2020 - -**Motivation:** As we learned earlier in the semester, bash scripts are a powerful tool when you need to perform repeated tasks in a UNIX-like system. In addition, sometimes preprocessing data using UNIX tools prior to analysis in R or Python is useful. Ample practice is integral to becoming proficient with these tools. As such, we will be reviewing topics learned earlier in the semester. - -**Context:** We've just ended a series of projects focused on SQL. In this project we will begin to review topics learned throughout the semester, starting writing bash scripts using the various UNIX tools we learned about in Projects 3 through 8. 
**Scope:** awk, UNIX utilities, bash scripts, fread - -.Learning objectives -**** -- Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc. -- Analyzing files in a UNIX filesystem: wc, du, cat, head, tail, etc. -- Creating and destroying files and folders in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc. -- Use grep to search files effectively. -- Use cut to section off data from the command line. -- Use piping to string UNIX commands together. -- Use awk for data extraction and preprocessing. -- Create bash scripts to automate a process or processes. -**** - -== Dataset - -The following questions will use ENTIRE_PLOTSNAP.csv from the data folder found in Scholar: - -`/anvil/projects/tdm/data/forest/` - -To read more about the ENTIRE_PLOTSNAP.csv file that you will be working with, see: - -https://www.uvm.edu/femc/data/archive/project/federal-forest-inventory-analysis-data-for/dataset/plot-level-data-gathered-through-forest/metadata#fields - -== Questions - -=== Question 1 - -Take a look at `ENTIRE_PLOTSNAP.csv`. Write a line of `awk` code that displays the `STATECD` followed by the number of rows with that `STATECD`. - -.Items to submit -==== -- Code used to solve the problem. -- Count of the following `STATECD`s: 1, 2, 4, 5, 6 -==== - -=== Question 2 - -Unfortunately, there isn't a very accessible list available that shows which state each `STATECD` represents. This is no problem for us, though: the dataset has `LAT` and `LON`! Write some bash that prints just the `STATECD`, `LAT`, and `LON`. - -[NOTE] -==== -There are 92 columns in our dataset: `awk -F, 'NR==1{print NF}' ENTIRE_PLOTSNAP.csv`. To create a list of `STATECD` to state, we only really need `STATECD`, `LAT`, and `LON`. Keeping the other 89 variables will keep our data at 2.6 GB. -==== - -.Items to submit -==== -- Code used to solve the problem. -- The output of your code piped to `head`. -==== - -=== Question 3 - -`fread` is a "Fast and Friendly File Finagler". It is part of the very popular `data.table` package in R. We will learn more about this package next semester. For now, read the documentation https://www.rdocumentation.org/packages/data.table/versions/1.12.8/topics/fread[here] and use the `cmd` argument in conjunction with your bash code from (2) to read the `STATECD`, `LAT`, and `LON` data into a `data.table` in your R environment. - -.Items to submit -==== -- Code used to solve the problem. -- The `head` of the resulting `data.table`. -==== - -=== Question 4 - -We are going to further understand the data from question (3) by finding the actual locations based on the `LAT` and `LON` columns. We can use the library `revgeo` to get a location given a pair of longitude and latitude values. `revgeo` uses a free API hosted by https://github.com/komoot/photon[photon] in order to do so. - -For example: - -[source,r] ----- -library(revgeo) -revgeo(longitude=-86.926153, latitude=40.427055, output='frame') ----- - -The code above will give you the address information in six columns, from the most-granular `housenumber` to the least-granular `country`. Depending on the coordinates, `revgeo` may or may not give you results for each column. For this question, we are going to keep only the `state` column. - -There are over 4 million rows in our dataset -- we do _not_ want to hit https://github.com/komoot/photon[photon's] API that many times.
Instead, we are going to do the following: - -* Unless you feel comfortable using `data.table`, convert your `data.table` to a `data.frame`: - -[source,r] ----- -my_dataframe <- data.frame(my_datatable) ----- - -* Calculate the average `LAT` and `LON` for each `STATECD`, and call the new `data.frame`, `dat`. This should result in 57 rows of lat/long pairs. - -* For each row in `dat`, run a reverse geocode and append the `state` to a new column called `STATE`. - -[TIP] -==== -To calculate the average `LAT` and `LON` for each `STATECD`, you could use the https://www.rdocumentation.org/packages/sqldf/versions/0.4-11[`sqldf`] package to run SQL queries on your `data.frame`. -==== - -[TIP] -==== -https://stackoverflow.com/questions/3505701/grouping-functions-tapply-by-aggregate-and-the-apply-family[`mapply`] is a useful apply function to use to solve this problem. -==== - -[TIP] -==== -Here is some extra help: - -[source,r] ----- -library(revgeo) -points <- data.frame(latitude=c(40.433663, 40.432104, 40.428486), longitude=c(-86.916584, -86.919610, -86.920866)) -# Note that the "output" argument gets passed to the "revgeo" function. -mapply(revgeo, points$longitude, points$latitude, output="frame") -# The output isn't in a great format, and we'd prefer to just get the "state" data. -# Let's wrap "revgeo" into another function that just gets "state" and try again. -get_state <- function(lon, lat) { - return(revgeo(lon, lat, output="frame")["state"]) -} -mapply(get_state, points$longitude, points$latitude) ----- -==== - -[IMPORTANT] -==== -It is okay to get "Not Found" for some of the addresses. -==== - -.Items to submit -==== -- Code used to solve the problem. -- The `head` of the resulting `data.frame`. -==== - -=== Question 5 - -Use the `leaflet`, `addTiles`, and `addCircles` functions from the `leaflet` package to map our average latitude and longitude data from question (4) to a map (should be a total of 57 lat/long pairs). - -[TIP] -==== -See https://thedatamine.github.io/the-examples-book/r.html#r-ggmap[here] for an example of adding points to a map. -==== - -.Items to submit -==== -- Code used to create the map. -- The map itself as output from running the code chunk. -==== - -=== Question 6 - -Write a bash script that accepts at least 1 argument, and performs a useful task using at least 1 dataset from the `forest` folder in `/anvil/projects/tdm/data/forest/`. An example of a useful task could be printing a report of summary statistics for the data. Feel free to get creative. Note that tasks must be non-trivial -- a bash script that counts the number of lines in a file is _not_ appropriate. Make sure to properly document (via comments) what your bash script does. Also ensure that your script returns columnar data with appropriate separating characters (for example a csv). - -.Items to submit -==== -- The content of your bash script starting from `#!/bin/bash`. -- Example output from running your script as intended. -- A description of what your script does. -==== - -=== Question 7 - -You used `fread` in question (2). Now use the `cmd` argument in conjunction with your script from (6) to read the script output into a `data.table` in your R environment. - -.Items to submit -==== -- The R code used to read in and preprocess your data using your bash script from (6). -- The `head` of the resulting `data.table`. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project15.adoc b/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project15.adoc deleted file mode 100644 index ec17ad6e2..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2020/39000/39000-f2020-project15.adoc +++ /dev/null @@ -1,85 +0,0 @@ -= STAT 39000: Project 15 -- Fall 2020 - -**Motivation:** We've done a lot of work with SQL this semester. Let's review concepts in this project and mix and match R and SQL to solve data-driven problems. - -**Context:** In this project, we will reinforce topics you've already learned, with a focus on SQL. - -**Scope:** SQL, sqlite, R - -.Learning objectives -**** -- Write and run SQL queries in `sqlite` on real-world data. -- Use SQL from within R. -**** - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/movies_and_tv/imdb.db` - -F.R.I.E.N.D.S is a popular tv show. They have an interesting naming convention for the names of their episodes. They all begin with the text "The One ...". There are 6 primary characters in the show: Chandler, Joey, Monica, Phoebe, Rachel, and Ross. Let's use SQL and R to take a look at how many times each characters' names appear in the title of the episodes. - -== Questions - -=== Question 1 - -Write a query that gets the `episode_title_id`, `primary_title`, `rating`, and `votes`, of all of the episodes of Friends (`title_id` is tt0108778). - -[TIP] -==== -You can slightly modify the solution to question (5) in project 13. -==== - -.Items to submit -==== -- SQL query used to answer the question. -- First 5 results of the query. -==== - -=== Question 2 - -Now that you have a working query, connect to the database and run the query to get the data into an R data frame. In previous projects, we learned how to used regular expressions to search for text. For each character, how many episodes `primary_title`s contained their name? - -.Items to submit -==== -- R code in a code chunk that was used to find the solution. -- The solution pasted below the code chunk. -==== - -=== Question 3 - -Create a graphic showing our results in (2) using your favorite package. Make sure the plot has a good title, x-label, y-label, and try to incorporate some of the following colors: #273c8b, #bd253a, #016f7c, #f56934, #016c5a, #9055b1, #eaab37. - -.Items to submit -==== -- The R code used to generate the graphic. -- The graphic in a png or jpg/jpeg format. -==== - -=== Question 4 - -Now we will turn our focus to other information in the database. Use a combination of SQL and R to find which of the following 3 genres has the highest average rating for movies (see `type` column from `titles` table): Romance, Comedy, Animation. In the `titles` table, you can find the genres in the `genres` column. There may be some overlap (i.e. a movie may have more than one genre), this is ok. - -To query rows which have the genre Action as one of its genres: - -[source,SQL] ----- -SELECT * FROM titles WHERE genres LIKE '%action%'; ----- - -.Items to submit -==== -- Any code you used to solve the problem in a code chunk. -- The average rating of each of the genres listed for movies. -==== - -=== Question 5 - -Write a function called `top_episode` in R which accepts the path to the `imdb.db` database, as well as the `title_id` of a tv series (for example, "tt0108778" or "tt1266020"), and returns the `season_number`, `episode_number`, `primary_title`, and `rating` of the highest rated episode in the series. 
Test it out on some of your favorite series, and share the results. - -.Items to submit -==== -- Any code you used to solve the problem in a code chunk. -- The results for at least 3 of your favorite tv series. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project01.adoc deleted file mode 100644 index 15978aaad..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project01.adoc +++ /dev/null @@ -1,265 +0,0 @@ -= STAT 19000: Project 1 -- Fall 2021 - -== Welcome to The Data Mine! - -**Motivation:** In this project we are going to jump head first into The Data Mine. We will load datasets into the R environment, and introduce some core programming concepts like variables, vectors, types, etc. As we will be "living" primarily in an IDE called Jupyter Lab, we will take some time to learn how to connect to it, configure it, and run code. - -**Context:** This is our first project as a part of The Data Mine. We will get situated, configure the environment we will be using throughout our time with The Data Mine, and jump straight into working with data! - -**Scope:** r, Jupyter Lab, Brown - -.Learning Objectives -**** -- Read about and understand computational resources available to you. -- Learn how to run R code in Jupyter Lab on Brown. -- Read and write basic (csv) data using R. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/flights/subset/1990.csv` -- `/depot/datamine/data/movies_and_tv/imdb.db` -- `/depot/datamine/data/disney/splash_mountain.csv` - -== Questions - -=== Question 1 - -For this course, projects will be solved using the Brown computing cluster: https://www.rcac.purdue.edu/compute/brown[Brown]. We may also use the Scholar computing cluster in the future (we have used Scholar in previous years). - -Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (RAM). Use the information on the provided webpages to calculate how many cores and how much memory is available _in total_ for both clusters, combined. - -Take a minute and figure out how many cores and how much memory is available on your own computer. If you do not have a computer, provide a link to a common computer and provide the information for it instead. - -.Items to submit -==== -- A sentence explaining how many cores and how much memory is available, in total, across all nodes on Brown. -- A sentence explaining how many cores and how much memory is available, in total, for your own computer. -==== - -=== Question 2 - -In previous semesters, we used a program called RStudio Server to run R code on Scholar and solve the projects. This year, instead, we will be using Jupyter Lab on the Brown cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster. - -Navigate and login to https://ondemand.anvil.rcac.purdue.edu using 2-factor authentication (ACCESS login on Duo Mobile). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward. 
- -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_dv46pmsw?wid=_983291"></iframe> -++++ - -Towards the middle of the top menu, there will be an item labeled btn:[My Interactive Sessions], click on btn:[My Interactive Sessions]. On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Datamine, Desktops, and GUIs. Under the Datamine section, you should see a button that says btn:[Jupyter Lab], click on btn:[Jupyter Lab]. - -If everything was successful, you should see a screen similar to the following. - -image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"] - -Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 1 CPU core and 3072 Mb or 4096 Mb of memory. We use the Brown cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine. - - -After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on btn:[Connect to Jupyter] to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following. - -image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"] - -There are 2 primary options that you will need to know about. - -f2021-s2022:: -The course kernel where Python code is run without any extra work, and you have the ability to run R code or SQL queries in the same environment. - -[TIP] -==== -To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/book/projects/templates[our template page]. -==== - -f2021-s2022-r:: -An alternative, native R kernel that you can use for projects with _just_ R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell. - -For now, let's focus on the f2021-s2022-r kernel. Click on btn:[f2021-s2022-r], and a fresh notebook will be created for you. - -Test it out! Run the following code in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on. What is the name of the node on Brown that you are running on? - -[source,r] ----- -system("hostname", intern=TRUE) ----- - -[TIP] -==== -To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu. -==== - -.Items to submit -==== -- Code used to solve this problem in a "code" cell. -- Output from running the code (the name of the node on Brown that you are running on). -==== - -=== Question 3 - -In the upper right-hand corner of your notebook, you will see the current kernel for the notebook, `f2021-s2022-r`. If you click on this name you will have the option to swap kernels out. Change kernels to the `f2021-s2022` kernel, and practice by running the following code examples. 
- -python:: -[source,python] ----- -my_list = [1, 2, 3] -print(f'My list is: {my_list}') ----- - -SQL:: -[source, sql] ----- -%load_ext sql ----- - -and then, in a separate cell: - -[source, sql] ----- -%%sql -sqlite:////depot/datamine/data/movies_and_tv/imdb.db -SELECT * FROM titles LIMIT 5; ----- - - -bash:: -[source,bash] ----- -%%bash -awk -F, '{miles=miles+$19}END{print "Miles: " miles, "\nKilometers:" miles*1.609344}' /depot/datamine/data/flights/subset/1990.csv ----- - -[TIP] -==== -To learn more about how to run various types of code using this kernel, see https://the-examples-book.com/book/projects/templates[our template page]. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -This year, the first step to starting any project should be to download and/or copy https://the-examples-book.com/book/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Brown at `/depot/datamine/apps/templates/project_template.ipynb`). - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_5msf7x1z?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_06emyzsv?wid=_983291"></iframe> -++++ - -Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`. - -There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain markdown text which you can render into nicely formatted text). How many cells of each type are there in this template by default? - -Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A. - -.Items to submit -==== -- How many of each types of cells are there in the default template? -==== - -=== Question 5 - -In question (1) we answered questions about cores and memory for the Brown clusters. To do so, we needed to perform some arithmetic. Instead of using a calculator (or paper, or mental math for you good mental math folks), write these calculations using R _and_ Python, in separate code cells. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 6 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_e6qlkppq?wid=_983291"></iframe> -++++ - -In the previous question, we ran our first R and Python code. In the fall semester, we will focus on learning R. In the spring semester, we will learn some Python. Throughout the year, we will always be focused on working with data, so we must learn how to load data into memory. Load your first dataset into R by running the following code. - -[source,r] ----- -dat <- read.csv("/depot/datamine/data/disney/splash_mountain.csv") ----- - -Confirm that the dataset has been read in by passing the dataset, `dat`, to the `head()` function. The `head` function will return the first 5 rows of the dataset. - -[source,r] ----- -head(dat) ----- - -`dat` is a variable that contains our data! 
We can name this variable anything we want. We do _not_ have to name it `dat`; we can name it `my_data` or `my_data_set`. - -Run our code to read in our dataset, this time, instead of naming our resulting dataset `dat`, name it `splash_mountain`. Place all of your code into a new cell. Be sure to include a level 2 header titled "Question 6", above your code cell. - -[TIP] -==== -In markdown, a level 2 header is any line starting with 2 `\#`'s. For example, `\#\# Question X` is a level 2 header. When rendered, this text will appear much larger. You can read more about markdown https://guides.github.com/features/mastering-markdown/[here]. -==== - -[TIP] -==== -If you are having trouble changing a cell due to the drop down menu behaving oddly, try changing browsers to Chrome or Safari. If you are a big Firefox fan, and don't want to do that, feel free to use the `%%markdown` magic to create a markdown cell without _really_ creating a markdown cell. Any cell that starts with `%%markdown` in the first line will generate markdown when run. -==== - -[NOTE] -==== -We didn't need to re-read in our data in this question to make our dataset be named `splash_mountain`. We could have re-named `dat` to be `splash_mountain` like this. - -[source,r] ----- -splash_mountain <- dat ----- - -Some of you may think that this isn't exactly what we want, because we are copying over our dataset. You are right, this is certainly _not_ what we want! What if it was a 5Gb dataset, that would be a lot of wasted space! Well, R does copy on modify. What this means is that until you modify either `dat` or `splash_mountain` the dataset isn't copied over. You can therefore run the following code to remove the other reference to our dataset. - -[source,r] ----- -rm(dat) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 7 - -Let's pretend we are now done with the project. We've written some code, maybe added some markdown cells to explain what we did, and we are ready to submit our assignment. For this course, we will turn in a variety of files, depending on the project. - -We will always require a PDF which contains text, code, and code output. This is our "source of truth" and what the graders will turn to first when grading. - -[WARNING] -==== -You _must_ double check your PDF before submitting it. A _very_ common mistake is to assume that your PDF has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. -==== - -A PDF is generated by first running every cell in the notebook, and then exporting to a PDF. - -In addition to the PDF, if a project uses R code, you will need to also submit R code in an R script. An R script is just a text file with the extension `.R`. When submitting Python code, you will need to also submit a Python script. A Python script is just a text file with the extension `.py`. - -Let's practice. Take the R code from this project and copy and paste it into a text file with the `.R` extension. Call it `firstname-lastname-project01.R`. Next, take the Python code from this project and copy and paste it into a text file with the `.py` extension. Call it `firstname-lastname-project01.py`. Compile your PDF -- making sure that the output from all of your code is present and in the PDF. - -Once complete, submit your PDF, R script, and Python script. - -.Items to submit -==== -- Resulting PDF (`firstname-lastname-project01.pdf`). 
-- `firstname-lastname-project01.R`. -- `firstname-lastname-project01.py`. -- `firstname-lastname-project01.ipynb`. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project02.adoc deleted file mode 100644 index cd0653150..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project02.adoc +++ /dev/null @@ -1,168 +0,0 @@ -= STAT 19000: Project 2 -- Fall 2021 - -== Introduction to R using https://whin.org[WHIN] weather data - -**Motivation:** The R environment is a powerful tool to perform data analysis. R is a tool that is often compared to Python. Both have their advantages and disadvantages, and both are worth learning. In this project we will dive in head first and learn some of the basics while solving data-driven problems. - -[NOTE] -==== -R and Python both have their advantages and disadvantages. There still exist domains and problems where R is better than Python, and where Python is better than R. In addition, https://julialang.org/[Julia] is another language in this domain that is quickly gaining popularity for it's speed and Python-like ease of use. -==== - -**Context:** In the last project we set the stage for the rest of the semester. We got some familiarity with our project templates, and modified and ran some R code. In this project, we will continue to use R within Jupyter Lab to solve problems. Soon, you will see how powerful R is and why it is often a more effective tool to use than a tool like spreadsheets. - -**Scope:** r, vectors, indexing, recycling - -.Learning Objectives -**** -- List the differences between lists, vectors, factors, and data.frames, and when to use each. -- Explain and demonstrate: positional, named, and logical indexing. -- Read and write basic (csv) data using R. -- Identify good and bad aspects of simple plots. -- Explain what "recycling" is in R and predict behavior of provided statements. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/whin/stations.csv` -- `/depot/datamine/data/whin/weather.csv` - -[NOTE] -==== -These datasets are generously provided to us by one of our corporate partners, the Wabash Heartland Innovation Network (WHIN). You can learn more about WHIN on their website at https://whin.org/[WHIN]. You can learn more about their API https://data.whin.org[here]. This won't be the last time we work with WHIN data, in the future you will get the opportunity to use their API to solve problems that you might not have thought of. 
-==== - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_8dfl940e?wid=_983291"></iframe> -++++ - -While you may not always (and perhaps even rarely) be provided with a neat and clean dataset to work on, when you do, getting a good feel for the dataset(s) is a good first step to solving any data-driven problems. - -Use the `read.csv` function to load our datasets into `data.frame`s named `stations` and `weather`. - -[NOTE] -==== -`read.csv` loads data into a `data.frame` object _by default_. We will learn more about the idea of a `data.frame` in the future. For now, just think of it like a spreadsheet, in which data in each column has the same type of data (e.g. numeric data, strings, etc.). -==== - -Use functions like `head`, `tail`, `str`, and `summary` to explore the data. What are the dimensions of each dataset? What are the first 5 rows of `stations`? What are the first 5 rows of `weather`? What are the names of the columns in each dataset? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Text answering all of the questions above. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_gsbxiet5?wid=_983291"></iframe> -++++ - -The following R code extracts the `temperature` column from our `weather` `data.frame`, into a vector named `temp`. - -[source,r] ----- -temp <- weather$temperature ----- - -What is the first value in the vector? How about the 100th? What is the last? What type of data is in the vector? - -[TIP] -==== -Use the `typeof` function to find out the type of data in a vector. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_hjxijnnp?wid=_983291"></iframe> -++++ - -You should now know at least 1 method for extracting data from a `data.frame` (using the `$`), and should now understand a little bit about indexing. Thats great! Use indexing to add the first 100 `rain_inches_last_hour` from the `weather` `data.frame` to the last 100 `rain_inches_last_hour` from the `weather` `data.frame` to a new vector named `temp100`. Do this in 1 line of code. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_y9c23zro?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_3lk9g3qi?wid=_983291"></iframe> -++++ - -In question (3) we were able to rapidly add values together from two subsets of the same vector. This worked out very nicely because both subsets of data contained 100 values. The first value from the first subset of data was added to the first value from the second subset of data, and so on. - -For station with `station_id` 20, get a vector containing all `temperature` >= 85. Call this vector `hot_temps`. Get a vector containing all `temperature` \<= 40, and call this vector `cold_temps`. How many elements are in `hot_temps`? 
How many elements are in `cold_temps`? Attempt to add the vectors together. What happens? Read https://excelkingdom.blogspot.com/2018/01/what-recycling-of-vector-elements-in-r.html[this] to understand what is happening. - -[NOTE] -==== -This is called _recycling_. Recycling is a very powerful feature of R. It allows you to reuse the same vector elements in different contexts. It can also be a very misleading and dangerous feature as it can lead to unexpected results. This is why it is important to pay attention when R gives you a warning -- something that you aren't expecting may be happening behind the scenes. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_8wb6dggb?wid=_983291"></iframe> -++++ - -Pick any station you are interested in, and create one or more dotplots showing data from any of the columns you are interested in. For each plot, write 1-2 sentences describing any patterns you see. If you don't see any patterns, that is okay, just write, "I don't see any patterns.". - -[TIP] -==== -This is a good opportunity to look at the data in the dataset and explore the variables and see what types of patterns the various variables have. Please feel free to spruce up your plots if you so desire -- it is completely optional, and we will have plenty of time to work on plots as the semester progresses. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences describing any patterns you see. -==== - -=== Question 6 - -The following three pieces of code each create a graphic. The first two graphics are created using only core R functions. The third graphic is created using a package called `ggplot`. We will learn more about all of these things later on. For now, pick your favorite graphic, and write 1-2 sentences explaining why it is your favorite, what could be improved, and include any interesting observations (if any). - -image::figure04.webp[Plot 1, width=400, height=400, loading=lazy, title="Plot 1"] - -image::figure05.webp[Plot 2, width=400, height=400, loading=lazy, title="Plot 2"] - -image::figure06.webp[Plot 3, width=400, height=400, loading=lazy, title="Plot 3"] - -.Items to submit -==== -- 1-2 sentences explaining which is your favorite graphic, why, what could be improved, and any interesting observations you may have (if any). -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project03.adoc deleted file mode 100644 index ac15d094c..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project03.adoc +++ /dev/null @@ -1,175 +0,0 @@ -= STAT 19000: Project 3 -- Fall 2021 - -**Motivation:** `data.frames` are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a `data.frame`. 
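-
-[NOTE]
-====
-To make the motivation above concrete, here is a minimal sketch of retrieving, updating, and inserting values in a `data.frame`. The tiny `data.frame` below is made up purely for illustration; it is not the olympics dataset used in this project.
-
-[source,r]
-----
-# small made-up data.frame, for illustration only
-athletes <- data.frame(name = c("Ana", "Ben"), medals = c(2, 0), stringsAsFactors = FALSE)
-
-# retrieve: named and positional indexing
-athletes$medals          # the whole column
-athletes[1, "medals"]    # a single value
-
-# update: assign into an indexed location
-athletes[2, "medals"] <- 1
-
-# insert: add a new column, or append a new row with rbind
-athletes$country <- c("USA", "CAN")
-athletes <- rbind(athletes,
-                  data.frame(name = "Cho", medals = 3, country = "KOR", stringsAsFactors = FALSE))
-athletes
-----
-====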
- -**Context:** In the previous project we got our feet wet, ran our first R code, and learned about accessing data inside vectors. In this project we will continue to reinforce what we've already learned and introduce a new, flexible data structure called `data.frame`s. - -**Scope:** r, data.frames, recycling, factors - -.Learning Objectives -**** -- Explain what "recycling" is in R and predict behavior of provided statements. -- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- List the differences between lists, vectors, factors, and data.frames, and when to use each. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/olympics/*.csv` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ycskxb95?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ypmoqxd1?wid=_983291"></iframe> -++++ - -Use R code to print the names of the two datasets in the `/depot/datamine/data/olympics` directory. - -Read the larger dataset into a data.frame called `olympics`. - -Print the first 6 rows of the `olympics` data.frame, and take a look at the columns. Based on that, write 1-2 sentences describing the dataset (how many rows, how many columns, the type of data, etc.) and what it holds. - -**Relevant topics:** list.files, file.info, read.csv, head - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences explaining the dataset. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_98pu82xv?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_qt0ygjlf?wid=_983291"></iframe> -++++ - -How many unique sports are accounted for in our `olympics` dataset? Print a list of the sports. Is there any sport that you weren't expecting? Why or why not? - -[IMPORTANT] -==== -R is a case-sensitive language. What this means is that whether or not 1 or more letters in a word are capitalized is important. For example, the following two variables are different. - -[source,r] ----- -vec <- c(1,2,3) -Vec <- c(3,2,1) # note the capital "V" in our variable name - -print(vec) # will print: 1,2,3 -print(Vec) # will print: 3,2,1 ----- - -So, when you are examining a `data.frame` and you see a column name that starts with a capital letter, it is critical that you use the same capitalization when trying to access said column. 
- -[source,r] ----- -colnames(iris) ----- - ----- -[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species" ----- - -[source,r] ----- -iris$"sepal.Length" # will NOT work -iris$"Sepal.length" # will NOT work -iris$"Sepal.Length" # will work ----- -==== - -**Relevant topics:** unique, length - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences explaining the results. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_fapacg9o?wid=_983291"></iframe> -++++ - -Create a data.frame called `us_athletes` that contains only information on athletes from the USA. Use the column `NOC` (National Olympic Committee 3-letter code). How many rows does `us_athletes` have? - -Now, perform the same operation on the `olympics` data.frame, this time containing only the information on the athletes from the country of your choice. Name this new data.frame appropriately. How many rows does it have? - -Now, create a data.frame called `both` that contains the information on the athletes from the USA and the country of your choice. How many rows does it have? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- How many rows or athletes in the `us_athletes` dataset? -- How many rows or athletes in the other country's dataset? -- How many rows or athletes in the `both` dataset? -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_4bc65pzr?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_4uzibkfm?wid=_983291"></iframe> -++++ - -What percentage of US athletes are women? What percentage of US athletes with gold medals are women? - -Answer the same questions for your "other" country from question (3). - -**Relevant topics:** prop.table, table, indexing - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_38avc411?wid=_983291"></iframe> -++++ - -What is the oldest US athlete to compete based on our `us_athletes` data.frame? At what age, in which sport, and what year did the athlete compete in? - -Answer the same questions for your "other" country from question (3) and question (4). - -[IMPORTANT] -==== -Make sure you using indexing to _only_ print the athlete's information (age, sport, year). -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Age, sport, and olympics year that the oldest athlete competed in, for each of your countries. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. 
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project04.adoc
deleted file mode 100644
index 27ff1acd5..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project04.adoc
+++ /dev/null
@@ -1,187 +0,0 @@
-= STAT 19000: Project 4 -- Fall 2021
-
-**Motivation:** Control flow consists of the tools and methods that you can use to control the order in which instructions are executed. We can execute certain tasks or code if certain requirements are met using if/else statements. In addition, we can perform operations many times in a loop using for loops. While these are important concepts to grasp, R differs from other programming languages in that operations are usually vectorized and there is little to no need to write loops.
-
-**Context:** We are gaining familiarity with working in Jupyter Lab and writing R code. In this project we introduce and practice using control flow in R, while continuing to reinforce concepts from the previous projects.
-
-**Scope:** r, data.frames, recycling, factors, if/else, for loops
-
-.Learning Objectives
-****
-- Explain what "recycling" is in R and predict behavior of provided statements.
-- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.
-- Read and write basic (csv) data.
-- Explain and demonstrate: positional, named, and logical indexing.
-- List the differences between lists, vectors, factors, and data.frames, and when to use each.
-- Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/depot/datamine/data/olympics/*.csv`
-
-== Questions
-
-=== Question 1
-
-++++
-<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_o18gov51?wid=_983291"></iframe>
-++++
-
-Winning an Olympic medal is an amazing achievement for athletes.
-
-Before we take a deep dive into our data, it is always important to do some sanity checks and understand what population our sample is representative of, particularly if we didn't get a chance to participate in the data collection phase of the project.
-
-Let's do a quick check on our dataset. We would expect that most athletes would not have won a medal. What percentage of athletes did not get a medal in the `olympics` data.frame?
-
-For simplicity, consider an "athlete" to be a row of the data.frame. Do not worry about the same athlete participating in different Olympic Games, or in different sports.
-
-We are considering the combination of `Sport`, `Event`, and `Games` as a unique identifier for an athlete.
-
-**Relevant topics:** is.na, mean, indexing, sum
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_xbdc5zft?wid=_983291"></iframe>
-++++
-
-A friend of yours hypothesized that there is some association between getting a medal and the athlete's age.
-
-You want to test it out using our `olympics` data.frame. To do so, we will compare 2 new variables:
-
-. (This question) An indicator of whether the athlete in that year and sport won a medal or not.
-. (Next question) Age converted into age categories.
-
-Create a new variable in your `olympics` data.frame called `won_medal` which indicates whether the athlete in that year and sport won a medal or not.
-
-**Relevant topics:** is.na
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-++++
-<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_n9q1j5uw?wid=_983291"></iframe>
-++++
-
-Now that we have our first new column, `won_medal`, let's categorize the `Age` column. Using a for loop and if/else statements, create a new column in the `olympics` data.frame called `age_cat`. Use the guidelines below to do so.
-
-- "youth": less than 18 years old
-- "young adult": between 18 and 25 years old
-- "adult": 26 to 35 years old
-- "middle age adult": between 36 and 55 years old
-- "wise adult": greater than 55 years old
-
-How many athletes are "young adults"?
-
-[TIP]
-====
-Remember to consider the `NA`s as you are solving the problem.
-====
-
-**Relevant topics:** nrow, if/else, for loops, indexing, is.na
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- How many athletes are "young adults"?
-====
-
-=== Question 4
-
-++++
-<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_hty7aegu?wid=_983291"></iframe>
-++++
-
-We did _not_ need to use a for loop to solve the previous problem. Another way to solve the problem would have been to use a vectorized function called `cut`.
-
-Create a variable called `age_cat_cut` using the `cut` function to solve the problem above.
-
-[TIP]
-====
-To check that you are getting the same results, run the following commands.
-
-If you used `cut`'s `labels` argument in your code:
-
-[source,r]
-----
-all.equal(as.character(age_cat_cut), olympics$age_cat)
-----
-
-If you didn't use `cut`'s `labels` argument in your code:
-
-[source,r]
-----
-levels(age_cat_cut) <- c('youth', 'young adult', 'adult', 'middle age adult', 'wise adult')
-all.equal(as.character(age_cat_cut), olympics$age_cat)
-----
-====
-
-[TIP]
-====
-Note that, by default, `cut` creates right-closed intervals. For example, if the breaks are c(a,b,c), the intervals will be "(a, b], (b, c]".
-====
-
-[TIP]
-====
-You can use the argument `labels` in `cut` to label the categories similarly to what we did in question (3).
-====
-
-[NOTE]
-====
-These past 2 questions do a good job of emphasizing the importance of vectorized functions. How long did it take you to run the solution to question (3) vs question (4)? If you find yourself looping through one or more columns one at a time, there is likely a better option.
-====
-
-**Relevant topics:** cut
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-==== - -=== Question 5 - -Now that we have the new columns in the `olympics` data.frame, look at the data and write down your conclusions. Is there some association between winning a medal and the athletes age? - -There a couple of ways you can look at the data to make your conclusions. You can visualize using plots, using functions like `barplot`, and `pie`. Alternatively, you can use numeric summaries, like a table or table with proportions (`prop.table`). Regardless of the method used, explain your findings, and feel free to get creative! - -[NOTE] -==== -You do not need to use any special statistical test to make your conclusions. The goal of this question is to explore the data and think logically. -==== - -[TIP] -==== -The argument `margin` may be useful if you use the `prop.table` function. -==== - -**Relevant topics:** barplot, pie, indexing, table, prop.table, balloonplot - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project05.adoc deleted file mode 100644 index d38c80587..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project05.adoc +++ /dev/null @@ -1,248 +0,0 @@ -= STAT 19000: Project 5 -- Fall 2021 - -**Motivation:** As briefly mentioned in project 4, R differs from other programming languages in that _typically_ you will want to avoid using for loops, and instead use vectorized functions and the "apply" suite. In this project we will use vectorized functions to solve a variety of data-driven problems. - -**Context:** While it was important to stop and learn about looping and if/else statements, in this project, we will explore the R way of doing things. - -**Scope:** r, data.frames, recycling, factors, if/else, for loops, apply suite - -.Learning Objectives -**** -- Explain what "recycling" is in R and predict behavior of provided statements. -- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- List the differences between lists, vectors, factors, and data.frames, and when to use each. -- Demonstrate a working knowledge of control flow in r: for loops. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
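-
-[NOTE]
-====
-As a small illustration of the motivation above, here is a hedged sketch comparing an explicit for loop to a vectorized expression and to `sapply`. The vector is made up for illustration and has nothing to do with the project data.
-
-[source,r]
-----
-# made-up vector, for illustration only
-x <- c(4, 8, 15, 16, 23, 42)
-
-# loop version: fill in a result one element at a time
-doubled_loop <- numeric(length(x))
-for (i in seq_along(x)) {
-    doubled_loop[i] <- x[i] * 2
-}
-
-# vectorized version: a single expression, no loop
-doubled_vec <- x * 2
-
-# sapply applies a function to each element and simplifies the result
-doubled_apply <- sapply(x, function(v) v * 2)
-
-all.equal(doubled_loop, doubled_vec)
-all.equal(doubled_vec, doubled_apply)
-----
-====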
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/youtube/*.{csv,json}` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_160aaijj?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_26d4k2ug?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_olcpdoam?wid=_983291"></iframe> -++++ - -Read the dataset `USvideos.csv` into a data.frame called `us_youtube`. The dataset contains YouTube trending videos between 2017 and 2018. - -[NOTE] -==== -The dataset has two columns that refer to time: `trending_date` and `publish_time`. - -The column `trending_date` is organized in a `[year].[day].[month]` format, while the `publish_time` is in a different format. -==== - -When working with dates, it is important to use tools specifically for this purpose (rather, than using string manipulation, for example). We've provided you with the code below. The provided code uses the `lubridate` package, an excellent package which hides away many common issues that occur when working with dates. Feel free to check out https://raw.githubusercontent.com/rstudio/cheatsheets/master/lubridate.pdf[the official cheatsheet] in case you'd like to learn more about the package. - -Run the code below to extract to create two new columns: `trending_year` and `publish_year`. - -[source,r] ----- -library(lubridate) - -# convert columns to date formats -us_youtube$trending_date <- ydm(us_youtube$trending_date) -us_youtube$publish_time <- ymd_hms(us_youtube$publish_time) - -# extract the trending_year and publish_year -us_youtube$trending_year <- year(us_youtube$trending_date) -us_youtube$publish_year <- year(us_youtube$publish_time) - -unique(us_youtube$trending_year) -unique(us_youtube$publish_year) ----- - -Take a look at our newly created columns. What type are the new columns? In the provided code, which (if any) of the 4 functions are vectorized? - -Now, duplicate the functionality of the provided code using only the following functions: `as.numeric`, `substr`, and regular vectorized operations like `+`, `-`, `*` and `/`. Which was easier? - -**Relevant topics:** read.csv, typeof - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_5a2ku783?wid=_983291"></iframe> -++++ - -While some great content certainly comes out of the United States, we have a lot of other great content from other countries. Plus, the size of the data is reasonable to combine into a single data.frame. - -Look in the following directory: `/depot/datamine/data/youtube`. You will find files that look like this: - ----- -CAvideos.csv -DEvideos.csv -USvideos.csv -... ----- - -You will notice how each dataset follows the same naming convention. Each file starts with the country code, `US`, `DE`, `CA`, etc, followed immediately by "videos.csv". - -Use a loop and the given vector to systematically combine the data into a new data.frame called `yt`. 
- -[source,r] ----- -countries <- c('CA', 'DE', 'FR', 'GB', 'IN', 'JP', 'KR', 'MX', 'RU', 'US') ----- - -In a loop, loop through each of the values in `countries`. Use the `paste0` function to create a string that is the absolute path to each of the files. So, for example, the following would represent the steps to perform in the first loop. - -- In our first loop we have the value `CA`. -- We would use `paste0` to create a string containing the absolute path of the corresponding dataset: `/depot/datamine/data/youtube/CAvideos.csv`. -- Then, we would then use that string as an argument to the `read.csv` function to read in the data into a data.frame. -- Then, we would add the new column `country_code` to the data.frame with the value `CA` repeated for each row. -- Finally, you would use the rbind function to combine the new data.frame with the previous data.frame. - -In the end, you will end up with a single data.frame called `yt`, that contains the data for _every_ country in the dataset. `yt` will _also_ have a column called `country_code` that contains the country code for each row, so we know where the data originated. - -[IMPORTANT] -==== -When combining data, it is important that we don't lose any data in the process. If we slapped together all of the data from each of the datasets into a single file named `yt.csv`, what data would we lose? -==== - -In order to prevent this loss of data, create a new column called `country_code` that includes this information in the dataset rather than in the filename. - -Print a list of the columns in `yt`, in addition, print the dimensions of `yt`. Finally, create the `trending_year` and `publish_year` columns for `yt`. - -[source,r] ----- -# Dr Ward summarizes how to perform Question 2 in the video. -# Here is the analogous code for this question. -# We know that all of this is new for you. -# That is why we are guiding you through this question! - -getdataframe <- function(mycountry) { - myDF <- read.csv(paste0("/depot/datamine/data/youtube/", mycountry, "videos.csv")) - myDF$country_code <- mycountry - return(myDF) -} - -countries <- c('CA', 'DE', 'FR', 'GB', 'IN', 'JP', 'KR', 'MX', 'RU', 'US') - -myresults <- lapply(countries, getdataframe) - -yt <- do.call(rbind, myresults) - ----- - -**Relevant topics:** read.csv, paste0, rbind, dim, colnames - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_sduzls9h?wid=_983291"></iframe> -++++ - -[IMPORTANT] -==== -From this point on, unless specified, use the `yt` data.frame to answer the questions. -==== - -Which YouTube video took the longest time to trend from the time it was published? How many years did it take to trend? - -**Relevant topics:** which.max, indexing - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Name of the YouTube video, and how long it took to trend. -- (Optional) Did you watch the video prior to the project? If so, what do you think about it? -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_n965p2wz?wid=_983291"></iframe> -++++ - -We are interested in seeing whether or not there is a difference in views between videos with ratings enabled vs. those with ratings disabled. 
- -Calculate the average number of views for videos with ratings enabled and those with ratings disabled. Anecdotally, does it look like disabling the ratings helps or hurts the views? - -[TIP] -==== -You can use `tapply` to solve this problem if you are comfortable with the `tapply` function. Otherwise, stay tuned in a future project where we will explore the `tapply` function in more detail. -==== - -[TIP] -==== -You _may_ need to take a careful look at the `ratings_disabled` column. What type should this column be? Make sure to convert if necessary. -==== - -**Relevant topics:** mean, tapply indexing - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Create two new columns in `yt`: - -- `balance`: the difference between `likes` and `dislikes` for a given video. -- `positive_balance`: an indicator variable that is `TRUE` if `balance` is greater than zero, and `FALSE` otherwise. - -How many videos have a positive balance? - -**Relevant topics:** sum - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 6 - -Compare videos with a positive `positive_balance` to those with a non-positive `positive_balance`. Make this comparison based on the `comment_count` and the `views` of the videos. - -To make a comparison, pick a statistic to summarize and compare `comment_count` and `views`. Examples of statistics include: `mean`, `median`, `max`, `min`, `var`, and `sd`. - -You can pick more than one statistic to compare, if you want, and each column may have its own statistic(s) to summarize it. - -**Relevant topics:** tapply, mean, sum, var, sd, max, min, median - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences explaining what statistic you chose to summarize each column, and why. -- 1-2 sentences comparing videos with positive balance and non-positive balance based on `comment_count` and `views`. Is the result surprising to you? -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project06.adoc deleted file mode 100644 index 762b1ae7a..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project06.adoc +++ /dev/null @@ -1,164 +0,0 @@ -= STAT 19000: Project 6 -- Fall 2021 - -**Motivation:** `tapply` is a powerful function that allows us to group data, and perform calculations on that data in bulk. The "apply suite" of functions provide a fast way of performing operations that would normally require the use of loops. If you have any familiarity with SQL, it `tapply` is very similar to working with the `GROUP BY` clause -- you first group your data using some rule, and then perform some operation for each newly created group. - -**Context:** The past couple of projects have studied the use of loops and/or vectorized operations. In this project, we will introduce a function called `tapply` from the "apply suite" of functions in R. 
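-
-[NOTE]
-====
-To preview the idea, here is a minimal, hedged sketch of `tapply` on a tiny made-up data.frame, with the SQL `GROUP BY` query it roughly corresponds to shown only as a comment. The table and column names are invented for illustration and are not the project dataset.
-
-[source,r]
-----
-# tiny made-up data.frame, for illustration only
-songs <- data.frame(year = c(1999, 1999, 2000, 2000, 2000),
-                    duration = c(210, 180, 240, 200, 220))
-
-# tapply(values, groups, function): average duration per year
-tapply(songs$duration, songs$year, mean)
-
-# roughly the same idea as this SQL query (shown here only as a comment):
-# SELECT year, AVG(duration) FROM songs GROUP BY year;
-----
-====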
- -**Scope:** r, tapply - -.Learning Objectives -**** -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- List the differences between lists, vectors, factors, and data.frames, and when to use each. -- Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc. -- Demonstrate using tapply to perform calculations on subsets of data. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/amazon/tracks.csv` -- `/depot/datamine/data/amazon/tracks.db` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_p9ikn37a?wid=_983291"></iframe> -++++ - -Load the `tracks.csv` file into an R data.frame called `tracks`. Immediately after loading the file, run the following. - -[source,r] ----- -str(tracks) ----- - -What happens? - -[TIP] -==== -The C in CSV is not true for this dataset! You'll need to take advantage of the `sep` argument of `read.csv` to read in this dataset. -==== - -Once you've successfully read in the data, re-run the following. - -[source,r] ----- -str(tracks) ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_1o6uli5q?wid=_983291"></iframe> -++++ - -Great! `tapply` is a very cool, very powerful function in R. - -First, let's say that we wanted to see what the average `duration` (a column in the `tracks` data.frame) of songs were _by_ each `year` (a column in the `tracks` data.frame). If you think about how you would approach solving this problem, there are a lot of components to keep track of! - -- You don't know ahead of time how many different years are in the dataset. -- You have to associate each sum of `duration` with a specific `year`. -- Etc. - -Its a lot of work! - -In R, there is a really great library that allows us to run queries on an sqlite database and put the result directly into a dataframe. This would be the SQL and R solution to this problem. - -[source,r] ----- -library(RSQLite) - -con <- dbConnect(SQLite(), dbname = "/depot/datamine/data/amazon/tracks.db") -myDF <- dbGetQuery(con, "SELECT year, AVG(duration) AS average_duration FROM songs GROUP BY year;") -head(myDF) ----- - -Use `tapply` to solve the same problem! Are your results the same? Print the first 5 results to make sure they are the same. - -[TIP] -==== -`tapply` can take a minute to get the hang of. I like to think about the first argument to `tapply` as the column of data we want to _perform an operation_ on, the second argument to `tapply` as the column of data we want to _group_ by, and the third argument as the operation (as a function, like `sum`, or `median`, or `mean` or `sd`, or `var`, etc.) we want to perform on the data. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_6riy5fyl?wid=_983291"></iframe> -++++ - -Plot the results of question (2) with any appropriate plot that will highlight the duration of music by year, sequentially. What patterns do you see, if any? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_twqvie46?wid=_983291"></iframe> -++++ - -Ha! Thats not so bad! What are the `artist_name` of the artists with the highest median `duration` of songs? Sort the results of the `tapply` function in descending order and print the first 5 results. - -[CAUTION] -==== -This may take a few minutes to run -- this function is doing a lot and there are a lot of artists in this dataset! -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Explore the dataset and come up with a question you want to answer. Make sure `tapply` would be useful with your investigation, and use `tapply` to calculate something interesting for the dataset. Create one or more graphics as you are working on your question. Write 1-2 sentences reviewing your findings. It could be anything, and your findings do not need to be "good" or "bad", they can be boring (much like a lot of research findings)! - -.Items to submit -==== -- Question you want to answer. -- Code used to solve this problem. -- Output (including graphic(s)) from running the code. -- 1-2 sentences reviewing your findings. -==== - -=== Question 6 (optional, 0 pts) - -Use the following SQL and R code and take a crack at solving a problem (any problem) you want to do with R and SQL. You can use the following code to help. Create a cool graphic with the results! - -[source,r] ----- -library(RSQLite) - -con <- dbConnect(SQLite(), dbname = "/depot/datamine/data/amazon/tracks.db") -myDF <- dbGetQuery(con, "SELECT year, AVG(duration) AS average_duration FROM songs GROUP BY year;") -myDF ----- - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project07.adoc deleted file mode 100644 index ab102521f..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project07.adoc +++ /dev/null @@ -1,219 +0,0 @@ -= STAT 19000: Project 7 -- Fall 2021 - -**Motivation:** A couple of bread-and-butter functions that are a part of the base R are: `subset`, and `merge`. `subset` provides a more natural way to filter and select data from a data.frame. `merge` brings the principals of combining data that SQL uses, to R. - -**Context:** We've been getting comfortable working with data in within the R environment. Now we are going to expand our toolset with these useful functions, all the while gaining experience and practice wrangling data! 
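-
-[NOTE]
-====
-As a quick preview of the SQL-style combining mentioned above, here is a minimal sketch of `subset` and `merge` on two tiny, made-up data.frames (the column names below are only for illustration):
-
-[source,r]
-----
-# two tiny, made-up data.frames that share the column author_id
-books_toy   <- data.frame(book_id = 1:3, author_id = c(10, 10, 20),
-                          average_rating = c(4.1, 3.7, 4.5))
-authors_toy <- data.frame(author_id = c(10, 20), name = c("Author A", "Author B"))
-
-# subset: keep only the rows that satisfy a condition
-subset(books_toy, average_rating > 4)
-
-# merge: combine the two data.frames on their shared column, like a SQL join
-merge(books_toy, authors_toy, by = "author_id")
-----
-====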
- -**Scope:** r, subset, merge, tapply - -.Learning Objectives -**** -- Gain proficiency using split, merge, and subset. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Demonstrate how to use tapply to solve data-driven problems. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/goodreads/csv/*.csv` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_cvms95v9?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_1bi9ssp2?wid=_983291"></iframe> -++++ - -Read the `goodreads_books.csv` into a data.frame called `books`. Let's say Dr. Ward is working on a book and new content. He is looking for advice and wants some insight from us. - -A friend told him that he should pick a month in the Summer to publish his book. - -Based on our `books` dataset, is there any evidence that certain months get higher than average rating? What month would you suggest for Dr. Ward to publish his new book? - -[TIP] -==== -Use columns `average_rating` and `publication_month` to solve this question. -==== - -[TIP] -==== -To read the data in faster and more efficiently, try the following: - -[source,r] ----- -library(data.table) -books <- fread("/path/to/data") ----- -==== - -**Relevant topics:** tapply, mean - -.Items to submit -==== -- R code used to solve this problem. -- The results of running the R code. -- 1-2 sentences comparing the publication month based on average rating. -- 1-2 sentences with your suggestion to Dr. Ward and reasoning. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_5jtrrkbs?wid=_983291"></iframe> -++++ - -Create a new column called `book_size_cat` that is a categorical variable based on the number of pages a book has. - -`book_size_cat` should have 3 levels: `small`, `medium`, `large`. - -Run the code below to get different summaries and visualizations of the number of pages books have in our datasets. - -[source,r] ----- -summary(books$num_pages) -hist(books$num_pages) -hist(books$num_pages[books$num_pages <= 1000]) -boxplot(books$num_pages[books$num_pages < 4000]) ----- - -Pick the values from which to separate these levels by. Write 1-2 sentences explaining why you pick those values. - -[TIP] -==== -You can do other visualizations to determine. Have fun, there is no right or wrong. What would you consider a small, medium, and large book? -==== - -**Relevant topics:** cut - -.Items to submit -==== -- R code used to solve this problem. -- The results of running the R code. -- 1-2 sentences explaining the values you picked to create your categorical data and why. -- The results of running `table(books$book_size_cat)`. 
-==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_uizgzw2a?wid=_983291"></iframe> -++++ - -Dr. Ward is a firm believer in constructive feedback, and would like people to provide feedback for his book. - -What recommendation would you make to Dr. Ward when it comes to book size? - -[TIP] -==== -Use the column `text_reviews_count` and compare, on average, how many text reviews the various book sizes get. -==== - -[NOTE] -==== -Association is not causation, and there are many factors that lead to people providing reviews. Your recommendation can be based on anecdotal evidence, no worries. -==== - -**Relevant topics:** tapply, mean - -.Items to submit -==== -- R code used to solve this problem. -- The results of running the R code. -- 1-2 sentences with your recommendation and reasoning. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_j5ykx9zb?wid=_983291"></iframe> -++++ - -Sometimes (often times) looking at a single summary of our data may not provide the full picture. - -Make a side-by-side boxplot for the `text_reviews_count` by `book_size_cat`. - -Does your answer to question (3) change based on your plot? - -[TIP] -==== -Take a look at the first example when you run `?boxplot`. -==== - -[TIP] -==== -You can make three boxplots if you prefer, but make sure that they all have the same y-axis limit to make the comparisons. -==== - -**Relevant topics:** boxplot - -.Items to submit -==== -- R code used to solve this problem. -- The results of running the R code. -- 1-2 sentences with your recommendation and reasoning. -==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_m4uiyi99?wid=_983291"></iframe> -++++ - -Repeat question (4), this time, use the `subset` function to reduce your data to books with a `text_reviews_count` less than 200. How does this change your plot? Is it a little easier to read? - -.Items to submit -==== -- R code used to solve this problem. -- The results of running the R code. -- 1-2 sentences with your recommendation and reasoning. -==== - -=== Question 6 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_yg64cj6z?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_gfluoytt?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_j25ni26m?wid=_983291"></iframe> -++++ - -Read the `goodreads_book_authors.csv` into a new data.frame called `authors`. - -Use the `merge` function to combine the `books` data.frame with the `authors` data.frame. Call your new data.frame `books_authors`. - -Now, use the `subset` function to create get a subset of your data for your favorite authors. Include at least 5 authors that appear in the dataset. - -Redo question (4) using this new subset of data. Does your recommendation change at all? - -[TIP] -==== -Make sure you pay close attention to the resulting `books_authors` data.frame. The column names will be changed to reflect the merge. 
Instead of `text_reviews_count` you may need to use `text_reviews_count.x`, or `text_reviews_count.y`, depending on how you merged. -==== - -.Items to submit -==== -- R code used to solve this problem. -- The results of running the R code. -- 1-2 sentences with your recommendation and reasoning. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project08.adoc deleted file mode 100644 index 9a1a7a62a..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project08.adoc +++ /dev/null @@ -1,222 +0,0 @@ -= STAT 19000: Project 8 -- Fall 2021 - -**Motivation:** A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code! - -**Context:** We've been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon. - -**Scope:** r, functions - -.Learning Objectives -**** -- Gain proficiency using split, merge, and subset. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Demonstrate how to use tapply to solve data-driven problems. -- Comprehend what a function is, and the components of a function in R. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/goodreads/csv/interactions_subset.csv` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_vry74zoc?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_fs3cmr87?wid=_983291"></iframe> -++++ - -Read the `interactions_subset.csv` into a data.frame called `interactions`. We have provided you with the function `get_probability_of_review` below. - -After reading in the data, run the code below, and add comments explaining what the function is doing at each step. - -[source,r] ----- -# A function that, given a string (userID) and a value (min_rating) returns a value (probability_of_reviewing). 
-get_probability_of_review <- function(interactions_dataset, userID, min_rating) { - # FILL IN EXPLANATION HERE - user_data <- subset(interactions_dataset, user_id == userID) - - # FILL IN EXPLANATION HERE - read_user_data <- subset(user_data, is_read == 1) - - # FILL IN EXPLANATION HERE - read_user_min_rating_data <- subset(read_user_data, rating >= min_rating) - - # FILL IN EXPLANATION HERE - probability_of_reviewing <- mean(read_user_min_rating_data$is_reviewed) - - # Return the result - return(probability_of_reviewing) -} - -get_probability_of_review(interactions_dataset = interactions, userID = 5000, min_rating = 3) ----- - -Provide 1-2 sentences explaining overall what the function is doing and what arguments it requires. - -[TIP] -==== -You may want to use `fread` function from the library `data.table` to read in the data. -==== - -[source,r] ----- -library(data.table) -interactions <- fread("/path/to/dataset") ----- - -[CAUTION] -==== -Your kernel may crash! As it turns out, the `subset` function is not very memory efficient (never fully trust a function). When you launch your Jupyter Lab session, if you use 3072 MB of memory, your kernel is likely to crash on this example. If (instead) you use 5120 MB of memory when you launch your session, you should have sufficient memory to run these examples. -==== - -**Relevant topics:** function, subset - -.Items to submit -==== -- R code used to solve this problem. -- Modified `get_probability_of_review` with comments explaining each step. -- 1-2 sentences explaining overall what the function is doing. -- Number and name of arguments for the function, `get_probability_of_review`. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_2672hvky?wid=_983291"></iframe> -++++ - -We want people that use our function to be able to get results even if they don't provide a minimum rating value. - -Modify the function `get_probability_of_review` so `min_rating` has the default value of 0. Test your function as follows. - -[source,r] ----- -get_probability_of_review(interactions_dataset = interactions, userID = 5000) ----- - -Now, in R (and in most languages), you can provide the arguments out of order, as long as you provide the argument name on the left of the equals sign and the value on the right. For example the following will still work. - -[source,r] ----- -get_probability_of_review(userID = 5000, interactions_dataset = interactions) ----- - -In addition, you don't have to provide the argument names when you call the function, however, you _do_ have to place the arguments in order when you do. - -[source,r] ----- -get_probability_of_review(interactions, 5000) ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_f6kdj10w?wid=_983291"></iframe> -++++ - -Our function may not be the most efficient. However, we _can_ reduce the code a little bit! Modify our function so we only use the `subset` function once, rather than 3 times. - -Test your modified function on userID 5000. Do you get the same results as above? - -Now, instead of using `subset`, just use regular old indexing in your function. Do your results agree with both versions above? - -.Items to submit -==== -- Code used to solve this problem. 
-- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_hcn9top3?wid=_983291"></iframe> -++++ - -Run the code below. Explain what happens, and why it is happening. - -[source,r] ----- -head(read_user_min_rating_data) ----- - -[TIP] -==== -Google "Scoping in R", and read. -==== - -.Items to submit -==== -- The results of running the R code. -- 1-2 sentences explaining what happened. -- 1-2 sentences explaining why it is happening. -==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ld0ymltw?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_mbs17lbu?wid=_983291"></iframe> -++++ - -Apply our function to the `interactions` dataset to get, for a sample of 10 users, the probability of reviewing books given that they liked the book. - -Save this probability to a vector called `prob_review`. - -To do so, determine a minimum rating (`min_rating`) value when calculating that probability. Provide 1-2 sentences explaining why you chose this value. - -[TIP] -==== -You can use the function `sample` to get a random sample of 10 users. -==== - -[TIP] -==== -You can pick any 10 users you want to compose your sample. -==== - -.Items to submit -==== -- R code used to solve this problem. -- The results of running the R code. -- 1-2 sentences explaining why you this particular minimum rating value. -==== - -=== Question 6 - -Change the minimum rating value, and re-calculate the probability for your selected 10 users. - -Make 1 (or more) plot(s) to compare the results you got with the different minimum rating value. Write 1-2 sentences describing your findings. - - -.Items to submit -==== -- R code used to solve this problem. -- The results of running the R code. -- 1-2 sentences comparing the results for question (5) and (6). -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project09.adoc deleted file mode 100644 index 76bfdbb0a..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project09.adoc +++ /dev/null @@ -1,167 +0,0 @@ -= STAT 19000: Project 9 -- Fall 2021 -:page-mathjax: true - -**Motivation:** A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code! - -**Context:** We've been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon. - -**Scope:** r, functions - -.Learning Objectives -**** -- Gain proficiency using split, merge, and subset. 
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Demonstrate how to use tapply to solve data-driven problems. -- Comprehend what a function is, and the components of a function in R. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/election/*.txt` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_hrs5b8n7?wid=_983291"></iframe> -++++ - -https://en.wikipedia.org/wiki/Benford%27s_law[Benford's law] has many applications, the most famous probably being fraud detection. - -[quote, wikipedia, 'https://en.wikipedia.org/wiki/Benford%27s_law'] -____ -Benford's law, also called the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation that in many real-life sets of numerical data, the leading digit is likely to be small. In sets that obey the law, the number 1 appears as the leading significant digit about 30 % of the time, while 9 appears as the leading significant digit less than 5 % of the time. If the digits were distributed uniformly, they would each occur about 11.1 % of the time. Benford's law also makes predictions about the distribution of second digits, third digits, digit combinations, and so on. -____ - -Benford's law is given by the equation below. - -$P(d) = \dfrac{\ln((d+1)/d)}{\ln(10)}$ - -$d$ is the leading digit of a number (and $d \in \{1, \cdots, 9\}$) - -Create a function called `benfords_law` that takes the argument `digit`, and calculates the probability of `digit` being the starting digit of a random number based on Benford's law above. - -Consider `digit` to be a single value. Test your function on digit 7. - -.Items to submit -==== -- R code used to solve this problem. -- The results of running the R code. -- The results of running `benfords_law(7)`. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_3inl0aaj?wid=_983291"></iframe> -++++ - -Let's make our function more user friendly. When creating functions, it is important to think where you are going to use it, and if other people may use it as well. - -Adding error catching statements can help make sure your function is not used out of context. - -Add the following error catching by creating an if statement that checks if `digit` is between 1 and 9. If not, use the `stop` function to stop the function, and return a message explaining the error or how the user could avoid it. - -Consider `digit` to be a single value. Test your new `benfords_law` function on digit 0. - -.Items to submit -==== -- R code used to solve this problem. -- The results of running the R code. -- The results of running `benfords_law(0)`. 
-====
-
-=== Question 3
-
-++++
-<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_906xaf95?wid=_983291"></iframe>
-++++
-
-Our `benfords_law` function was created to calculate the probability for a single digit. We have discussed in the past the advantages of having a vectorized function.
-
-Modify `benfords_law` to accept a vector of leading digits. Make sure `benfords_law` stops if any value in the vector `digit` is not between 1 and 9.
-
-Test your vectorized `benfords_law` using the following code.
-
-[source,r]
-----
-benfords_law(0:5)
-benfords_law(1:6)
-----
-
-[TIP]
-====
-There are many ways to solve this problem. You can use for loops, or use the functions `sapply` or `Vectorize`. However, the simplest way may be to take a look at our `if` statement, as the function `log` is already vectorized.
-====
-
-**Relevant topics:** any
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Calculate the Benford's law probabilities for all possible digits (1 to 9). Create a graph to illustrate the results. You can use a barplot, a lineplot, or a combination of both.
-
-Make sure you add a title to your plot, and play with the colors and aesthetics of your plot. Have fun!
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-++++
-<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_w6f7uuad?wid=_983291"></iframe>
-++++
-
-Now that we have, and understand, the theoretical probabilities of Benford's Law, how about we use them to try to find anomalies in the elections dataset?
-
-As we mentioned previously, Benford's Law is very commonly used in fraud detection. Fraud detection algorithms look for anomalies in datasets based on certain criteria and flag them for audit or further exploration.
-
-Not every anomaly is fraud, but it _is_ a good start.
-
-We will continue this in our next project, but we can start to set things up.
-
-Create a function called `get_starting_digit` that has one argument, `transaction_vector`.
-
-The function should return a vector containing the starting digit for each value in the `transaction_vector`.
-
-For example, `get_starting_digit(c(10, 2, 500))` should return `c(1, 2, 5)`. Make sure that the result of `get_starting_digit` is a numeric vector.
-
-Test your code by running the following.
-
-[source,r]
-----
-str(get_starting_digit(c(100,2,50,689,1)))
-----
-
-[TIP]
-====
-There are many ways to solve this question.
-====
-
-**Relevant topics:** as.numeric, substr
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project10.adoc deleted file mode 100644 index 9032cc890..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project10.adoc +++ /dev/null @@ -1,183 +0,0 @@ -= STAT 19000: Project 10 -- Fall 2021 - -**Motivation:** Functions are powerful. They are building blocks to more complex programs and behavior. In fact, there is an entire programming paradigm based on functions called https://en.wikipedia.org/wiki/Functional_programming[functional programming]. In this project, we will learn to apply functions to entire vectors of data using `sapply`. - -**Context:** We've just taken some time to learn about and create functions. One of the more common "next steps" after creating a function is to use it on a series of data, like a vector. `sapply` is one of the best ways to do this in R. - -**Scope:** r, sapply, functions - -.Learning Objectives -**** -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Utilize apply functions in order to solve a data-driven problem. -- Gain proficiency using split, merge, and subset. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/election/*.txt` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_07pgkhcb?wid=_983291"></iframe> -++++ - -Read the elections dataset from 2014 (`itcont2014.txt`) into a data.frame called `elections` using the `fread` function from the `data.table` package. - -[TIP] -==== -Make sure to use the correct argument `sep='|'` from the `fread` function. -==== - -Create a vector called `transactions_starting_digit` that gets the starting digit for each transaction value (use the `TRANSACTION_AMT` column). Be sure to use `get_starting_digit` function from the previous project. - -Take a look at the starting digits of the unique transaction amounts. Can we directly compare the results to the Benford's law to look for anomalies? Explain why or why not, and if not, what do we need to do to be able to make the comparisons? - -[TIP] -==== -Pay close attention to the results -- if you were able to directly compare, the numbers you were testing would need to be _valid_ for the benfords law function. -==== - -[TIP] -==== -What are the possible digits a number can start with? -==== - -**Relevant topics:** fread, unique, table - -.Items to submit -==== -- R code used to solve this problem. -- The results of running the R code. -- 1-2 sentences explaining if any changes are needed in our dataset to analyze it using Benford's Law, why or why not? If so what changes are necessary? -==== - -=== Question 2 - -[TIP] -==== -Be sure to watch the video from Question 1. It covers Question 2 too. -==== - -If in question (1) you answered that there are modifications needed in the data, make the necessary modifications. - -[TIP] -==== -You _should_ need to make a modification. -==== - -Make a barplot showing the percentage of times each digit was the starting digit. 
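-
-[NOTE]
-====
-If you have not combined these functions before, here is a minimal sketch of turning a vector of digits into percentages and plotting them, using a short, made-up vector rather than the elections data:
-
-[source,r]
-----
-# a short, made-up vector of starting digits (not the elections data)
-digits <- c(1, 1, 2, 1, 3, 2, 9, 1)
-
-# percentage of times each digit appears
-pcts <- prop.table(table(digits)) * 100
-
-barplot(pcts, xlab = "starting digit", ylab = "percent")
-----
-
-Note that `table` only includes the digits that actually appear, which matters if some digits are missing from your data.
-====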
-
-Include in your barplot a line indicating the expected percentage based on Benford's law.
-
-If we compared our results to Benford's Law, would we consider the findings anomalous? Explain why or why not.
-
-**Relevant topics:** barplot, lines, points, table, prop.table, indexing
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- 1-2 sentences explaining whether or not you think the results for this dataset are anomalous based on Benford's law, and why.
-====
-
-=== Question 3
-
-++++
-<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_thxzxiap?wid=_983291"></iframe>
-++++
-
-Let's explore things a bit more. How does a different grouping look? To facilitate our analysis, let's create a function to replicate the steps from questions (1) and (2).
-
-Create a function called `compare_to_benfords` that accepts two arguments, `values` and `title`. `values` represents a vector of values to analyze using Benford's Law, and `title` will provide the title of our resulting plot.
-
-Make sure the `title` argument has a default value, so that if we don't pass an argument to it, the function will still run.
-
-The function should get the starting digits in `values`, perform any necessary clean up, and compare the results with Benford's Law graphically, by producing a plot like the one we made in question (2).
-
-Note that we are simplifying things by wrapping what we did in questions (1) and (2) into a function so we can do the analysis more efficiently.
-
-Test your function on the `TRANSACTION_AMT` column from the `elections` dataset. Note that the results should be the same as question (2) -- even the title of your plot.
-
-For fair comparison, set the y-axis limits to be between 0 and 50%.
-
-[TIP]
-====
-If you called either of the `benfords_law` or `get_starting_digit` functions _within_ your `compare_to_benfords` function, consider the following.
-
-What if you shared this function with your friend, who _didn't_ have access to your `benfords_law` or `get_starting_digit` functions? It wouldn't work!
-
-Instead, it is perfectly acceptable to _declare_ your functions _inside_ your `compare_to_benfords` function. These types of functions are called _helper_ functions.
-====
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- The results of running `compare_to_benfords(elections$TRANSACTION_AMT)`.
-====
-
-=== Question 4
-
-++++
-<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_c4kkn7a6?wid=_983291"></iframe>
-++++
-
-Let's dig into the data a bit more. Using the `compare_to_benfords` function, analyze the transactions from the following entities (`ENTITY_TP`):
-
-- Candidate ('CAN'),
-- Individual - a person - ('IND'),
-- and Organization - not a committee and not a person - ('ORG').
-
-Use a loop, or one of the functions in the `apply` suite to solve this problem.
-
-Write 1-2 sentences comparing the transactions for each type of `ENTITY_TP`.
-
-Before running your code, run the following code to create a 1x3 grid (one plot per entity type) for our plots.
-
-[source,r]
-----
-par(mfrow=c(1,3))
-----
-
-[TIP]
-====
-There are many ways to solve this problem.
-====
-
-.Items to submit
-====
-- R code used to solve this problem.
-- The results of running the R code.
-- The results of running `compare_to_benfords` for each of the three entity types.
-- Optional: Include the name or abbreviation of the entity in its title. -==== - -=== Question 5 - -Use the elections datasets and what you learned from the Benford's Law to explore the dataset more. - -You can compare specific states, donations to other entities, or even use datasets from other years. - -Explain what and why you are doing, and what are your conclusions. Be creative! - -.Items to submit -==== -- R code used to solve this problem. -- The results of running the R code. -- 1-2 sentences explaining what and why you are doing. -- 1-2 sentences explaining your conclusions. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project11.adoc deleted file mode 100644 index 119f2a4ea..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project11.adoc +++ /dev/null @@ -1,200 +0,0 @@ -= STAT 19000: Project 11 -- Fall 2021 - -**Motivation:** The ability to understand a problem, know what tools are available to you, and select the right tools to get the job done, takes practice. In this project we will use what you've learned so far this semester to solve data-driven problems. In previous projects, we've directed you towards certain tools. In this project, there will be less direction, and you will have the freedom to choose the tools you'd like. - -**Context:** You've learned lots this semester about the R environment. You now have experience using a very balanced "portfolio" of R tools. We will practice using these tools on a set of YouTube data. - -**Scope:** R - -.Learning Objectives -**** -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Utilize apply functions in order to solve a data-driven problem. -- Gain proficiency using split, merge, and subset. -- Comprehend what a function is, and the components of a function in R. -- Demonstrate the ability to use nested apply functions to solve a data-driven problem. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/youtube/*` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_o2ycplyx?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_y8v7j5yt?wid=_983291"></iframe> -++++ - -In project 5, we used a loop to combine the various countries youtube datasets into a single dataset called `yt` (for YouTube). - -Now, we've provided you with code below to create such a dataset, with a subset of the countries we want to look at. 
- -[source,r] ----- -library(lubridate) - -countries <- c('US', 'DE', 'CA', 'FR') - -# Choose either the for loop or the sapply function for creating `yt` - -# EITHER use a for loop to create the data frame `yt` -yt <- data.frame() -for (c in countries) { - filename <- paste0("/depot/datamine/data/youtube/", c, "videos.csv") - dat <- read.csv(filename) - dat$country_code <- c - yt <- rbind(yt, dat) -} - -# OR use an sapply function to create the data frame `yt` -myDFlist <- lapply( countries, function(c) { - dat <- read.csv(paste0("/depot/datamine/data/youtube/", c, "videos.csv")) - dat$country_code <- c - return(dat)} ) -yt <- do.call(rbind, myDFlist) - -# convert columns to date formats -yt$trending_date <- ydm(yt$trending_date) -yt$publish_time <- ymd_hms(yt$publish_time) - -# extract the trending_year and publish_year -yt$trending_year <- year(yt$trending_date) -yt$publish_year <- year(yt$publish_time) ----- - -Take a look at the `tags` column in our `yt` dataset. Create a function called `count_tags` that has an argument called `tag_vector`. Your `count_tags` function should be the count of how many unique tags the vector, `tag_vector` contains. - -[TIP] -==== -Take a look at the `fixed` argument in `strsplit`. -==== - -You can test your function with the following code. - -[source,r] ----- -tag_test <- yt$tags[2] -tag_test -count_tags(tag_test) ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_t0sd6eh6?wid=_983291"></iframe> -++++ - -Create a new column in your `yt` dataset called `n_tags` that contains the number of tags for the corresponding trending video. - -Make sure to use your `count_tags` function. Which YouTube trending video has the highest number of unique tags for videos that are trending either in the US or Germany (DE)? How many tags does it have? - -[TIP] -==== -Make sure to use the `USE.NAMES` argument from `sapply` function -==== - -[TIP] -==== -Begin by creating the new column `n_tags`. Then create a new dataset only for youtube videos trending in 'US' or 'DE'. For the subsetted dataset, get the YouTube trending video with highest number of tags. -==== - -[TIP] -==== -It should be `video_id` with value 4AelFaljd7k. -==== - -**Relevant topics:** sapply, which.max, indexing, subset - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The title of the YouTube video with the highest number of tags, and the number of tags it has. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_g7f6eejv?wid=_983291"></iframe> -++++ - -Is there an association between number of tags in a video and how many views it gets? - -Make a scatterplot with number of `views` in the x-axis and number of tags (`n_tags`) in the y-axis. Based on your plot, write 1-2 sentences about whether you think number of tags and number of views are associated or not. - -Hmmm, is a scatterplot a good choice to be able to see an association in this case? If so, explain why. If not, create a better plot for determining this, and explain why your plot is better, and try to explain if you see any association. - -[TIP] -==== -`tapply` could be useful for the follow up question. 
-==== - -**Relevant topics:** sapply, which.max, indexing, subset - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences explaining if you think number of views and number of tags a youtube video has are associated or not, and why. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_gjwv8nur?wid=_983291"></iframe> -++++ - -Compare the average number of views and average number of comments that the YouTube trending videos have _per trending country_. - -Is there a different behavior between countries? Are the comparisons fair? To check if we are being fair, take a look at how many youtube trending videos we have per country. - -**Relevant topics:** tapply, mean - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences comparing trending countries based on average number of views and comments. -- 1-2 sentences explaining if you think we are being fair in our comparisons, and why or why not. -==== - -=== Question 5 - -How would you compare the YouTube trending videos across the different countries? - -Make a comparison using plots and/or summary statistics. Explain what variables are you looking at, and why you are analyzing the data the way you are. Have fun with it! - -[NOTE] -==== -There are no right/wrong answers here. Just dig in a little bit and see what you can find. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences explaining your logic. -- 1-2 sentences comparing the countries. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project12.adoc deleted file mode 100644 index 62812ceeb..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project12.adoc +++ /dev/null @@ -1,191 +0,0 @@ -= STAT 19000: Project 12 -- Fall 2021 - -**Motivation:** In the previous project you were forced to do a little bit of date manipulation. Dates can be very difficult to work with, regardless of the language you are using. `lubridate` is a package within the famous https://www.tidyverse.org/[tidyverse], that greatly simplifies some of the most common tasks one needs to perform with date data. - -**Context:** We've been reviewing topics learned this semester. In this project we will continue solving data-driven problems, wrangling data, and creating graphics. We will introduce a https://www.tidyverse.org/[tidyverse] package that adds great stand-alone value when working with dates. - -**Scope:** r - -.Learning Objectives -**** -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Utilize apply functions in order to solve a data-driven problem. -- Gain proficiency using split, merge, and subset. -- Demonstrate the ability to create basic graphs with default settings. -- Demonstrate the ability to modify axes labels and titles. 
-- Incorporate legends using legend(). -- Demonstrate the ability to customize a plot (color, shape/linetype). -- Convert strings to dates, and format dates using the `lubridate` package. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_hja54yd8?wid=_983291"></iframe> -++++ - -[WARNING] -==== -For this project, when launching your Jupyter Lab instance, please select 5000 as the amount of memory to allocate. -==== - -Read the dataset into a dataframe called `liquor`. - -We are interested in exploring time-related trends in Iowa liquor sales. What is the data type for the column `Date`? - -Try to run the following code, to get the time between the first and second sale. - -[source,r] ----- -liquor$Date[1] - liquor$Date[2] ----- - -As you may have expected, we cannot use the standard operators (like + and -) on this type. - -Create a new column named `date` to be the `Date` column but in date format using the function `as.Date()`. - -[IMPORTANT] -==== -From this point in time on, you will have 2 "date" columns -- 1 called `Date` and 1 called `date`. `Date` will be the incorrect type for a date, and `date` will be the correct type. - -This allows us to see different ways to work with the data. -==== - -You may need to define the date format in the `as.Date()` function using the argument `format`. - -Try running the following code now. - -[source,r] ----- -liquor$date[1] - liquor$date[2] ----- - -Much better! This is just 1 reason why it is important to have the data in your dataframe be of the correct type. - -[TIP] -==== -Double check that the date got converted properly. The year for `liquor$date[1]` should be in 2015. -==== - -**Relevant topics:** `read.csv`, `fread`, `as.Date`, `str` - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_bfafd5d9?wid=_983291"></iframe> -++++ - -Create two new columns in the dataset called `year` and `month` based on the `Date` column. - -Which years are covered in this dataset regarding Iowa liquor sales? Do all years have all months represented? - -Use the `as.Date` function again, and set the format to contain only the information wanted. See an example below. - -[IMPORTANT] -==== -**Update:** It came to our attention that the `substr` method previously mentioned is _much_ less memory efficient and will cause the kernel to crash (if your project writer took the time to test _both_ ideas he had, you wouldn't have had this issue (sorry)). Please use the `as.Date` method shown below. -==== - -[source,r] ----- -myDate <- as.Date('2021-11-01') -day <- as.numeric(format(myDate,'%d')) ----- - -**Relevant topics:** `substr`, `as.numeric`, `format`, `unique`, `table` - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_i58ya6hg?wid=_983291"></iframe> -++++ - -A useful package for dealing with dates is called `lubridate`. The package is part of the famous `tidyverse` suite of packages. Run the code below to load it. - -[source,r] ----- -library(lubridate) ----- - -Re-do questions 1 and 2 using the `lubridate` package. Make sure to name the columns differently, for example `date_lb`, `year_lb` and `month_lb`. - -Do you have a preference for solving the questions? Why or why not? - -**Relevant topics:** https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_lubridate.pdf[Lubridate Cheat Sheet] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Sentence explaining which method you prefer and why. -==== - -=== Question 4 - -Now that we have the columns `year` and `month`, let's explore the data for time trends. - -What is the average volume (gallons) of liquor sold per month? Which month has the lowest average volume? Does that surprise you? - -[TIP] -==== -You can change the labels in the x-axis to be months by having the argument `xaxt` in the plot function set as "n" (`xaxt="n"`) and then having the following code at the end of your plot: `axis(side=1, at=1:12, labels=month.abb)`. -==== - -**Relevant topics:** `tapply`, `plot` - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences describing your findings. -==== - -=== Question 5 - -Make a line plot for the average volume sold per month for the years of 2012 to 2015. Your plot should contain 4 lines, one for each year. - -Make sure you specify a title, and label your axes. - -Write 1-2 sentences analyzing your plot. - -[TIP] -==== -There are many ways to get an average per month. You can use `for` loops, `apply` suite with your own function, `subset`, and `tapply` with a grouping that involves both year and month. -==== - -**Relevant topics:** `plot`, `line`, `subset`, `mean`, `sapply`, `tapply` - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences analyzing your plot. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project13.adoc deleted file mode 100644 index 72ebb31ac..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-project13.adoc +++ /dev/null @@ -1,210 +0,0 @@ -= STAT 19000: Project 13 -- Fall 2021 - -**Motivation:** It is always important to stay fresh and continue to hone and improve your skills. For example, games and events like https://adventofcode.com/[https://adventofcode.com/] are a great way to keep thinking and learning. Plus, you can solve the puzzles with any language you want! It can be a fun way to learn a new programming language. - -[quote, James Baker, ] -____ -Proper Preparation Prevents Poor Performance. 
-____ - -In this project we will continue to wade through data, with a special focus on the apply suite of functions, building your own functions, and graphics. - -**Context:** This is the _last_ project of the semester! Many of you will have already finished your 10 projects, but for those who have not, this should be a fun and straightforward way to keep practicing. - -**Scope:** r - -.Learning Objectives -**** -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Utilize apply functions in order to solve a data-driven problem. -- Gain proficiency using split, merge, and subset. -- Demonstrate the ability to create basic graphs with default settings. -- Demonstrate the ability to modify axes labels and titles. -- Incorporate legends using legend(). -- Demonstrate the ability to customize a plot (color, shape/linetype). -- Convert strings to dates, and format dates using the lubridate package. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ffsbzjx9?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ttjpyhi3?wid=_983291"></iframe> -++++ - -Run the lines of code below from project (12) to read the data and format the `year` and `month`. - -[source,r] ----- -library(data.table) -library(lubridate) - -liquor <- fread('/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt') -liquor$date <- mdy(liquor$Date) -liquor$year <- year(liquor$date) -liquor$month <- month(liquor$date) ----- - -Run the code below to get a better understanding of columns `State Bottle Cost` and the `State Bottle Retail`. - -[source,r] ----- -head(liquor[,c("State Bottle Cost", "State Bottle Retail")]) -typeof(liquor$`State Bottle Cost`) -typeof(liquor$`State Bottle Retail`) ----- - -Create two new columns, `cost` and `retail` to be `numeric` versions of `State Bottle Cost` and the `State Bottle Retail` respectively. - -Once you have those two new columns, create a column called `profit` that is the profit for each sale. Which sale had the highest profit? - -[TIP] -==== -There are many ways to solve the question. _Relevant topics_ contains functions to use in some possible solutions. -==== - -**Relevant topics:** gsub, substr, nchar, as.numeric, which.max - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The date, vendor name, number of bottles sold and profit for the sale with the highest profit. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_544r8rqj?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_sf4czdad?wid=_983291"></iframe> -++++ - -We want to provide useful information based on a `Vendor Number` to help in the decision making process. 
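-
-[NOTE]
-====
-The general shape of this kind of function (take an identifier, subset a data.frame, summarize with `tapply`, and plot) is sketched below using made-up column names (`id`, `yr`, `value`). This is a pattern to adapt, not code that uses the `liquor` columns:
-
-[source,r]
-----
-# a generic "filter, summarize, plot" pattern on made-up column names;
-# this assumes the grouping column (here, yr) is numeric, like a year
-plot_summary_for_id <- function(my_id, dat) {
-    one_id <- dat[dat$id == my_id, ]                     # keep rows for one id
-    avg_by_year <- tapply(one_id$value, one_id$yr, mean) # average value per year
-    plot(as.numeric(names(avg_by_year)), avg_by_year, type = "b",
-         xlab = "year", ylab = "average value",
-         main = paste("Summary for id", my_id))
-}
-
-# example usage with a toy data.frame:
-toy <- data.frame(id = c(1, 1, 1, 2), yr = c(2012, 2012, 2013, 2013), value = c(5, 7, 9, 4))
-plot_summary_for_id(1, toy)
-----
-
-Adapting the pattern means swapping in the real column names and the summary you actually want.
-====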
- -Create a function called `createDashboard` that takes two arguments: a specific `Vendor Number` and the `liquor` data frame, and returns a plot with the average profit per year, corresponding to the profit for that `Vendor Number`. - -**Relevant topics:** tapply, plot, mean - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The results of running `createDashboard(255, liquor)`. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_oplgvpqu?wid=_983291"></iframe> -++++ - -Modify your `createDashboard` function that uses the `liquor` data frame as the default value, if the user forgets to give the name of a data frame as input to the function. - -We are going to start adding additional plots to your function. Run the code below first, before you run the code to build your plots. This will organize many plots in a single plot. - -[source,r] ----- -par(mfrow=c(1, 2)) ----- - -Note that we are creating a dashboard in this question with 1 row and 2 columns. - -Add a bar plot to your dashboard that shows the total volume sold using `Bottle Volume (ml)`. - -Make sure to add titles to your plots. - -**Relevant topics:** table, barplot - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The results of running `createDashboard(255)`. -==== - -=== Question 4 - -Modify `par(mfrow=c(1, 2))` argument to be `par(mfrow=c(2, 2))` so we can fit 2 more plots in our dashboard. - -Create a plot that shows the average number of bottles sold per month. - -**Optional:** Modify the argument `mar` in `par()` to reduce the margins between the plots in our dashboard. - -**Relevant topics:** tapply, plot, mean - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The results of running `createDashboard(255)`. -==== - -=== Question 5 - -Add a plot to complete our dashboard. Write 1-2 sentences explaining why you chose the plot in question. - -**Optional:** Add, remove, and/or modify the dashboard to contain information you find relevant. Make sure to document why you are making the changes. - -**Relevant topics:** tapply, plot, mean - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The results of running `createDashboard(255)`. -==== - -=== Question 6 (optional, 0 pts) - -`patchwork` is a very cool R package that makes for a simple and intuitive way to combine many ggplot plots into a single graphic. See https://patchwork.data-imaginist.com/[here] for details. - -Re-write your function `createDashboard` to use `patchwork` and `ggplot`. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 7 (optional, 0 pts) - -Use your `createDashboard` function to compare 2 vendors. You can print the dashboard into a pdf using the code below. - -[source,r] ----- -pdf(file = "myFilename.pdf", # The directory and name you want to save the file in - width = 8, # The width of the plot in inches - height = 8) # The height of the plot in inches - -createDashboard(255) - -dev.off() ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-projects.adoc deleted file mode 100644 index fd021e64f..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/19000/19000-f2021-projects.adoc +++ /dev/null @@ -1,59 +0,0 @@ -= STAT 19000 - -== Project links - -[NOTE] -==== -Only the best 10 of 13 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -* xref:fall2021/19000/19000-f2021-officehours.adoc[STAT 19000 Office Hours for Fall 2021] -* xref:fall2021/19000/19000-f2021-project01.adoc[Project 1: Getting acquainted with Jupyter Lab] -* xref:fall2021/19000/19000-f2021-project02.adoc[Project 2: Introduction to R: part I] -* xref:fall2021/19000/19000-f2021-project03.adoc[Project 3: Introduction to R: part II] -* xref:fall2021/19000/19000-f2021-project04.adoc[Project 4: Control flow in R] -* xref:fall2021/19000/19000-f2021-project05.adoc[Project 5: Vectorized operations in R] -* xref:fall2021/19000/19000-f2021-project06.adoc[Project 6: Tapply] -* xref:fall2021/19000/19000-f2021-project07.adoc[Project 7: Base R functions] -* xref:fall2021/19000/19000-f2021-project08.adoc[Project 8: Functions in R: part I] -* xref:fall2021/19000/19000-f2021-project09.adoc[Project 9: Functions in R: part II] -* xref:fall2021/19000/19000-f2021-project10.adoc[Project 10: Lists & Sapply] -* xref:fall2021/19000/19000-f2021-project11.adoc[Project 11: Review: Focus on Sapply] -* xref:fall2021/19000/19000-f2021-project12.adoc[Project 12: Review: Focus on basic graphics] -* xref:fall2021/19000/19000-f2021-project13.adoc[Project 13: Review: Focus on apply suite] - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:55pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. After submitting your project in Gradescope, you will be able to download the project to verify that the content you submitted is what the graders will see. You will **not** get credit for or be able to re-submit your work if you accidentally uploaded the wrong project, or anything else. It is your responsibility to ensure that you are uploading the correct content. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. 
-==== - -== Piazza - -=== Sign up - -https://piazza.com/purdue/fall2021/stat19000 - -=== Link - -https://piazza.com/purdue/fall2021/stat19000/home - -== Syllabus - -++++ -include::book:ROOT:partial$syllabus.adoc[] -++++ - -== Office hour schedule - -++++ -include::book:ROOT:partial$office-hour-schedule.adoc[] -++++ \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project01.adoc deleted file mode 100644 index 3d9b1b754..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project01.adoc +++ /dev/null @@ -1,207 +0,0 @@ -= STAT 29000: Project 1 -- Fall 2021 - -== Mark~it~down, your first project back in The Data Mine - -**Motivation:** It's been a long summer! Last year, you got some exposure to both R and Python. This semester, we will venture away from R and Python, and focus on UNIX utilities like `sort`, `awk`, `grep`, and `sed`. While Python and R are extremely powerful tools that can solve many problems -- they aren't always the best tool for the job. UNIX utilities can be an incredibly efficient way to solve problems that would be much less efficient using R or Python. In addition, there will be a variety of projects where we explore SQL using `sqlite3` and `MySQL/MariaDB`. - -We will start slowly, however, by learning about Jupyter Lab. This year, instead of using RStudio Server, we will be using Jupyter Lab. In this project we will become familiar with the new environment, review some, and prepare for the rest of the semester. - -**Context:** This is the first project of the semester! We will start with some review, and set the "scene" to learn about some powerful UNIX utilities, and SQL the rest of the semester. - -**Scope:** Jupyter Lab, R, Python, scholar, brown, markdown - -.Learning Objectives -**** -- Read about and understand computational resources available to you. -- Learn how to run R code in Jupyter Lab on Scholar and Brown. -- Review R and Python. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -`/depot/datamine/data/` - -== Questions - -=== Question 1 - -In previous semesters, we've used a program called RStudio Server to run R code on Scholar and solve the projects. This year, we will be using Jupyter Lab almost exclusively. Let's being by launching your own private instance of Jupyter Lab using a small portion of the compute cluster. - -Navigate and login to https://ondemand.anvil.rcac.purdue.edu using 2-factor authentication (ACCESS login on Duo Mobile). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward. - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_dv46pmsw?wid=_983291"></iframe> -++++ - -[IMPORTANT] -==== -In the not-to-distant future, we will be using _both_ Scholar (https://gateway.scholar.rcac.purdue.edu) _and_ Brown (https://ondemand.brown.rcac.purdue.edu) to launch Jupyter Lab instances. For now, however, we will be using Brown. -==== - -Towards the middle of the top menu, there will be an item labeled btn:[My Interactive Sessions], click on btn:[My Interactive Sessions]. 
On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Datamine, Desktops, and GUIs. Under the Datamine section, you should see a button that says btn:[Jupyter Lab], click on btn:[Jupyter Lab]. - -If everything was successful, you should see a screen similar to the following. - -image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"] - -Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand uses SLURM to launch a job to run Jupyter Lab. This job has access to 1 CPU core and 3072 Mb of memory. It is OK to not understand what that means yet, we will learn more about this in STAT 39000. For the curious, however, if you were to open a terminal session in Scholar and/or Brown and run the following, you would see your job queued up. - -[source,bash] ----- -squeue -u username # replace 'username' with your username ----- - -After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on btn:[Connect to Jupyter] to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following. - -image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"] - -There are 2 primary options that you will need to know about. - -f2021-s2022:: -The course kernel where Python code is run without any extra work, and you have the ability to run R code or SQL queries in the same environment. - -[TIP] -==== -To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page]. -==== - -f2021-s2022-r:: -An alternative, native R kernel that you can use for projects with _just_ R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell. - -For now, let's focus on the f2021-s2022 kernel. Click on btn:[f2021-s2022], and a fresh notebook will be created for you. - -Test it out! Run the following code in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on. What is the name of the node you are running on? - -[source,python] ----- -import socket -print(socket.gethostname()) ----- - -[TIP] -==== -To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu. -==== - -.Items to submit -==== -- Code used to solve this problem in a "code" cell. -- Output from running the code (the name of the node you are running on). -==== - -=== Question 2 - -This year, the first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Scholar and Brown at `/depot/datamine/apps/templates/project_template.ipynb`). - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_5msf7x1z?wid=_983291"></iframe> -++++ - -Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`. 
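-
-If you prefer working from a terminal, one quick way to do this step is to copy the template straight into your home directory. This is just a sketch: it assumes the Scholar/Brown template path mentioned above, and you should substitute your own first and last name in the destination filename.
-
-[source,bash]
-----
-# copy the provided project template into your home directory,
-# renaming it to the required notebook name (replace firstname-lastname with your own)
-cp /depot/datamine/apps/templates/project_template.ipynb ~/firstname-lastname-project01.ipynb
-----
-
-Afterwards, the new notebook will show up in the Jupyter Lab file browser, and you can open it from there.
-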
- -There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain markdown text which you can render into nicely formatted text). How many cells of each type are there in this template by default? - -Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A. - -.Items to submit -==== -- How many of each types of cells are there in the default template? -==== - -=== Question 3 - -Last year, while using RStudio, you probably gained a certain amount of experience using RMarkdown -- a flavor of Markdown that allows you to embed and run code in Markdown. Jupyter Lab, while very different in many ways, still uses Markdown to add formatted text to a given notebook. It is well worth the small time investment to learn how to use Markdown, and create a neat and reproducible document. - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_r607ju5b?wid=_983291"></iframe> -++++ - -Create a Markdown cell in your notebook. Create both an _ordered_ and _unordered_ list. Create an unordered list with 3 of your favorite academic interests (some examples could include: machine learning, operating systems, forensic accounting, etc.). Create another _ordered_ list that ranks your academic interests in order of most-interested to least-interested. To practice markdown, **embolden** at least 1 item in you list, _italicize_ at least 1 item in your list, and make at least 1 item in your list formatted like `code`. - -[TIP] -==== -You can quickly get started with Markdown using this cheat sheet: https://www.markdownguide.org/cheat-sheet/ -==== - -[TIP] -==== -Don't forget to "run" your markdown cells by clicking the small "Play" button in the notebook menu. Running a markdown cell will render the text in the cell with all of the formatting you specified. Your unordered lists will be bulleted and your ordered lists will be numbered. -==== - -[TIP] -==== -If you are having trouble changing a cell due to the drop down menu behaving oddly, try changing browsers to Chrome or Safari. If you are a big Firefox fan, and don't want to do that, feel free to use the `%%markdown` magic to create a markdown cell without _really_ creating a markdown cell. Any cell that starts with `%%markdown` in the first line will generate markdown when run. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Browse https://www.linkedin.com and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown in a new Markdown cell. Include the following (at a minimum): - -- A header for this section (your choice of size) that says "About". -+ -[TIP] -==== -A Markdown header is a line of text at the top of a Markdown cell that begins with one or more `#`. -==== -+ -- The text of your personal "About" section that you would feel comfortable uploading to LinkedIn. -- In the about section, _for the sake of learning markdown_, include at least 1 link using Markdown's link syntax. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 5 - -Read xref:templates.adoc[the templates page] and learn how to run snippets of code in Jupyter Lab _other than_ Python. Run at least 1 example of Python, R, SQL, and bash. For SQL and bash, you can use the following snippets of code to make sure things are working properly. - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_crus3z0q?wid=_983291"></iframe> -++++ - -[source, sql] ----- --- Use the following sqlite database: /depot/datamine/data/movies_and_tv/imdb.db -SELECT * FROM titles LIMIT 5; ----- - -[source,bash] ----- -ls -la /depot/datamine/data/movies_and_tv/ ----- - -For your R and Python code, use this as an opportunity to review your skills. For each language, choose at least 1 dataset from `/depot/datamine/data`, and analyze it. Both solutions should include at least 1 custom function, and at least 1 graphic output. Make sure your code is complete, and well-commented. Include a markdown cell with your short analysis, for each language. - -[TIP] -==== -You could answer _any_ question you have about your dataset you want. This is an open question, just make sure you put in a good amount of effort. Low/no-effort solutions will not receive full credit. -==== - -[IMPORTANT] -==== -Once done, submit your projects just like last year. See the xref:submissions.adoc[submissions page] for more details. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentence analysis for each of your R and Python code examples. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project02.adoc deleted file mode 100644 index a7aa14149..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project02.adoc +++ /dev/null @@ -1,418 +0,0 @@ -= STAT 29000: Project 2 -- Fall 2021 - -== Navigating UNIX and using `bash` - -**Motivation:** The ability to navigate a shell, like `bash`, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful UNIX tools, help you navigate a filesystem, and even run UNIX tools from within your Jupyter Lab notebook. - -**Context:** At this point in time, our new Jupyter Lab system, using https://gateway.scholar.rcac.purdue.edu and https://gateway.brown.rcac.purdue.edu, is very new to everyone. The comfort with which you each navigate this UNIX-like operating system will vary. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within Jupyter Lab. - -**Scope:** bash, Jupyter Lab - -.Learning Objectives -**** -- Distinguish differences in `/home`, `/scratch`, `/class`, and `/depot`. 
-- Navigating UNIX via a terminal: `ls`, `pwd`, `cd`, `.`, `..`, `~`, etc. -- Analyzing file in a UNIX filesystem: `wc`, `du`, `cat`, `head`, `tail`, etc. -- Creating and destroying files and folder in UNIX: `scp`, `rm`, `touch`, `cp`, `mv`, `mkdir`, `rmdir`, etc. -- Use `man` to read and learn about UNIX utilities. -- Run `bash` commands from within Jupyter Lab. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -`/depot/datamine/data/` - -== Questions - -[IMPORTANT] -==== -If you are not a `bash` user and you use an alternative shell like `zsh` or `tcsh`, you will want to switch to `bash` for the remainder of the semester, for consistency. Of course, if you plan on just using Jupyter Lab cells, the `%%bash` magic will use `/bin/bash` rather than your default shell, so you will not need to do anything. -==== - -[NOTE] -==== -While it is not _super_ common for us to push a lot of external reading at you (other than the occasional blog post or article), https://learning.oreilly.com/library/view/learning-the-unix/0596002610[this] is an excellent, and _very_ short resource to get you started using a UNIX-like system. We strongly recommend readings chapters: 1, 3, 4, 5, & 7. It is safe to skip chapters 2, 6, and 8. -==== - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_k5efwaso?wid=_983291"></iframe> -++++ - -Let's ease into this project by taking some time to adjust the environment you will be using the entire semester, to your liking. Begin by launching your Jupyter Lab session from either https://gateway.scholar.rcac.purdue.edu or https://gateway.brown.rcac.purdue.edu. - -Explore the settings, and make at least 2 modifications to your environment, and list what you've changed. - -Here are some settings Kevin likes: - -- menu:Settings[JupyterLab Theme > JupyterLab Dark] -- menu:Settings[Text Editor Theme > material] -- menu:Settings[Text Editor Key Map > vim] -- menu:Settings[Terminal Theme > Dark] -- menu:Settings[Advanced Settings Editor > Notebook > codeCellConfig > lineNumbers > true] -- menu:Settings[Advanced Settings Editor > Notebook > kernelShutdown > true] -- menu:Settings[Advanced Settings Editor > Notebook > codeCellConfig > fontSize > 16] - -Dr. Ward does not like to customize his own environment, but he _does_ use the Emacs key bindings. - -- menu:Settings[Text Editor Key Map > emacs] - -[IMPORTANT] -==== -Only modify your keybindings if you know what you are doing, and like to use Emacs/Vi/etc. -==== - -.Items to submit -==== -- List (using a markdown cell) of the modifications you made to your environment. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_0cjqpz7p?wid=_983291"></iframe> -++++ - -In the previous project, we used the `ls` command to list the contents of a directory as an example of running bash code using the `f2021-s2022` kernel. Aside from use the `%%bash` magic from the previous project, there are 2 more straightforward ways to run bash code from within Jupyter Lab. - -The first method allows you to run a bash command from within the same cell as a cell containing Python code. For example. 
- -[source,ipython] ----- -!ls - -import pandas as pd -myDF = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]}) -myDF.head() ----- - -The second, is to open up a new terminal session. To do this, go to menu:File[New > Terminal]. This should open a new tab and a shell for you to use. You can make sure the shell is working by typing your first command, `man`. - -[source,bash] ----- -# man is short for manual -# use "k" or the up arrow to scroll up, or "j" or the down arrow to scroll down. -man man ----- - -What is the _absolute path_ of the default directory of your `bash` shell? - -**Relevant topics:** xref:book:unix:pwd.adoc[pwd] - -.Items to submit -==== -- The full filepath of the default directory (home directory). Ex: Kevin's is: `/home/kamstut`. -- The `bash` code used to show your home directory or current directory (also known as the working directory) when the `bash` shell is first launched. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ra6ke1wx?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_npkkp11r?wid=_983291"></iframe> -++++ - -It is critical to be able to navigate a UNIX-like operating system. It is more likely than not that you will need to use a UNIX-like system at some point in your career. Perform the following actions, in order, using the `bash` shell. - -[NOTE] -==== -I would recommend using a code cell with the magic `%%bash` to make sure that you are using the correct shell, and so your work is automatically saved. -==== - -. Write a command to navigate to the directory containing the datasets used in this course: `/depot/datamine/data/`. -. Print the current working directory, is the result what you expected? Output the `$PWD` variable, using the `echo` command. -. List the files within the current working directory (excluding subfiles). -. Without navigating out of `/depot/datamine/data/`, list _all_ of the files within the the `movies_and_tv` directory, _including_ hidden files. -. Return to your home directory. -. Write a command to confirm that you are back in the appropriate directory. - -[NOTE] -==== -`/` is commonly referred to as the root directory in a UNIX-like system. Think of it as a folder that contains _every_ other folder in the computer. `/home` is a folder within the root directory. `/home/kamstut` is the _absolute path_ of Kevin's home directory. There is a folder called `home` inside the root `/` directory. Inside `home` is another folder named `kamstut`, which is Kevin's home directory. -==== - -**Relevant topics:** xref:book:unix:pwd.adoc[pwd], xref:book:unix:cd.adoc[cd], xref:book:unix:echo.adoc[echo], xref:book:unix:ls.adoc[ls] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_4dn6j15w?wid=_983291"></iframe> -++++ - -When running the `ls` command, you may have noticed two oddities that appeared in the output: "." and "..". `.` represents the directory you are currently in, or, if it is a part of a path, it means "this directory". For example, if you are in the `/depot/datamine/data` directory, the `.` refers to the `/depot/datamine/data` directory. 
If you are running the following bash command, the `.` is redundant and refers to the `/depot/datamine/data/yelp` directory. - -[source,bash] ----- -ls -la /depot/datamine/data/yelp/. ----- - -`..` represents the parent directory, relative to the rest of the path. For example, if you are in the `/depot/datamine/data` directory, the `..` refers to the parent directory, `/depot/datamine`. - -Any path that contains either `.` or `..` is called a _relative path_. Any path that contains the entire path, starting from the root directory, `/`, is called an _absolute path_. - -. Write a single command to navigate to our modulefiles directory: `/depot/datamine/opt/modulefiles` -. Write a single command to navigate back to your home directory, however, rather than using `cd`, `cd ~`, or `cd $HOME` without the path argument, use `cd` and a _relative_ path. - -**Relevant topics:** xref:book:unix:pwd.adoc[pwd], xref:book:unix:cd.adoc[cd], xref:book:unix:special-symbols.adoc[. & .. & ~] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_kb21hk61?wid=_983291"></iframe> -++++ - -Your `$HOME` directory is your default directory. You can navigate to your `$HOME` directory using any of the following commands. - -[source,bash] ----- -cd -cd ~ -cd $HOME -cd /home/$USER ----- - -This is typically where you will work, and where you will store your work (for instance, your completed projects). At the time of writing this, the `$HOME` directories on Brown and Scholar are **not** synced. What this means is, files you create on one cluster _will not_ be available on the other cluster. To move files between clusters, you will need to copy them using `scp` or `rsync`. - -[NOTE] -==== -`$HOME` and `$USER` are environment variables. You can see what they are by typing `echo $HOME` and `echo $USER`. Environment variables are variables that are set by the system, or by the user. To get a list of your terminal session's environment variables, type `env`. -==== - -The `depot` space is a network file system (as is the `home` space, albeit on a different system). It is attached to the root directory on all of the nodes in the cluster. One convenience that this provides is files in this space exist everywhere the filesystem is mounted! In summary, files added anywhere in `/depot/datamine` will be available on _both_ Scholar and Brown. Although you will not utilize this space _very_ often (other than to access project datasets), this is good information to know. - -There exists 1 more important location on each cluster, `scratch`. Your `scratch` directory is located in the same place on either cluster: `/scratch/$RCAC_CLUSTER/$USER`. `scratch` is meant for use with _really_ large chunks of data. The quota on Brown is 200TB and 2 million files. The quota on Scholar is 1TB and 2 million files. You can see your quota and usage on each system by running the following command. - -[source,bash] ----- -myquota ----- - -[TIP] -==== -`$RCAC_CLUSTER` and `$USER` are environment variables. You can see what they are by typing `echo $RCAC_CLUSTER` and `echo $USER`. `$RCAC_CLUSTER` contains the name of the cluster (for this course, "scholar" or "brown"), and `$USER` contains the username of the current user. -==== - -. Navigate to your `scratch` directory. -. Confirm you are in the correct location using a command. -. 
Execute the `tokei` command, with input `~dgc/bin`. -+ -[NOTE] -==== -Doug Crabill is a the compute wizard for the Statistics department here at Purdue. `~dgc/bin` is a directory he has made publicly available with a variety of useful scripts. -==== -+ -. Output the first 5 lines and last 5 lines of `~dgc/bin/union`. -. Count the number of lines in the bash script `~dgc/bin/union` (using a UNIX command). -. How many bytes is the script? -+ -[CAUTION] -==== -Be careful. We want the size of the script, not the disk usage. -==== -+ -. Find the location of the `tokei` command. - -[TIP] -==== -When you type `myquota` on Scholar or Brown there are sometimes warnings about xauth. If you get a warning that says something like the following warning, you can safely ignore it. - -[quote, , Scholar/Brown] -____ -Warning: untrusted X11 forwarding setup failed: xauth key data not generated -____ -==== - -[TIP] -==== -Commands often have _options_. _Options_ are features of the program that you can trigger specifically. You can see the options of a command in the DESCRIPTION section of the man pages. - -[source,bash] ----- -man wc ----- - -You can see -m, -l, and -w are all options for `wc`. Then, to test the options out, you can try the following examples. - -[source,bash] ----- -# using the default wc command. "/depot/datamine/data/flights/1987.csv" is the first "argument" given to the command. -wc /depot/datamine/data/flights/1987.csv - -# to count the lines, use the -l option -wc -l /depot/datamine/data/flights/1987.csv - -# to count the words, use the -w option -wc -w /depot/datamine/data/flights/1987.csv - -# you can combine options as well -wc -w -l /depot/datamine/data/flights/1987.csv - -# some people like to use a single tack `-` -wc -wl /depot/datamine/data/flights/1987.csv - -# order doesn't matter -wc -lw /depot/datamine/data/flights/1987.csv ----- -==== - -**Relevant topics:** xref:book:unix:pwd.adoc[pwd], xref:book:unix:cd.adoc[cd], xref:book:unix:head.adoc[head], xref:book:unix:tail.adoc[tail], xref:book:unix:wc.adoc[wc], xref:book:unix:du.adoc[du], xref:book:unix:which.adoc[which], xref:book:unix:type.adoc[type] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 6 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_v6kwns2h?wid=_983291"></iframe> -++++ - -Perform the following operations. - -. Navigate to your scratch directory. -. Copy the following file to your current working directory: `/depot/datamine/data/movies_and_tv/imdb.db`. -. Create a new directory called `movies_and_tv` in your current working directory. -. Move the file, `imdb.db`, from your scratch directory to the newly created `movies_and_tv` directory (inside of scratch). -. Use `touch` to create a new, empty file called `im_empty.txt` in your scratch directory. -. Remove the directory, `movies_and_tv`, from your scratch directory, including _all_ of the contents. -. Remove the file, `im_empty.txt`, from your scratch directory. - -**Relevant topics:** xref:book:unix:cp.adoc[cp], xref:book:unix:rm.adoc[rm], xref:book:unix:touch.adoc[touch], xref:book:unix:cd.adoc[cd] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 7 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_vg0w9rpf?wid=_983291"></iframe> -++++ - -[IMPORTANT] -==== -This question should be performed by opening a terminal window. menu:File[New > Terminal]. Enter the result/content in a markdown cell in your notebook. -==== - -Tab completion is a feature in shells that allows you to tab through options when providing an argument to a command. It is a _really_ useful feature, that you may not know is there unless you are told! - -Here is the way it works, in the most common case -- using `cd`. Have a destination in mind, for example `/depot/datamine/data/flights/`. Type `cd /depot/d`, and press tab. You should be presented with a large list of options starting with `d`. Type `a`, then press tab, and you will be presented with an even smaller list. This time, press tab repeatedly until you've selected `datamine`. You can then continue to type and press tab as needed. - -Below is an image of the absolute path of a file in the Data Depot. Use `cat` and tab completion to print the contents of that file. - -image::figure03.webp[Tab completion, width=792, height=250, loading=lazy, title="Tab completion"] - -.Items to submit -==== -- The content of the file, `hello_there.txt`, in a markdown cell in your notebook. -==== - -=== Question 8 (optional, 0 pts, but recommended) - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_s7sphj5m?wid=_983291"></iframe> -++++ - -[IMPORTANT] -==== -For this question, you will most likely want to launch a terminal. To launch a terminal click on menu:File[New > Terminal]. No need to input this question in your notebook. -==== - -. Use `vim`, `emacs`, or `nano` to create a new file in your scratch directory called `im_still_here.sh`. Add the following contents to the file, save, and close it. -+ -[source,bash] ----- -#!/bin/bash - -i=0 - -while true -do - echo "I'm still here! Count: $i" - sleep 1 - ((i+=1)) -done ----- -+ -. Confirm the contents of the file using `cat`. -. Try and run the program by typing `im_still_here.sh`. -+ -[NOTE] -==== -As you can see, simply typing `im_still_here.sh` will not work. You need to run the program with `./im_still_here.sh`. The reason is, by default, the operating system looks at the locations in your `$PATH` environment variable for executables to execute. `im_still_here.sh` is not in your `$PATH` environment variable, so it will not be found. In order to make it clear _where_ the program is, you need to run it with `./`. -==== -+ -. Instead, try and run the program by typing `./im_still_here.sh`. -+ -[NOTE] -==== -Uh oh, another warning. This time, you get a warning that says something like "permission denied". In order to execute a program, you need to grant the program execute permissions. To grant execute permissions for your program, run the following command. - -[source,bash] ----- -chmod +x im_still_here.sh ----- -==== -+ -. Try and run the program by typing `./im_still_here.sh`. -. The program should begin running, printing out a count every second. -. Suspend the program by typing kbd:[Ctrl+Z]. -. Run the program again by typing `./im_still_here.sh`, then suspend it again. -. Run the command, `jobs`, to see the jobs you have running. -. To continue running a job, use either the `fg` command or `bg` command. 
-+ -[TIP] -==== -`fg` stands for foreground and `bg` stands for background. - -`fg %1` will continue to run job 1 in the foreground. During this time you will not have the shell available for you to use. To re-suspend the program, you can press kbd:[Ctrl+Z] again. - -`bg %1` will run job 1 in the background. During this time the shell will be available to use. Try running `ls` to demonstrate. Note that the program, although running in the background, will still be printing to your screen. Although annoying, you can still run and use the shell. In this case, however, you will most likely want to stop running this program in the background due to its disruptive behavior. kdb:[Ctrl+Z] will will no longer suspend the program, because this program is running in the background, not foreground. To suspend the program, first send it to the foreground with `fg %1`, _then_ use kbd:[Ctrl+Z] to suspend it. -==== - -Experiment moving the jobs to the foreground, background, and suspended until you feel comfortable with it. It is a handy trick to learn! - -[TIP] -==== -By default, a program is launched in the foreground. To run a program in the background at the start, and the command with a `&`, like in the following example. - -[source,bash] ----- -./im_still_here.sh & ----- -==== - -.Items to submit -==== -- Code used to solve this problem. Since you will need to use kbd:[Ctrl+Z], and things of that nature, when what you are doing isn't "code", just describe what you are did. For example, if I press kbd:[Ctrl+Z], I would say "I pressed kbd:[Ctrl+Z]". -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project03.adoc deleted file mode 100644 index 76027c880..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project03.adoc +++ /dev/null @@ -1,191 +0,0 @@ -= STAT 29000: Project 3 -- Fall 2021 - -== Regular expressions, irregularly satisfying, introduction to `grep` and regular expressions - -**Motivation:** The need to search files and datasets based on the text held within is common during various parts of the data wrangling process -- after all, projects in industry will not typically provide you with a path to your dataset and call it a day. `grep` is an extremely powerful UNIX tool that allows you to search text using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated, https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/[even professionals can make critical mistakes]. With that being said, learning some of the basics is an incredible tool that will come in handy regardless of the language you are working in. - -[NOTE] -==== -Regular expressions are not something you will be able to completely escape from. They exist in some way, shape, and form in all major programming languages. Even if you are less-interested in UNIX tools (which you shouldn't be, they can be awesome), you should definitely take the time to learn regular expressions. 
-==== - -**Context:** We've just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, `grep`, and experiment with regular expressions using `grep`, R, and later on, Python. - -**Scope:** `grep`, regular expression basics, utilizing regular expression tools in R and Python - -.Learning Objectives -**** -- Use `grep` to search for patterns within a dataset. -- Use `cut` to section off and slice up data from the command line. -- Use `wc` to count the number of lines of input. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -`/anvil/projects/tdm/data/consumer_complaints/complaints.csv` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_u72l7lgf?wid=_983291"></iframe> -++++ - -`grep` stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate `grep`, we will be using it with textual data. - -Let's assume for a second that we _didn't_ provide you with the location of this projects dataset, and you didn't know the name of the file either. With all of that being said, you _do_ know that it is the only dataset with the text "That's the sort of fraudy fraudulent fraud that Wells Fargo defrauds its fraud-victim customers with. Fraudulently." in it. (When you search for this sentence in the file, make sure that you type the single quote in "That's" so that you get a regular ASCII single quote. Otherwise, you will not find this sentence.) - -Write a `grep` command that finds the dataset. You can start in the `/depot/datamine/data` directory to reduce the amount of text being searched. In addition, use a wildcard to reduce the directories we search to only directories that start with a `c` inside the `/depot/datamine/data` directory. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -In the previous project, you learned about a command that could quickly print out the first _n_ lines of a file. A csv file typically has a header row to explain what data each column holds. Use the command you learned to print out the first line of the file, and _only_ the first line of the file. - -Great, now that you know what each column holds, repeat question (1), but, format the output so that it shows the `complaint_id`, `consumer_complaint_narrative`, and the `state`. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_hokcx3fx?wid=_983291"></iframe> -++++ - -Imagine a scenario where we are dealing with a _much_ bigger dataset. Imagine that we live in the southeast and are really only interested in analyzing the data for Florida, Georgia, Mississippi, Alabama, and South Carolina. In addition, we are only interested in in the `complaint_id`, `state`, `consumer_complaint_narrative`, and `tags`. 
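-
-Before tackling the next step, it can help to know which field number each of these columns occupies in `complaints.csv`. The snippet below is one quick way to check (a sketch only, using the dataset path listed above; the field numbers it prints depend on the file's header row and are not given here).
-
-[source,bash]
-----
-# print the header row, put each column name on its own line, and number the lines,
-# so you know which field numbers to hand to cut later on
-head -n 1 /anvil/projects/tdm/data/consumer_complaints/complaints.csv | tr ',' '\n' | cat -n
-----
-
-The numbers this prints are the values you would pass to `cut` (with `-d,` and `-f`) when slicing out the columns of interest.
-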
- -Use UNIX tools to, in one line, create a _new_ dataset called `southeast.csv` that only contains the data for the five states mentioned above, and only the columns listed above. - -[TIP] -==== -Be careful you don't accidentally get lines with a word like "CAPITAL" in them (AL is the state code of Alabama and is present in the word "CAPITAL"). -==== - -How many rows of data remain? How many megabytes is the new file? Use `cut` to isolate _just_ the data we ask for. For example, _just_ print the number of rows, and _just_ print the value (in Mb) of the size of the file. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -We want to isolate some of our southeast complaints. Return rows from our new dataset, `southeast.csv`, that have one of the following words: "wow", "irritating", or "rude" followed by at least 1 exclamation mark. Do this with just a single `grep` command. Ignore case (whether or not parts of the "wow", "rude", or "irritating" words are capitalized or not). - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -If you pay attention to the `consumer_complaint_narrative` column, you'll notice that some of the narratives contain dollar amounts in curly braces `{` and `}`. Use `grep` to find the narratives that contain at least one dollar amount enclosed in curly braces. Use `head` to limit output to only the first 5 results. - -[TIP] -==== -Use the option `--color=auto` to get some nice, colored output (if using a terminal). -==== - -[TIP] -==== -Use the option `-E` to use extended regular expressions. This will make your regular expressions less messy (less escaping). -==== - -[NOTE] -==== -There are instances like `{>= $1000000}` and `{ XXXX }`. The first example qualifies, but the second doesn't. Make sure the following are matched: - -- {$0.00} -- { $1,000.00 } -- {>= $1000000} -- { >= $1000000 } - -And that the following are _not_ matched: - -- { XXX } -- {XXX} -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 6 - -As mentioned earlier on, every major language has some sort of regular expression package. Use either the `re` package in Python (or string methods in `pandas`, for example, `findall`), or the `grep`, `grepl`, and `stringr` packages in R to perform the same operation in question (5). - -[TIP] -==== -If you are using `pandas`, there will be 3 types of results: lists of strings, empty lists, and `NA` values. You can convert your empty lists to `NA` values like this. - -[source,python] ----- -dat['amounts'] = dat['amounts'].apply(lambda x: pd.NA if x==[] else x) ----- - -Then, dat['amounts'] will be a `pandas` Series with values `pd.NA` or a list of strings. Which you can filter like this. - -[source,python] ----- -dat['amounts'].loc[dat['amounts'].notna()] ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 7 (optional, 0 pts) - -As mentioned earlier on, every major language has some sort of regular expression package. Use either the `re` package in Python, or the `grep`, `grepl`, and `stringr` packages in R to create a new column in your data frame (`pandas` or R data frame) named `amounts` that contains a semi-colon separated string of dollar amounts _without_ the dollar sign. 
For example, if the dollar amounts are $100, $200, and $300, the amounts column should contain `100.00;200.00;300.00`. - -[TIP] -==== -One good way to do this is to use the `apply` method on the `pandas` Series. - -[source,python] ----- -dat['amounts'] = dat['amounts'].apply(some_function) ----- -==== - -[TIP] -==== -This is one way to test if a value is `NA` or not. - -[source,python] ----- -isinstance(my_list, type(pd.NA)) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project04.adoc deleted file mode 100644 index 2e953f88e..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project04.adoc +++ /dev/null @@ -1,164 +0,0 @@ -= STAT 29000: Project 4 -- Fall 2021 - -== Extracting and summarizing data in bash - -**Motivation:** Becoming comfortable chaining commands and getting used to navigating files in a terminal is important for every data scientist to do. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc. While it is always fair to whip together a script using your favorite language, you may find that these UNIX tools are a better fit for your needs. - -**Context:** We've been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping. - -**Scope:** grep, regular expression basics, UNIX utilities, redirection, piping - -.Learning Objectives -**** -- Use `cut` to section off and slice up data from the command line. -- Use piping to string UNIX commands together. -- Use `sort` and it's options to sort data in different ways. -- Use `head` to isolate n lines of output. -- Use `wc` to summarize the number of lines in a file or in output. -- Use `uniq` to filter out non-unique lines. -- Use `grep` to search files effectively. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/stackoverflow/unprocessed/*` -- `/depot/datamine/data/stackoverflow/processed/*` -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_tkurpp0y?wid=_983291"></iframe> -++++ - -One of the first things to do when first looking at a dataset is reading the first few lines of data in the file. Typically, there will be some headers which describe the data, _and_ you get to see what some of the data looks like. Use the UNIX `head` command to read the first few lines of the data in `unprocessed/2011.csv`. 
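-
-For instance, a minimal sketch of this step, using the full path from the dataset list above and an arbitrary choice of 5 lines, could look like the following.
-
-[source,bash]
-----
-# peek at the first 5 lines of the unprocessed 2011 survey data
-head -n 5 /depot/datamine/data/stackoverflow/unprocessed/2011.csv
-----
-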
- -As you will quickly see, this dataset is just too wide -- there are too many columns -- to be useful. Let's try and count the number of columns using `head`, `tr`, and `wc`. If we can get the first row, replace `,`'s with newlines, then use `wc -l` to count the number of lines, this should work, right? What happens? - -[TIP] -==== -The newline character in UNIX is `\n`. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_c9ivsdpy?wid=_983291"></iframe> -++++ - -As you can see, csv files are not always so straightforward to parse. For this particular set of questions, we want to focus on using other UNIX tools that are more useful on semi-clean datasets. Take a look at the first few lines of the data in `processed/2011.csv`. How many columns are there? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_wjf7g7py?wid=_983291"></iframe> -++++ - -Let's switch gears, and look at a larger dataset with more data to analyze. Check out `iowa_liquor_sales_cleaner.txt`. What are the 5 largest orders by number of bottles sold? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ql9f9end?wid=_983291"></iframe> -++++ - -What are the different sizes (in ml) that a bottle of liquor comes in? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Which store has the most invoices? There are 2 columns you could potentially use to solve this problem, which should you use and why? For this dataset, does it end up making a difference? - -[NOTE] -==== -This may take a few minutes to run. Grab a coffee. To prevent wasting time, try practicing on the `head` of the data instead of the entire data. -==== - -[IMPORTANT] -==== -Be _very_ careful when using `uniq`. Read the man pages for `uniq`, otherwise, you may not get the correct solution. - -[source,bash] ----- -man uniq ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 6 - -`sort` is a particularly powerful function, albeit not always the most user friendly when compared to other tools. - -For the largest sale (in USD), what was the volume sold in liters? - -For the largest sale (in liters of liquor sold), what was the total cost (in USD)? - -[TIP] -==== -Use the `-k` option with sort to solve these questions. -==== - -[TIP] -==== -To remove a dollar sign from text using `tr`, do the following. - -[source,bash] ----- -tr -d '$' ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 7 - -Use `head`, `grep`, `sort`, `uniq`, `wc`, and any other UNIX utilities you feel comfortable using to answer a data-driven question about the `iowa_liquor_sales_cleaner.txt` dataset. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project05.adoc deleted file mode 100644 index b2c971096..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project05.adoc +++ /dev/null @@ -1,199 +0,0 @@ -= STAT 29000: Project 5 -- Fall 2021 - -**Motivation:** `awk` is a programming language designed for text processing. It can be a quick and efficient way to quickly parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation extremely quickly using something like `awk`. - -**Context:** This is the first project where we introduce `awk`. `awk` is a powerful tool that can be used to perform a variety of the tasks that we've previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner. - -**Scope:** awk, UNIX utilities - -.Learning Objectives -**** -- Use awk to process and manipulate textual data. -- Use piping and redirection within the terminal to pass around data between utilities. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_g4zf3xdo?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_u5yq4muu?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_l5m9s83y?wid=_983291"></iframe> -++++ - -While the UNIX tools we've used up to this point are very useful, `awk` enables many new capabilities, and can even replace major functionality of other tools. - -In a previous question, we asked you to write a command that printed the number of columns in the dataset. Perform the same operation using `awk`. - -Similarly, we've used `head` to print the header line. Use `awk` to do the same. - -Similarly, we've used `wc` to count the number of lines in the dataset. Use `awk` to do the same. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_v88zb7x2?wid=_983291"></iframe> -++++ - -In a previous question, we used `sort` in combination with `uniq` to find the stores with the most number of sales. - -Use `awk` to find the 10 stores with the most number of sales. 
In a previous solution, our output was minimal -- we had a count and a store number. This time, take some time to format the output nicely, _and_ use the store number to find the count (not store name). - -[TIP] -==== -Sorting an array by values in `awk` can be confusing. Check out https://stackoverflow.com/questions/5342782/sort-associative-array-with-awk[this excellent stackoverflow post] to see a couple of ways to do this. "Edit 2" is the easiest one to follow. -==== - -[NOTE] -==== -You can even use the store number to count the number of sales and save the most recent store name for the store number as you go to _print_ the store names with the output. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_iyl7khfu?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_jxmozwl4?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_jpst446s?wid=_983291"></iframe> -++++ - -Calculate the total number of sales (in USD) by county. Do this using any UNIX commands you have available. Then, do this using _only_ `awk`. - -[TIP] -==== -`gsub` is a powerful awk utility that allows you to replace a string with another string. For example, you could replace all `$`'s in field 2 with nothing by: - ----- -gsub(/\$/, "", $2) ----- -==== - -[NOTE] -==== -The `gsub` operation happens in-place. In a nutshell, what this means is that the original field, `$2` is replaced with the result of the `gsub` operation. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_hk6u89o6?wid=_983291"></iframe> -++++ - -Use `awk` and piping to create a new dataset with the following columns, for every store, by month: - -- `month_number`: the month number (01-12) -- `year`: the year (4-digit year, e.g., 2015) -- `store_name`: store name -- `volume_sold`: total volume sold -- `sold_usd`: total amount sold in USD - -Call the new dataset `sales_by_store.csv`. - -[TIP] -==== -Feel free to use the store name as a key for simplicity. -==== - -[TIP] -==== -`split` is another powerful function in `awk` that allows you to split a string into multiple fields. You could, for example, extract the year from the date field as follows. - -[source,awk] ----- -split($2, dates, "/", seps); ----- - -Then, you can access the year using `dates[3]`. -==== - -[TIP] -==== -You can use multiple values as a key in `awk`. This is a cool trick to count or calculate something by year, for example. - -[source,awk] ----- -myarray[$4dates[3]]++ ----- - -Here, `$4` is the 4th field, `dates[3]` is the year. The resulting key would be something like "My Store Name2014", and we would have a new key (and associated value) for each store/year combination. In the provided code (below), Dr Ward suggests the use of a triple key, which includes the store name, the month, and the year. 
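-
-For instance, here is a minimal, self-contained sketch of tallying values by such a composite key. It uses a few made-up sample lines (not the real liquor data) purely to show the mechanics:
-
-[source,bash]
-----
-# made-up sample lines in the form store;MM/DD/YYYY;amount
-printf 'StoreA;01/02/2014;100\nStoreA;01/15/2014;50\nStoreB;02/01/2015;75\n' |
-awk -F';' '{
-    split($2, dates, "/");                    # dates[1] is the month, dates[3] is the year
-    totals[$1";"dates[1]";"dates[3]] += $3;   # composite store;month;year key
-} END{
-    for (key in totals) print key";"totals[key]
-}'
-----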
-==== - -[TIP] -==== -Dr Ward walks you through a method of solution for this problem, in the video - -[source,awk] ----- -cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | - awk -F\; 'BEGIN{ print "store_name;month_number;year;sold_usd;volume_sold" } - {gsub(/\$/, "", $22); split($2, dates, "/", seps); - mysales[$4";"dates[1]";"dates[3]] += $22; - myvolumes[$4";"dates[1]";"dates[3]] += $24; - } - END{ for (mytriple in mysales) {print mytriple";"mysales[mytriple]";"myvolumes[mytriple]}}' >sales_by_store.csv ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Use `awk` to count how many times each store has sold more than $500,000 in a month. Output should be similar to the following. Sort the output from highest count to lowest. - ----- -store_name,count ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project06.adoc deleted file mode 100644 index ac4a5677d..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project06.adoc +++ /dev/null @@ -1,641 +0,0 @@ -= STAT 29000: Project 6 -- Fall 2021 - -== The anatomy of a bash script - -**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. `awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks. - -**Context:** This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently. - -**Scope:** awk, bash scripts, UNIX utilities - -.Learning Objectives -**** -- Use awk to process and manipulate textual data. -- Use piping and redirection within the terminal to pass around data between utilities. -- Write bash scripts to automate potential repeated tasks. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/election/*` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_unbotb66?wid=_983291"></iframe> -++++ - -[IMPORTANT] -==== -Originally, this project was a bit more involved than intended. 
For this reason, I have provided you the solution to the question below the last "note" in this question. Instead of writing this script, I would like you to study it and try and understand what is going on. -==== - -We now have a grip on a variety of useful tools, that are often used together using pipes and redirections. As you start to see, "one-liners" can start to become a bit unwieldy. In these cases, wrapping everything into a bash script can be a good solution. - -Imagine for a minute, that you have a single file that is continually appended to by another system. Let's say this file is `/depot/datamine/data/election/itcont1990.txt`. Every so often, your manager asks you to generate a summary of the data in this file. Every time you do this, you have to dig through old notes to remember how you did this previously. Instead of constantly doing this manual process, you decide to write a script to handle this for you! - -Write a bash script to generate a summary of the data in `/depot/datamine/data/election/itcont1990.txt`. The summary should include the following information, in the following format. - -.... -120 RECORDS READ ----------------- -File: /depot/datamine/data/election/itcont1990.txt -Largest donor: -Most common donor state: NY -Total donations in USD by state: -- NY: 100000 -- CA: 50000 -... ----------------- -.... - -[NOTE] -==== -For this question, assume that the data file will _always_ be in the same location. -==== - -[source,bash] ----- -#!/bin/bash - -FILE=/depot/datamine/data/election/itcont1990.txt - -RECORDS_READ=`wc -l $FILE | awk '{print $1}'` - -awk -v RECORDS_READ="$RECORDS_READ" -F'|' 'BEGIN{ - print RECORDS_READ" RECORDS READ\n----------------"; -}{ - donor_total_by_name[$8] += $15; - most_common_donor_by_state[$10]++; - donor_total_by_state[$10] += $15; -}END{ - PROCINFO["sorted_in"] = "@val_num_desc"; - print "File: "FILENAME; - - ct=0; - - for (i in donor_total_by_name) { - if (ct < 1) { - print "Largest donor: " i; - ct++; - } - }; - - ct=0; - - for (i in most_common_donor_by_state) { - if (ct < 1) { - print "Most common donor state: " i; - ct++; - } - } - - print "Total donations in USD by state:"; - - for (i in donor_total_by_state) { - if (i != "STATE" && i != "") { - print "\t- " i ": " donor_total_by_state[i]; - } - } - - print "----------------"; - -}' "$FILE" ----- - -In order to run this script, you will need to paste the contents into a new file called `firstname-lastname-q1.sh` in your `$HOME` directory. In a new bash cell, run it as follows. - -[source,ipython] ----- -%%bash - -chmod +x $HOME/firstname-lastname-q1.sh -$HOME/firstname-lastname-q1.sh ----- - -That `chmod` command is necessary to ensure that you can execute the script. - -Create the script and run the script in a bash cell. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_6f9gbt4l?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ucc4u6rf?wid=_983291"></iframe> -++++ - -Your manager loves your script, but wants you to modify it so it works with any file formatted the same way. A new system is being installed that saves new data into new files rather than appending to the same file. 
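-
-As a generic illustration (not the solution script itself), a bash script can pick up a file path from its first positional argument, `$1`, along the following lines:
-
-[source,bash]
-----
-#!/bin/bash
-
-# the first argument passed to the script becomes the file to process
-FILE=$1
-
-# fail early with a usage hint if no argument was supplied
-if [[ -z "$FILE" ]]; then
-    echo "Usage: $0 /path/to/datafile.txt"
-    exit 1
-fi
-
-wc -l "$FILE"
-----
-
-Invoking such a script with a path after its name makes that path available inside the script as `$1`, which is exactly the hook you need for this question.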
- -Modify the script from question (1) to accept an argument that specifies the file to process. - -Start by copying the cold script from question (1) into a new file called `firstname-lastname-q2.sh`. - -[source,ipython] ----- -%%bash - -cp $HOME/firstname-lastname-q1.sh $HOME/firstname-lastname-q2.sh ----- - -Then, test the updated script out on `/depot/datamine/data/election/itcont2000.txt`. - -[source,ipython] ----- -%%bash - -$HOME/firstname-lastname-q2.sh /depot/datamine/data/election/itcont2000.txt ----- - -[TIP] -==== -You can edit your scripts directly within Jupyter Lab by right clicking the files and opening in the editor. -==== - -[TIP] -==== -The only difference between the two scripts are the new script you will be able to replace the $FILE argument to the `wc` command with something else. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_stc9vywg?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_vzi3uj7h?wid=_983291"></iframe> -++++ - -Modify your script once again to accept _n_ arguments, each a path to another file to generate a summary for. - -Start by copying the cold script from question (2) into a new file called `firstname-lastname-q3.sh`. - -[source,ipython] ----- -%%bash - -cp $HOME/firstname-lastname-q2.sh $HOME/firstname-lastname-q3.sh ----- - -You should be able to run the script as follows. - -[source,ipython] ----- -%%bash - -$HOME/firstname-lastname-q3.sh /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt ----- - -.... -155 RECORDS READ ----------------- -File: /depot/datamine/data/election/itcont2000.txt -Largest donor: -Most common donor state: NY -Total donations in USD by state: -- NY: 100000 -- CA: 50000 -... ----------------- - -120 RECORDS READ ----------------- -File: /depot/datamine/data/election/itcont1990.txt -Largest donor: -Most common donor state: NY -Total donations in USD by state: -- NY: 100000 -- CA: 50000 -... ----------------- -.... - -[TIP] -==== -Again, the modification that will need to be made here aren't so bad at all! If you just wrap the entirety of question (2)'s solution in a for loop where you loop through each argument, you'll just need to make sure you change the $FILE argument to the `wc` command to be the argument you are setting in each loop. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_vwqdigob?wid=_983291"></iframe> -++++ - -[IMPORTANT] -==== -Originally, this project was a bit more involved than intended. For this reason, I have provided you the solution to the question below the last "tip" in this question. Instead of writing this script, I would like you to study it and try and understand what is going on, and run the example we provide. -==== - -You are _particularly_ interested in donors from your alma mater, https://purdue.edu[Purdue University]. Modify your script from question (3) yet again. 
This time, add a flag, that, when present, will include the name and amount for each donor where the word "purdue" (case insensitive) is present in the `EMPLOYER` column. - -[source,ipython] ----- -%%bash - -$HOME/firstname-lastname-q4.sh -p /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt ----- - -.... -155 RECORDS READ ----------------- -File: /depot/datamine/data/election/itcont2000.txt -Largest donor: ASARO, SALVATORE -Most common donor state: NY -Purdue donors: -- John Smith: 500 -- Alice Bob: 1000 -Total donations in USD by state: -- NY: 100000 -- CA: 50000 -... ----------------- - -120 RECORDS READ ----------------- -File: /depot/datamine/data/election/itcont1990.txt -Largest donor: ASARO, SALVATORE -Most common donor state: NY -Purdue donors: -- John Smith: 500 -- Alice Bob: 1000 -Total donations in USD by state: -- NY: 100000 -- CA: 50000 -... ----------------- -.... - -[TIP] -==== -https://stackoverflow.com/a/29754866[This] stackoverflow response has an excellent template using `getopt` to parse your flags. Use this as a "start". -==== - -[TIP] -==== -You may want to comment out or delete the part of the template that limits your non-flag arguments to one. -==== - -[source,bash] ----- -#!/bin/bash - -# More safety, by turning some bugs into errors. -# Without `errexit` you don’t need ! and can replace -# PIPESTATUS with a simple $?, but I don’t do that. -set -o errexit -o pipefail -o noclobber -o nounset - -# -allow a command to fail with !’s side effect on errexit -# -use return value from ${PIPESTATUS[0]}, because ! hosed $? -! getopt --test > /dev/null -if [[ ${PIPESTATUS[0]} -ne 4 ]]; then - echo 'I’m sorry, `getopt --test` failed in this environment.' - exit 1 -fi - -OPTIONS=p -LONGOPTS=purdue - -# -regarding ! and PIPESTATUS see above -# -temporarily store output to be able to check for errors -# -activate quoting/enhanced mode (e.g. by writing out “--options”) -# -pass arguments only via -- "$@" to separate them correctly -! PARSED=$(getopt --options=$OPTIONS --longoptions=$LONGOPTS --name "$0" -- "$@") -if [[ ${PIPESTATUS[0]} -ne 0 ]]; then - # e.g. return value is 1 - # then getopt has complained about wrong arguments to stdout - exit 2 -fi -# read getopt’s output this way to handle the quoting right: -eval set -- "$PARSED" - -p=n -# now enjoy the options in order and nicely split until we see -- -while true; do - case "$1" in - -p|--purdue) - p=y - shift - ;; - --) - shift - break - ;; - *) - echo "Programming error" - exit 3 - ;; - esac -done - -# handle non-option arguments -# if [[ $# -ne 1 ]]; then -# echo "$0: A single input file is required." 
-# exit 4 -# fi - -for file in "$@" -do - RECORDS_READ=`wc -l $file | awk '{print $1}'` - - awk -v PFLAG="$p" -v RECORDS_READ="$RECORDS_READ" -F'|' 'BEGIN{ - print RECORDS_READ" RECORDS READ\n----------------"; - }{ - - if ($8 != "") { - donor_total_by_name[$8] += $15; - } - most_common_donor_by_state[$10]++; - donor_total_by_state[$10] += $15; - - # see if "purdue" appears in line - if (PFLAG == "y") { - has_purdue = match(tolower($0), /purdue/) - if (has_purdue != 0) { - purdue_total_by_name[$8] += $15; - } - } - - }END{ - PROCINFO["sorted_in"] = "@val_num_desc"; - print "File: "FILENAME; - - ct=0; - - for (i in donor_total_by_name) { - if (ct < 1) { - print "Largest donor: " i; - ct++; - } - }; - - ct=0; - - for (i in most_common_donor_by_state) { - if (ct < 1) { - print "Most common donor state: " i; - ct++; - } - } - - if (PFLAG == "y") { - print "Purdue donors:"; - for (i in purdue_total_by_name) { - print "\t- " i ": " purdue_total_by_name[i]; - } - } - - print "Total donations in USD by state:"; - - for (i in donor_total_by_state) { - if (i != "STATE" && i != "") { - print "\t- " i ": " donor_total_by_state[i]; - } - } - - print "----------------\n"; - - }' $file -done ----- - -Please copy and paste this code into a new script called `firstname-lastname-q4.sh` and run it. - -[source,ipython] ----- -%%bash - -$HOME/firstname-lastname-q4.sh -p /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -[IMPORTANT] -==== -Originally, this project was a bit more involved than intended. Instead of writing this script from scratch, I would like you to fill in the parts of the script with the text FIXME, and then test out the script with the commands provided. -==== - -Your manager liked that new feature, however, she thinks the tool would be better suited to search the `EMPLOYER` column for a specific string, and then handle this generically, rather than just handling the specific case of Purdue. - -Modify your script from question (4). Accept one and only one flag `-e` or `--employer`. This flag should take a string as an argument, and then search the `EMPLOYER` column for that string. Then, the script will print out the results. Only include the top 5 donors from an employer. The following is an example if we chose to search for "ford". - -[source,bash] ----- -$HOME/firstname-lastname-q5.sh -e'ford' /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt ----- - -.... -155 RECORDS READ ----------------- -File: /depot/datamine/data/election/itcont1990.txt -Largest donor: ASARO, SALVATORE -Most common donor state: NY -ford donors: -- John Smith: 500 -- Alice Bob: 1000 -Total donations in USD by state: -- NY: 100000 -- CA: 50000 -... ----------------- - -120 RECORDS READ ----------------- -File: /depot/datamine/data/election/itcont2000.txt -Largest donor: ASARO, SALVATORE -Most common donor state: NY -ford donors: -- John Smith: 500 -- Alice Bob: 1000 -Total donations in USD by state: -- NY: 100000 -- CA: 50000 -... ----------------- -.... - -[source,bash] ----- -#!/bin/bash - -# More safety, by turning some bugs into errors. -# Without `errexit` you don’t need ! and can replace -# PIPESTATUS with a simple $?, but I don’t do that. -set -o errexit -o pipefail -o noclobber -o nounset - -# -allow a command to fail with !’s side effect on errexit -# -use return value from ${PIPESTATUS[0]}, because ! 
hosed $? -! getopt --test > /dev/null -if [[ ${PIPESTATUS[0]} -ne 4 ]]; then - echo 'I’m sorry, `getopt --test` failed in this environment.' - exit 1 -fi - -OPTIONS=e: -LONGOPTS=employer: - -# -regarding ! and PIPESTATUS see above -# -temporarily store output to be able to check for errors -# -activate quoting/enhanced mode (e.g. by writing out “--options”) -# -pass arguments only via -- "$@" to separate them correctly -! PARSED=$(getopt --options=$OPTIONS --longoptions=$LONGOPTS --name "$0" -- "$@") -if [[ ${PIPESTATUS[0]} -ne 0 ]]; then - # e.g. return value is 1 - # then getopt has complained about wrong arguments to stdout - exit 2 -fi -# read getopt’s output this way to handle the quoting right: -eval set -- "$PARSED" - -e=- -# now enjoy the options in order and nicely split until we see -- -while true; do - case "$1" in - -e|--employer) - e="$2" - shift 2 - ;; - --) - shift - break - ;; - *) - echo "Programming error" - exit 3 - ;; - esac -done - -# handle non-option arguments -# if [[ $# -ne 1 ]]; then -# echo "$0: A single input file is required." -# exit 4 -# fi - -for file in "$@" -do - RECORDS_READ=`wc -l $file | awk '{print $1}'` - - awk -v EFLAG="$FIXME" -v RECORDS_READ="$RECORDS_READ" -F'|' 'BEGIN{ <1> - print RECORDS_READ" RECORDS READ\n----------------"; - } - { - - if ($8 != "") { - donor_total_by_name[$8] += $15; - } - most_common_donor_by_state[$10]++; - donor_total_by_state[$10] += $15; - - # see if search string appears in line - if (EFLAG != "") { - has_string = match(tolower($12), EFLAG) - if (has_string != 0) { - employer_total_by_name[$8] += $15; - } - } - - }END{ - PROCINFO["sorted_in"] = "@val_num_desc"; - print "File: "FILENAME; - - ct=0; - - for (i in donor_total_by_name) { - if (ct < 1) { - print "Largest donor: " i; - ct++; - } - }; - - ct=0; - - for (i in most_common_donor_by_state) { - if (ct < 1) { - print "Most common donor state: " i; - ct++; - } - } - - ct=0; - - if (EFLAG != "") { - print EFLAG" donors:"; - for (i in FIXME) { <2> - if (ct < 5) { - print "\t- " i ": " FIXME[i]; <3> - FIXME; <4> - } - } - } - - print "Total donations in USD by state:"; - - for (i in donor_total_by_state) { - if (i != "STATE" && i != "") { - print "\t- " i ": " donor_total_by_state[i]; - } - } - - print "----------------\n"; - - }' $file -done ----- - -<1> We should put "$something" here -- check out how we handle this is question (4) and look at the changes it question (5) to help isolate what goes here. -<2> What are we looping through here? All you need to do is change it to the only remaining `awk` array we haven't looped through in the rest of the code. -<3> Now we want to access the _value_ of the array -- it would make sense if it were the same array as the previous FIXME, right?! -<4> Without this code, we will print ALL of the donors -- not just the first 5. - -Then test it out! - -[source,ipython] ----- -%%bash - -$HOME/firstname-lastname-q5.sh -e'ford' /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project07.adoc deleted file mode 100644 index f6a50bf4c..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project07.adoc +++ /dev/null @@ -1,341 +0,0 @@ -= STAT 29000: Project 7 -- Fall 2021 -:page-mathjax: true - -== Bashing out liquor sales data - -**Motivation:** A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. `awk` is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks. - -**Context:** This is the second project in a series of projects focused on bash _and_ `awk`. Here, we take a deeper dive and create some more complicated awk scripts, as well as utilize the bash skills learned in previous projects. - -**Scope:** bash, `awk`, bash scripts, R, Python - -.Learning Objectives -**** -- Use awk to process and manipulate textual data. -- Use piping and redirection within the terminal to pass around data between utilities. -- Write bash scripts to automate potential repeated tasks. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ghsf9s2n?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_gwjid9sj?wid=_983291"></iframe> -++++ - -You may have noticed that the "Store Location" column (8th column) contains latitude and longitude coordinates. That is some rich data that could be fun and useful. - -The data will look something like the following: - ----- -1013 MAINKEOKUK 52632(40.39978, -91.387531) ----- - -What this means is that you can't just parse out the latitude and longitude coordinates and call it a day -- you need to use `awk` functions like `gsub` and `split` to extract the latitude and longitude coordinates. - -Use `awk` to print out the latitude and longitude for each line in the original dataset. Output should resemble the following. - ----- -lat,lon -1.23,4.56 ----- - -[NOTE] -==== -Make sure to take care of rows that don't have latitude and longitude coordinates -- just skip them. So if your results look like this, you need to add logic to skip the "empty" rows: - ----- -40.39978, -91.387531 -40.739238, -95.02756 -40.624226, -91.373211 -, -41.985887, -92.579244 ----- - -To do this, just go ahead and wrap your print in an if statement similar to: - -[source,awk] ----- -if (length(coords[1]) > ) { - print coords[1]";"coords[2] -} ----- -==== - -[TIP] -==== -`split` and `gsub` will be useful `awk` functions to use for this question. 
-==== - -[TIP] -==== -If we have a bunch of data formatted like the following: - ----- -1013 MAINKEOKUK 52632(40.39978, -91.387531) ----- - -If we first used `split` to split on "(", for example like: - -[source,awk] ----- -split($8, coords, "(", seps); ----- - -`coords[2]` would be: - ----- -40.39978, -91.387531) ----- - -Then, you could use `gsub` to remove any ")" characters from `coords[2]` like: - -[source,awk] ----- -gsub(/\)/, "", coords[2]); ----- - -`coords[2]` would be: - ----- -40.39978, -91.387531 ----- - -At this point I'm sure you can see how to use `awk` to extract and print the rest! -==== - -[IMPORTANT] -==== -Don't forget any lingering space after the first comma! We don't want that. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_yoz50h21?wid=_983291"></iframe> -++++ - -Redo question (4) (and reproduce `sales_by_store.csv`) from project (5), but this time add 2 additional columns to the dataset -- `lat` and `lon`. - -- 'lat': latitude -- 'lon': longitude - -Before you panic (this was a tough question), we've provided the solution below as a starting point for you. - -[source,ipynb] ----- -%%bash - -awk -F';' 'BEGIN{ print "store_name;month_number;year;sold_usd;volume_sold" } - { - gsub(/\$/, "", $22); split($2, dates, "/", seps); - mysales[$4";"dates[1]";"dates[3]] += $22; - myvolumes[$4";"dates[1]";"dates[3]] += $24; - } - END{ - for (mytriple in mysales) - { - print mytriple";"mysales[mytriple]";"myvolumes[mytriple] - } - }' /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt > sales_by_store.csv ----- - -[CAUTION] -==== -It may take a few minutes to run this script. Grab a coffee, tea, or something else to keep you going. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Believe it or not, `awk` even supports geometric calculations like `sin` and `cos`. Write a bash script that, given a pair of latitude and pair of longitude, calculates the distance between the two points. - -Okay, so how to get started? To calculate this, we can use https://en.wikipedia.org/wiki/Haversine_formula[the Haversine formula]. The formula is: - -$2*r*arcsin(\sqrt{sin^2(\frac{\phi_2 - \phi_1}{2}) + cos(\phi_1)*cos(\phi_2)*sin^2(\frac{\lambda_2 - \lambda_1}{2})})$ - -Where: - -- $r$ is the radius of the Earth in kilometers, we can use: 6367.4447 kilometers -- $\phi_1$ and $\phi_2$ are the latitude coordinates of the two points -- $\lambda_1$ and $\lambda_2$ are the longitude coordinates of the two points - -In `awk`, `sin` is `sin`, `cos` is `cos`, and `sqrt` is `sqrt`. 
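-
-If you want to quickly convince yourself that these built-ins behave as expected, a small sanity check (purely illustrative, not part of the solution) can be run in a bash cell. It should print values close to 0, 1, 1.41421, and 3.14159:
-
-[source,bash]
-----
-# sin(0), cos(0), sqrt(2), and atan2(1, 1)*4 (an approximation of pi)
-awk 'BEGIN{ print sin(0), cos(0), sqrt(2), atan2(1, 1)*4 }'
-----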
- -To get the `arcsin` use the following `awk` function: - -[source,awk] ----- -function arcsin(x) { return atan2(x, sqrt(1-x*x)) } ----- - -To convert from degrees to radians, use the following `awk` function: - -[source,awk] ----- -function dtor(x) { return x*atan2(0, -1)/180 } ----- - -The following is how the script should work (with a real example you can test): - -[source,bash] ----- -./question3.sh 40.39978 -91.387531 40.739238 -95.02756 ----- - -.Results ----- -309.57 ----- - -[TIP] -==== -To include functions in your `awk` command, do as follows: - -[source,bash] ----- -awk -v lat1=$1 -v lat2=$3 -v lon1=$2 -v lon2=$4 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }BEGIN{ - lat1 = dtor(lat1); - print lat1; - # rest of your code here! -}' ----- -==== - -[TIP] -==== -We want you to create a bash script called `question3.sh`. After you have your bash script, we want you to run it in a bash cell to see the output. - -The following is some skeleton code that you can use to get started. - -[source,bash] ----- -#!/bin/bash - -lat1=$1 -lat2=$3 -lon1=$2 -lon2=$4 - -awk -v lat1=$1 -v lat2=$3 -v lon1=$2 -v lon2=$4 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }BEGIN{ - lat1 = dtor(lat1); - print lat1; - # rest of your code here! -}' ----- -==== - -[TIP] -==== -You may need to give your script execute permissions like this. - -[source,bash] ----- -chmod +x /path/to/question3.sh ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Create a new bash script called `question4.sh` that accepts a latitude, longitude, filename, and n. - -The latitude and longitude are a point that we want to calculate the distance from. - -The filename is `sales_by_store.csv` -- our resulting dataset from question 3. - -Finally, n is the number of stores from our `sales_by_store.csv` file that we want to calculate the distance from the provided longitude and latitude. - -[source, bash] ----- -./question4.sh 40.39978 -91.387531 sales_by_store.csv 3 ----- - -.Output ----- -Distance from (40.39978,-91.387531) -store_name,distance -The Music Station,253.915 -KUM & GO #4 / LAMONI,213.455 -KUM & GO #4 / LAMONI,213.447 ----- - -To get you started, you can use the following "starter" code. Fix the code to work: - -[source,bash] ----- -#!/bin/bash - -lat_from=$1 -lon_from=$2 -file=$3 -n=$4 - -awk -F';' -v n=$n -v lat_from=$lat_from -v lon_from=$lon_from 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }function distance(lat1, lon1, lat2, lon2) { - # question 2 code here <1> - return dist; -}BEGIN { - print "Distance from ("lat_from","lon_from")" - print "store_name,distance"; -} NR>1 && NR <= n+1 { - lat2 = FIXME; <2> - lon2 = FIXME; <3> - dist = distance(lat_from, lon_from, FIXME, FIXME); <4> - print $1","dist -}' $file ----- - -<1> Add your code from question 2 here and make sure your distance is stored in a variable called `dist` (which we return). -<2> Which value goes here? -<3> Which value goes here? -<4> Which values go here? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 (optional, 0 pts) - -Use your choice of Python or R, with our `sales_by_store.csv` to create a beautiful graphic mapping the latitudes and longitudes of the stores. 
If you want to, get creative and increase the size of the points on the map based on the number of sales. You could create a graphic for each month to see how sales change month-to-month. The options are limitless, get creative! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project08.adoc deleted file mode 100644 index 4bd86e052..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project08.adoc +++ /dev/null @@ -1,262 +0,0 @@ -= STAT 29000: Project 8 -- Fall 2021 - -**Motivation:** - -**Context:** This is the third and final part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and `awk`. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently. - -**Scope:** awk, bash scripts, R, Python - -.Learning Objectives -**** -- Use awk to process and manipulate textual data. -- Use piping and redirection within the terminal to pass around data between utilities. -- Write bash scripts to automate potential repeated tasks. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/taxi/*` - -== Questions - -[NOTE] -==== -This is the _last_ project based on bash and awk -- the rest are SQL. If you struggled or did not like the bash projects, you are not alone! This is frequently the most intimidating for students. Students tend to really like the SQL projects, so relief is soon to come. -==== - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_y9z5wli4?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_7gvv1bi0?wid=_983291"></iframe> -++++ - -Take some time to explore `/depot/datamine/data/taxi/**`, and answer the following questions using UNIX utilities. - -- In which two directories is the bulk of the data (except `fhv` -- we don't care about that data for now)? -- What is the total size in Gb of the data in those two directories? - -[NOTE] -==== -So for example do all the files in `dir1` have the same number of columns for every row? Do the files in `dir2` have the same number of columns for every row? -==== - -[TIP] -==== -Check out the PDFs in the directory to learn more about the dataset. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_4atk2z25?wid=_983291"></iframe> -++++ - -To start, let's focus on the yellow taxi data. The `Total_Amt` column is the total cost of the taxi ride. It is broken down into 4 categories: `Fare_Amt`, `surcharge`, `Tip_Amt`, and `Tolls_Amt`. - -Write a bash script called `question2.sh` that accepts a path to a yellow taxi data file as an argument, and returns a breakdown of the overall percentage each of the 4 categories make up of the total. - -.Example output ----- -fares: 5.0% -surcharges: 2.5% -tips: 2.5% -tolls: 90.0% ----- - -To help get you started, here is some skeleton code. - -[source,bash] ----- -#!/bin/bash - -awk -F',' '{ - # calculate stuff - fares+=$13; -} END { - # print stuff -}' $1 ----- - -[IMPORTANT] -==== -Make sure your output format matches this example exactly. Every value should be with 1 decimal place followed by a percentage sign. -==== - -[CAUTION] -==== -It may take a minute to run. You are processing 2.5G of data! -==== - -[TIP] -==== -https://unix.stackexchange.com/questions/383378/awk-with-one-decimal-place[This] link may be useful. -==== - -[TIP] -==== -The result of the following. - -[source,ipynb] ----- -%%bash - -chmod +x ./question2.sh -./question2.sh /depot/datamine/data/taxi/yellow/yellow_tripdata_2009-01.csv ----- - -Should be: - ----- -fares: 92.6% -surcharges: 1.7% -tips: 4.5% -tolls: 1.1% ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Did you know `awk` has the ability to process multiple files at once? Pass multiple files to your script from question (2) to test it out. - -[source,bash] ----- -%%bash - -chmod +x ./question3.sh -./question3.sh /depot/datamine/data/taxi/yellow/yellow_tripdata_2009-01.csv /depot/datamine/data/taxi/yellow/yellow_tripdata_2009-02.csv ----- - -Now, modify your script from question (2). Return the summary values from question (2), but for each month instead of for the overall data. Use `Trip_Pickup_dateTime` to determine the month. - -.Example output -.... -January ----- -fares: 5.0% -surcharges: 2.5% -tips: 2.5% -tolls: 90.0% ----- - -February ----- -fares: 5.0% -surcharges: 2.5% -tips: 2.5% -tolls: 90.0% ----- - -etc.. -.... - -[IMPORTANT] -==== -You may will need to pass more than 1 file to your script in order to get more than 1 month of output. -==== - -To help get you started, you can find some skeleton code below. - -[source,bash] ----- -#!/bin/bash - -awk -F',' 'BEGIN{ - months[1] = "January" - months[2] = "February" - months[3] = "March" - months[4] = "April" - months[5] = "May" - months[6] = "June" - months[7] = "July" - months[8] = "August" - months[9] = "September" - months[10] = "October" - months[11] = "November" - months[12] = "December" -} NR > 1 { - # use split to parse out the month - - # convert the month to int - month = int(); - - # sum values by month using awk array - -} END { - for (m in total) { - if (m != 0) { - # print stuff - } - } -}' $@ ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -Pick 1 of the 2 following questions to answer. If you would like to answer both, your instructors and graders will be wow'd and happy (no pressure)! - -To be clear, however, you only need to answer 1 of the following 2 questions in order to get full credit. 
-==== - -=== Question 4 (Option 1) - -There are a lot of interesting questions that you could ask for this dataset. Here are some questions that could be interesting: - -- Does time of day, day of week, or month of year appear to have an effect on tips? -- Are people indeed more generous (with tips) near Christmas? -- How many trips are there, by hour of day? What are the rush hours? -- Do different vendors charge more or less than other vendors? - -Either choose a provided question, or write your own. Use your newfound knowledges of UNIX utilities and bash scripts to answer the question. Include the question you want answered, what, if any, hypotheses you have, what the data told you, and what you conclude (anecdotally). - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 (Option 2) - -Standard UNIX utilities are not the end-all be-all to terminal tools. https://github.com/ibraheemdev/modern-unix[this repository] has a lot of really useful tools that tend to have an opinionated take on a classic UNIX tool. - -https://github.com/BurntSushi/ripgrep[ripgrep] is the poster child of this new generation of tools. It is a text search utility that is empirically superior in the majority of metrics (to `grep`). Additionally, it has subjectively better defaults. You can read (in _great_ detail) about ripgrep https://blog.burntsushi.net/ripgrep/[here]. - -In addition to those tools, there is https://github.com/BurntSushi/xsv[xsv from the same developer as ripgrep]. `xsv` is a utility designed to perform operations on delimited separated value files. Many of the questions that have been asked about in the previous few projects could have been quickly and easily answered using `xsv`. - -Most of these utilities are available to you in a `bash` cell in Jupyter Lab. Choose 2 questions from previous projects and re-answer them using these modern tools. Which did you prefer, and why? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project09.adoc deleted file mode 100644 index 8d35c866c..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project09.adoc +++ /dev/null @@ -1,318 +0,0 @@ -= STAT 29000: Project 9 -- Fall 2021 - -**Motivation:** Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. In fact, https://cloudflare.com[Cloudflare], a billion dollar company, had much of its starting infrastructure built on top of a Postgresql database (per https://news.ycombinator.com/item?id=22878136[this thread on hackernews]). Learning SQL is well worth your time! - -**Context:** There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. 
As we've spent much of this semester in the terminal, we will start in the terminal using SQLite. - -**Scope:** SQL, sqlite - -.Learning Objectives -**** -- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet. -- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause. -- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/movies_and_tv/imdb.db` - -In addition, the following is an illustration of the database to help you understand the data. - -image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"] - -For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above. - -To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook. - -[source,ipython] ----- -%load_ext sql -%sql sqlite:////depot/datamine/data/movies_and_tv/imdb.db ----- - -The first command loads the sql extension. The second command connects to the database. - -For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells. - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_0uc68okg?wid=_983291"></iframe> -++++ - -Get started by taking a look at the available tables in the database. What tables are available? - -[TIP] -==== -You'll want to prepend `%%sql` to the top of the cell -- it should be the very first line of the cell (no comments or _anything_ else before it). - -[source,ipython] ----- -%%sql - --- Query here ----- -==== - -[TIP] -==== -In sqlite, you can show the tables using the following query: - -[source, sql] ----- -.tables ----- - -Unfortunately, sqlite-specific functions can't be run in a Jupyter Lab cell like that. Instead, we need to use a different query. - -[source, sql] ----- -SELECT tbl_name FROM sqlite_master where type='table'; ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_cwuc83d9?wid=_983291"></iframe> -++++ - -Its always a good idea to get an idea what your table(s) looks like. A good way to do this is to get the first 5 rows of data from the table. Write and run 6 queries that return the first 5 rows of data of each table. - -To get a better idea of the size of the data, you can use the `count` clause to get the number of rows in each table. Write an run 6 queries that returns the number of rows in each table. - -[TIP] -==== -Run each query in a separate cell, and remember to limit the query to return only 5 rows each. - -You can use the `limit` clause to limit the number of rows returned. 
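-
-For example, a query along these lines (shown here for the `titles` table) returns just its first 5 rows; swapping the column list for `COUNT(*)` returns the number of rows instead:
-
-[source,sql]
-----
-SELECT * FROM titles LIMIT 5;
-----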
-==== - -**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries], xref:book:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-the-first-5-rows-of-the-employees-table[useful example] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_qw35znbj?wid=_983291"></iframe> -++++ - -This dataset contains movie data from https://imdb.com (an Amazon company). As you can probably guess, it would be difficult to load the data from those tables into a nice, neat dataframe -- it would just take too much memory on most systems! - -Okay, let's dig into the `titles` table a little bit. Run the following query. - -[source, sql] ----- -SELECT * FROM titles LIMIT 5; ----- - -As you can see, every row has a `title_id` for the associated title of a movie or tv show (or other). What is this `title_id`? Check out the following link: - -https://www.imdb.com/title/tt0903747/ - -At this point, you may suspect that it is the id imdb uses to identify a movie or tv show. Well, let's see if that is true. Query our database to get any matching titles from the `titles` table matching the `title_id` provided in the link above. - -[TIP] -==== -The `where` clause can be used to filter the results of a query. -==== - -**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries], xref:book:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-only-employees-with-the-first-name-steve-or-last-name-laura[useful example] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_59g2knqk?wid=_983291"></iframe> -++++ - -That is pretty cool! Not only do you understand what the `title_id` means _inside_ the database -- but now you know that you can associate a web page with each `title_id` -- for example, if you run the following query, you will get a `title_id` for a "short" called "Carmencita". - -[source, sql] ----- -SELECT * FROM titles LIMIT 5; ----- - -.Output ----- -title_id, type, ... -tt0000001, short, ... ----- - -If you navigate to https://www.imdb.com/title/tt0000001/, sure enough, you'll see a neatly formatted page with data about the movie! - -Okay great. Now, if you take a look at the `episodes` table, you'll see that there are both an `episode_title_id` and `show_title_id` associated with each row. - -Let's try and make sense of this the same way we did before. Write a query using the `where` clause to find all rows in the `episodes` table where `episode_title_id` is `tt0903747`. What did you get? - -Now, write a query using the `where` clause to find all rows in the `episodes` table where `show_title_id` is `tt0903747`. What did you get? - -**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries], xref:book:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-only-employees-with-the-first-name-steve-or-last-name-laura[useful example] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_z9hiq9xv?wid=_983291"></iframe> -++++ - -Very interesting! 
It looks like we didn't get any results when we queried for `episode_title_id` with an id of `tt0903747`, but we did for `show_title_id`. This must mean these ids can represent both a _show_ as well as the _episode_ of a show. By that logic, we should be able to find the _title_ of one of the Breaking Bad episodes, in the same way we found the title of the show itself, right? - -Okay, take a look at the results of your second query from question (4). Choose one of the `episode_title_id` values, and query the `titles` table to find the title of that episode. - -Finally, in a browser, verify that the title of the episode is correct. To verify this, take the `episode_title_id` and plug it into the following link. - -https://www.imdb.com/title/<episode_title_id>/ - -So, I used `tt1232248` for my query. I would check to make sure it matches this. - -https://www.imdb.com/title/tt1232248/ - -**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries], xref:book:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-only-employees-with-the-first-name-steve-or-last-name-laura[useful example] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 6 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_swv17gx6?wid=_983291"></iframe> -++++ - -Okay, you should have now established that every _row_ in the `titles` table correlates to the title of a single episode of a tv show, the tv show itself, a movie, a short, or any other type of media that has a title! A single tv show, will have both a `title_id` for the name of the show itself, as well as a `title_id` for each individual episode. - -What if we wanted to get a list of episodes (_including_ the titles) for the show? Well, the _best_ way would probably be to use a _join_ statement -- but we are _just_ getting started, so we will skip that option (for now). - -Instead, we can use what is called a _subquery_. A _subquery_ is a query that is embedded inside another query. In this case, we are going to use a _subquery_ to find all the `episode_title_id` values for Breaking Bad, and use the `where` clause to filter our titles from our `titles` table where the `title_id` from the `titles` table is _in_ the result of our subquery. - -The following are some steps to help you figure this out. - -. Write a query that finds all the `episode_title_id` values for Breaking Bad. -+ -[TIP] -==== -We only need/want to keep the `episode_title_id` values, not the other fields like `show_title_id` or `season_number` or `episode_number`. -==== -+ -. Once you have your query, use it as a _subquery_ to find all the `title_id` values for Breaking Bad. -+ -[TIP] -==== -Here is the general "form" for this. - -[source, sql] ----- -SELECT _ FROM (SELECT _ FROM _ WHERE _) WHERE _; ----- - -Where the part surrounded by parentheses is the _subquery_. - -Of course, for this question, we just want to see if the `title_id` values are in the result of our subquery. For this, we can use the `in` operator. - -[source, sql] ----- -SELECT _ FROM _ WHERE _ IN (SELECT _ FROM _ WHERE_); ----- -==== - -When done correctly, you should get a list of all of the `titles` table data for every episode in Breaking Bad, cool! - -**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 7 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_3qc81cv5?wid=_983291"></iframe> -++++ - -Okay, this _subquery_ thing is pretty useful, and a _little_ confusing. How about we practice some more? - -Just like in question (6), get a list of the ratings from the `ratings` table for every episode of Breaking Bad. Sort the results from highest to lowest by `rating`. What was the `title_id` of the episode with the highest rating? What was the rating? - -**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 8 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_exz1uqmd?wid=_983291"></iframe> -++++ - -Write a query that finds a list of `person_id` values (and _just_ `person_id` values) for the episode of Breaking Bad with `title_id` of `tt2301451`. Use the `crew` table to do this. Limit your results to _actors_ only. - -**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 9 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_x4ifw9xd?wid=_983291"></iframe> -++++ - -Use the query from question (8) as a subquery to get the following output. - ----- -Name | Approximate Age ----- - -Use _aliases_ to rename the output. To calculate the approximate age, subtract the year the actor was born from 2021 -- that will be accurate for the majority of people. - -**Relevant topics:** xref:book:SQL:queries.adoc#examples[queries], xref:book:SQL:aliasing.adoc[aliasing] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project10.adoc deleted file mode 100644 index 3d7f679cf..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project10.adoc +++ /dev/null @@ -1,144 +0,0 @@ -= STAT 29000: Project 10 -- Fall 2021 - -**Motivation:** Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a "bread-and-butter" skill for anyone working with data. - -**Context:** We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like `AVG`, `MIN`, and `MAX`. - -**Scope:** SQL, sqlite - -.Learning Objectives -**** -- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet. -- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause. 
-- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc. -- Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/taxi/taxi_sample.db` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_30pbxb6h?wid=_983291"></iframe> -++++ - -In project (8), you used bash tools, including `awk`, to parse through large amounts of yellow taxi data from `/depot/datamine/data/taxi/`. Of course, calculating things like the mean is not too difficult using `awk`, and `awk` _is_ extremely fast and efficient, BUT SQL is better for some of the work we attempted to do in project (8). - -Don't take my word on it! We've placed a sample of 5 of the data files for the yellow taxi cab into an SQLite database called `taxi_sample.db`. This database contains, among other things, the `yellow` table (for yellow taxi cab data). - -Write a query that will return the `fare_amount`, `surcharge`, `tip_amount`, and `tolls_amount` as a percentage of `total_amount`. - -Now, take into consideration that this query will be evaluating these percentages for 5 of the data files, not just the first file or so. Wow, impressive! - -[TIP] -==== -Use the `sum` aggregate function to calculate the totals, and division to figure out the percentages. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_m1hugc29?wid=_983291"></iframe> -++++ - -Check out the `payment_type` column. Write a query that counts the number of each type of `payment_type`. The end result should print something like the following. - -.Output sample ----- -payment_type, count -CASH, 123 ----- - -[TIP] -==== -You can use aliasing to control the output header names. -==== - -Write a query that sums the `total_amount` for `payment_type` of "CASH". What is the total amount of cash payments? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_jnuhc0tw?wid=_983291"></iframe> -++++ - -Write a query that gets the largest number of passengers in a single trip. How far was the trip? What was the total amount? Answer all of this in a single query. - -Whoa, there must be some erroneous data in the database! Not too surprising. Write a query that explores this more, explain what your query does and how it helps you understand what is going on. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_yhb5nx17?wid=_983291"></iframe> -++++ - -Write a query that gets the average `total_amount` for each year in the database. Which year has the largest average `total_amount`? 
Use the `pickup_datetime` column to determine the year. - -[TIP] -==== -Read https://www.sqlite.org/lang_datefunc.html[this] page and look at the strftime function. -==== - -[TIP] -==== -If you want the headers to be more descriptive, you can use aliases. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -What percent of data in our database has information on the _location_ of pickup and dropoff? Examine the data, to see if there is a pattern to the rows _with_ that information and _without_ that information. - -[TIP] -==== -There _is_ a distinct pattern. Pay attention to the date and time of the data. -==== - -Confirm your hypothesis with the original data set(s) (in `/depot/datamine/data/taxi/yellow/*.csv`), using bash. This doesn't have to be anything more thorough than running a simple `head` command with a 1-2 sentence explanation. - -[TIP] -==== -Of course, there will probably be some erroneous data for the latitude and longitude columns. However, you could use the `avg` function on a latitude or longitude column, by _year_ to maybe get a pattern. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project11.adoc deleted file mode 100644 index 0adb5a950..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project11.adoc +++ /dev/null @@ -1,270 +0,0 @@ -= STAT 29000: Project 11 -- Fall 2021 - -**Motivation:** Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like `MIN`, `MAX`, and `AVG` in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values. - -**Context:** We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values! - -**Scope:** SQL, SQL in R - -.Learning Objectives -**** -- Demonstrate the ability to interact with popular database management systems within R. -- Solve data-driven problems using a combination of SQL and R. -- Basic clauses: SELECT, ORDER BY, LIMIT, DESC, ASC, COUNT, WHERE, FROM, etc. -- Showcase the ability to filter, alias, and write subqueries. -- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. Explain when to use having, and when to use where. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/movies_and_tv/imdb.db` - -In addition, the following is an illustration of the database to help you understand the data. 
- -image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"] - -For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above. - -To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook. - -[source,ipython] ----- -%load_ext sql -%sql sqlite:////depot/datamine/data/movies_and_tv/imdb.db ----- - -The first command loads the sql extension. The second command connects to the database. - -For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells. - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_xpynxtod?wid=_983291"></iframe> -++++ - -Let's say we are interested in the Marvel Cinematic Universe (MCU). We could write the following query to get the titles of all the movies in the MCU (at least, available in our database). - -[source, sql] ----- -SELECT premiered, COUNT(*) FROM titles WHERE title_id IN ('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286') GROUP BY premiered; ----- - -The result would be a perfectly good-looking table. Now, with that being said, are the headers good-looking? I don't know about you, but `COUNT(*)` as a header is pretty bad looking. xref:book:SQL:aliasing.adoc[Aliasing] is a great way to not only make the headers look good, but it can also be used to reduce the text in a query by giving some intermediate results a shorter name. - -Fix the query so that the headers are `year` and `movie count`, respectively. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ntm1v4gy?wid=_983291"></iframe> -++++ - -Okay, let's say we are interested in modifying our query from question (1) to get the _percentage_ of MCU movies released in each year. Essentially, we want the count for each group, divided by the total count of all the movies in the MCU. - -We can achieve this using a _subquery_. A subquery is a query that is used to get a smaller result set from a larger result set. - -Write a query that returns the total count of the movies in the MCU, and then use it as a subquery to get the percentage of MCU movies released in each year. - -[TIP] -==== -You do _not_ need to change the query from question (1), rather, you just need to _add_ to the query. -==== - -[TIP] -==== -You can directly divide `COUNT(*)` from the original query by the subquery to get the result! -==== - -[IMPORTANT] -==== -Your initial result may seem _very_ wrong (no fractions at all!) this is OK -- we will fix this in the next question. -==== - -[IMPORTANT] -==== -Use aliasing to rename the new column to `percentage`. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_hmm1il25?wid=_983291"></iframe> -++++ - -Okay, if you did question (2) correctly, you should have got a result that looks a lot like: - -.Output ----- -year,movie count,percentage -2008, 2, 0 -2010, 1, 0 -2011, 2, 0 -... ----- - -What is going on? - -The `AS` keyword can _also_ be used to _cast_ types. Some of you may or may not be familiar with a feature of many programming languages. Common in many programming languages is an "integer" type -- which is for numeric data _without_ a decimal place, and a "float" type -- which is for numeric data _with_ a decimal place. In _many_ languages, if you were to do the following, you'd get what _may_ be unexpected output. - -[source,c] ----- -9/4 ----- - -.Output ----- -2 ----- - -Since both of the values are integers, the result will truncate the decimal place. In other words, the result will be 2, instead of 2.25. - -In Python, they've made changes so this doesn't happen. - -[source,python] ----- -9/4 ----- - -.Output ----- -2.25 ----- - -However, if we want the "regular" functionality we can use the `//` operator. - -[source,python] ----- -9//4 ----- - -.Output ----- -2 ----- - -Okay, sqlite does this as well. - -[source, sql] ----- -SELECT 9/4 as result; ----- - -.Output ----- -result -2 ----- - -_This_ is why we are getting 0's for the percentage column! - -How do we fix this? The following is an example. - -[source, sql] ----- -SELECT CAST(9 AS real)/4 as result; ----- - -.Output ----- -result -2.25 ----- - -[NOTE] -==== -Here, "real" represents "float" or "double" -- it is another way of saying a number with a decimal place. -==== - -[IMPORTANT] -==== -When you do arithmetic with an integer and a real/float, the result will be a real/float. -==== - -Fix the query so that the results look something like: - -.Output ----- -year, movie count, percentage -2008, 2, 0.0689... -2010, 1, 0.034482... -2011, 2, 0.0689... ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_h6a8oq93?wid=_983291"></iframe> -++++ - -You now know 2 different applications of the `AS` keyword, and you also know how to use a query as a subquery, great! - -In the previous project, we were introduced to aggregate functions. We used the GROUP BY clause to group our results by the `premiered` column in this project too! We know we can use the `WHERE` clause to filter our results, but what if we wanted to filter our results based on an aggregated column? - -Modify our query from question (3) to print only the rows where the `movie count` is greater than 2. - -[TIP] -==== -See https://www.geeksforgeeks.org/having-vs-where-clause-in-sql/[this article] for more information on the `HAVING` and `WHERE` clauses. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_umyor89l?wid=_983291"></iframe> -++++ - -Write a query that returns the average number of words in the `primary_title` column, by year, and only for years where the average number of words in the `primary_title` is less than 3. 
- -Look at the results. Which year had the lowest average number of words in the `primary_title` column (no need to write another query for this, just eyeball it)? - -[TIP] -==== -See https://stackoverflow.com/questions/3293790/query-to-count-words-sqlite-3[here]. Replace "@String" with the column you want to count the words in. -==== - -[TIP] -==== -If you got it right, there should be 15 rows in the output. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project12.adoc deleted file mode 100644 index a8fba9cd5..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project12.adoc +++ /dev/null @@ -1,143 +0,0 @@ -= STAT 29000: Project 12 -- Fall 2021 - -**Motivation:** Databases are (usually) comprised of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so we perform "joins"! In this project we will explore learn about and practice using joins on our imdb database, as it has many tables where the benefit of joins is obvious. - -**Context:** We've introduced a variety of SQL commands that let you filter and extract information from a database in an systematic way. In this project we will introduce joins, a powerful method to combine data from different tables. - -**Scope:** SQL, sqlite, joins - -.Learning Objectives -**** -- Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem. -- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. -- Showcase the ability to filter, alias, and write subqueries. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/movies_and_tv/imdb.db` - -In addition, the following is an illustration of the database to help you understand the data. - -image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"] - -For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above. - -To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook. - -[source,ipython] ----- -%load_ext sql -%sql sqlite:////depot/datamine/data/movies_and_tv/imdb.db ----- - -The first command loads the sql extension. The second command connects to the database. - -For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells. 
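For example, once the two setup commands above have been run, a cell like the following (just a quick sanity check -- any small query against the database would do) should return the first few rows of the `titles` table.

[source,ipython]
----
%%sql

SELECT * FROM titles LIMIT 5;
----

If the cell errors out, double check that the connection cell at the top of the notebook ran successfully before moving on to the questions.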
- -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_3i402ywl?wid=_983291"></iframe> -++++ - -In the previous project, we provided you with a query to get the number of MCU movies that premiered in each year. - -Now that we are learning about _joins_, we have the ability to make much more interesting queries! - -Use the provided list of `title_id` values to get a list of the MCU movie `primary_title` values, `premiered` values, and rating (from the provided list of MCU movies). - -Which movie had the highest rating? Modify your query to return only the 5 highest and 5 lowest rated movies (again, from the MCU list). - -.List of MCU title_ids ----- -('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286') ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_wa3tnqx0?wid=_983291"></iframe> -++++ - -Run the following query. - -[source,ipython] ----- -%%sql - -SELECT * FROM titles WHERE title_id IN ('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286'); ----- - -Pay close attention to the movies in the output. You will notice there are movies presented in this query that are (likely) not in the query results you got for question (1). - -Write a query that returns the `primary_title` of those movies _not_ shown in the result of question (1) but that _are_ shown in the result of the query above. You can use the query in question (1) as a subquery to answer this. - -Can you notice a pattern to said movies? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_nq2bfeu1?wid=_983291"></iframe> -++++ - -In the previous questions we explored what is _actually_ the difference between an INNER JOIN, and a LEFT JOIN. It is likely you used an INNER JOIN/JOIN in your solution to question (1). As a result, the MCU movies that did not yet have a rating in IMDB are not shown in the output of question (1). - -Modify your query from question (1) so that it returns a list of _all_ MCU movies with their associated rating, regardless of whether or not the movie has a rating. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -In the previous project, question (5) asked you to write a query that returns the average number of words in the `primary_title` column, by year, and only for years where the average number of words in the `primary_title` is less than 3. 
- -Okay, great. What would be more interesting would be to see the average number of words in the `primary_title` column for titles with a rating of 8.5 or higher. Write a query to do that. How many words on average does a title with 8.5 or higher rating have? - -Write another query that does the same for titles with < 8.5 rating. Is the average title length notably different? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -We have a fun database, and you've learned a new trick (joins). Use your newfound knowledge to write a query that uses joins to accomplish a task you couldn't previously (easily) tackle, and answers a question you are interested in. - -Explain what your query does, and talk about the results. Explain why you chose either a LEFT join or INNER join. - -.Items to submit -==== -- A written question about the movies/tv shows in the database. -- Code used to solve this problem. -- Output from running the code. -- Explanation of the results, what your query does, and why you chose either a LEFT or INNER join. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project13.adoc deleted file mode 100644 index 2f668bb06..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-project13.adoc +++ /dev/null @@ -1,325 +0,0 @@ -= STAT 29000: Project 13 -- Fall 2021 - -**Motivation:** In the previous projects, you've gained experience writing all types of queries, touching on the majority of the main concepts. One critical concept that we _haven't_ yet done is creating your _own_ database. While typically database administrators and engineers will typically be in charge of large production databases, it is likely that you may need to prop up a small development database for your own use at some point in time (and _many_ of you have had to do so this year!). In this project, we will walk through all of the steps to prop up a simple sqlite database for one of our datasets. - -**Context:** This is the final project for the semester, and we will be walking through the useful skill of creating a database and populating it with data. We will (mostly) be using the [sqlite3](https://www.sqlite.org/) command line tool to interact with the database. - -**Scope:** sql, sqlite, unix - -.Learning Objectives -**** -- Create a sqlite database schema. -- Populate the database with data using `INSERT` statements. -- Populate the database with data using the command line interface (CLI) for sqlite3. -- Run queries on a database. -- Create an index to speed up queries. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/flights/subset/2007.csv` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_hzwmd21k?wid=_983291"></iframe> -++++ - -First thing is first, create a new Jupyter Notebook called `firstname-lastname-project13.ipynb`. You will put the text of your solutions in this notebook. Next, in Jupyter Lab, open a fresh terminal window. We will be able to run the `sqlite3` command line tool from the terminal window. - -Okay, once completed, the first step is schema creation. First, it is important to note. **The goal of this project is to put the data in `/depot/datamine/data/flights/subset/2007.csv` into a sqlite database we will call `firstname-lastname-project13.db`.** - -With that in mind, run the following (in your terminal) to get a sample of the data. - -[source,bash] ----- -head /depot/datamine/data/flights/subset/2007.csv ----- - -You _should_ receive a result like: - -.Output ----- -Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay -2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0 -2007,1,1,1,1918,1905,2043,2035,WN,462,N370,85,90,74,8,13,SMF,PDX,479,5,6,0,,0,0,0,0,0,0 -2007,1,1,1,2206,2130,2334,2300,WN,1229,N685,88,90,73,34,36,SMF,PDX,479,6,9,0,,0,3,0,0,0,31 -2007,1,1,1,1230,1200,1356,1330,WN,1355,N364,86,90,75,26,30,SMF,PDX,479,3,8,0,,0,23,0,0,0,3 -2007,1,1,1,831,830,957,1000,WN,2278,N480,86,90,74,-3,1,SMF,PDX,479,3,9,0,,0,0,0,0,0,0 -2007,1,1,1,1430,1420,1553,1550,WN,2386,N611SW,83,90,74,3,10,SMF,PDX,479,2,7,0,,0,0,0,0,0,0 -2007,1,1,1,1936,1840,2217,2130,WN,409,N482,101,110,89,47,56,SMF,PHX,647,5,7,0,,0,46,0,0,0,1 -2007,1,1,1,944,935,1223,1225,WN,1131,N749SW,99,110,86,-2,9,SMF,PHX,647,4,9,0,,0,0,0,0,0,0 -2007,1,1,1,1537,1450,1819,1735,WN,1212,N451,102,105,90,44,47,SMF,PHX,647,5,7,0,,0,20,0,0,0,24 ----- - -An SQL schema is a set of text or code that defines how the database is structured and how each piece of data is stored. In a lot of ways it is similar to how a data.frame has columns with different types -- just more "set in stone" than the very easily changed data.frame. - -Each database handles schemas slightly differently. In sqlite, the database will contain a single schema table that describes all included tables, indexes, triggers, views, etc. Specifically, each entry in the `sqlite_schema` table will contain the type, name, tbl_name, rootpage, and sql for the database object. - -[NOTE] -==== -For sqlite, the "database object" could refer to a table, index, view, or trigger. -==== - -This detail is more than is needed for right now. If you are interested in learning more, the sqlite documentation is very good, and the relevant page to read about this is https://www.sqlite.org/schematab.html[here]. - -For _our_ purposes, when I refer to "schema", what I _really_ mean is the set of commands that will build our tables, indexes, views, and triggers. sqlite makes it particularly easy to open up a sqlite database and get the _exact_ commands to build the database from scratch _without_ the data itself. 
For example, take a look at our `imdb.db` database by running the following in your terminal. - -[source,bash] ----- -module use /scratch/brown/kamstut/tdm/opt/modulefiles -module load sqlite/3.36.0 - -sqlite3 /depot/datamine/data/movies_and_tv/imdb.db ----- - -This will open the command line interface (CLI) for sqlite3. It will look similar to: - -[source,bash] ----- -sqlite> ----- - -Type `.schema` to see the "schema" for the database. - -[NOTE] -==== -Any command you run in the sqlite CLI that starts with a dot (`.`) is called a "dot command". A dot command is exclusive to sqlite and the same functionality cannot be expected to be available in other SQL tools like Postgresql, MariaDB, or MS SQL. You can list all of the dot commands by typing `.help`. -==== - -After running `.schema`, you should see a variety of legitimate SQL commands that will create the structure of your database _without_ the data itself. This is an extremely useful self-documenting tool that is particularly useful. - -Okay, great. Now, let's study the sample of our `2007.csv` dataset. Create a markdown list of key:value pairs for each column in the dataset. Each _key_ should be the title of the column, and each _value_ should be the _type_ of data that is stored in that column. - -For example: - -- Year: INTEGER - -Where the _value_ is one of the 5 "affinity types" (INTEGER, TEXT, BLOB, REAL, NUMERIC) in sqlite. See section "3.1.1" https://www.sqlite.org/datatype3.html[here]. - -Okay, you may be asking, "what is the difference between INTEGER, REAL, and NUMERIC?". Great question. In general (for other SQL RDBMSs), there are _approximate_ numeric data types and _exact_ numeric data types. What you are most familiar with is the _approximate_ numeric data types. In R or Python for example, try running the following: - -[source,r] ----- -(3 - 2.9) <= 0.1 ----- - -.Output ----- -FALSE ----- - -[source,python] ----- -(3 - 2.9) <= 0.1 ----- - -.Output ----- -False ----- - -Under the hood, the values are stored as a very close approximation of the real value. This small amount of error is referred to as floating point error. There are some instances where it is _critical_ that values are stored as exact values (for example, in finance). In those cases, you would need to use special data types to handle it. In sqlite, this type is NUMERIC. So, for _our_ example, store text as TEXT, numbers _without_ decimal places as INTEGER, and numbers with decimal places as REAL -- our example dataset doesn't have a need for NUMERIC. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_0ara5yu0?wid=_983291"></iframe> -++++ - -Okay, great! At this point in time you should have a list of key:value pairs with the column name and the data type, for each column. Now, let's put together our `CREATE TABLE` statement that will create our table in the database. - -See https://www.sqlitetutorial.net/sqlite-create-table/[here] for some good examples. Realize that the `CREATE TABLE` statement is not so different from any other query in SQL, and although it looks messy and complicated, it is not so bad. Name your table `flights`. - -Once you've written your `CREATE TABLE` statement, copy and paste it into the sqlite CLI. Upon success, you should see the statement printed when running the dot command `.schema`. Fantastic! 
You can also verify that the table exists by running the dot command `.tables`. - -Congratulations! To finish things off, please paste the `CREATE TABLE` statement into a markdown cell in your notebook. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_b54otj5w?wid=_983291"></iframe> -++++ - -The next step in the project is to add the data! After all, it _is_ a _data_base. - -To insert data into a table _is_ a bit cumbersome. For example, let's say we wanted to add the following row to our `flights` table. - -.Data to add ----- -Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay -2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0 ----- - -The SQL way would be to run the following query. - -[source, sql] ----- -INSERT INTO flights (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay) VALUES (2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0); ----- - -NOT ideal -- especially since we have over 7 million rows to add! You could programmatically generate a `.sql` file with the `INSERT INTO` statement, hook the database up with Python or R and insert the data that way, _or_ you could use the wonderful dot commands sqlite already provides. - -You may find https://stackoverflow.com/questions/13587314/sqlite3-import-csv-exclude-skip-header[this post] very helpful. - -[WARNING] -==== -You want to make sure you _don't_ include the header line twice! If you included the header line twice, you can verify by running the following in the sqlite CLI. - -[source,bash] ----- -.header on -SELECT * FROM flights LIMIT 2; ----- - -The `.header on` dot command will print the header line for every query you run. If you have double entered the header line, it will appear twice. Once for the `.header on` and another time because that is the first row of your dataset. -==== - -Connect to your database in your Jupyter notebook and run a query to get the first 5 rows of your table. - -[TIP] -==== -To connect to your database: - -[source,ipython] ----- -%load_ext sql -%sql sqlite:////home/PURDUEALIAS/flights.db ----- - -Assuming `flights.db` is in your home directory, and you change PURDUEALIAS to your alias, for example `mdw` for Dr. Ward or `kamstut` for Kevin. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_cn9x277z?wid=_983291"></iframe> -++++ - -[IMPORTANT] -==== -For this question, please run take screenshots of your output from the terminal and add them to your notebook using a markdown cell. To do so, let's say you have an image called `my_image.png` in your $HOME directory. 
All you need to do is run the following command in a markdown cell: - -[source,ipython] ----- -![](/home/PURDUEALIAS/my_image.png) ----- - -Be sure to replace PURDUEALIAS with your alias. -==== - -Woohoo! You've successfully created a database and populated it with data from a dataset -- pretty cool! Now, run the following dot command in order to _time_ our queries: `.timer on`. This will print out the time it takes to run each query. For example, try the following: - -[source, sql] ----- -SELECT * FROM flights LIMIT 5; ----- - -Cool! Time the following query. - -[source, sql] ----- -SELECT * FROM flights ORDER BY DepTime LIMIT 1000; ----- - -.Output ----- -Run Time: real 1.824 user 0.836007 sys 0.605384 ----- - -That is pretty quick, but if (for some odd reason) there were going to be a lot of queries that searched on exact departure times, this could be a big waste of time when done at scale. What can we do to improve this? Add and index! - -Run the following query. - -[source, sql] ----- -EXPLAIN QUERY PLAN SELECT * FROM flights WHERE DepTime = 1232; ----- - -The output will indicate that the "plan" is to simply scan the entire table. This has a runtime of O(n), which means the speed is linear to the number of values in the table. If we had 1 million rows and it takes 1 second. If we get to a billion rows, it will take 16 minutes! An _index_ is a data structure that will let us reduce the runtime to O(log(n)). This means if we had 1 million rows and it takes 1 second, if we had 1 billion rows, it would take only 3 seconds. _Much_ more efficient! So what is the catch here? Space. - -Leave the sqlite CLI by running `.quit`. Now, see how much space your `flights.db` file is using. - -[source,bash] ----- -ls -la $HOME/flights.db ----- - -.Output ----- -571M ----- - -Okay, _after_ I add an index on the `DepTime` column, the file is now `653M` -- while that isn't a _huge_ difference, it would certainly be significant if we scaled up the size of our database. In this case, another drawback would be the insert time. Inserting new data into the database would force the database to have to _update_ the indexes. This can add a _lot_ of time. These are just tradeoffs to consider when you're working with a database. - -In this case, we don't care about the extra bit of space -- create an index on the `DepTime` column. https://medium.com/@JasonWyatt/squeezing-performance-from-sqlite-indexes-indexes-c4e175f3c346[This article] is a nice easy read that covers this in more detail. - -Great! Once you've created your index, run the following query. - -[source, sql] ----- -SELECT * FROM flights ORDER BY DepTime LIMIT 1000; ----- - -.Output ----- -Run Time: real 0.263 user 0.014261 sys 0.032923 ----- - -Wow! That is some _serious_ improvement. What does the "plan" look like? - -[source, sql] ----- -EXPLAIN QUERY PLAN SELECT * FROM flights WHERE DepTime = 1232; ----- - -You'll notice the "plan" shows it will utilize the index to speed the query up. Great! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -We hope that this project has given you a small glimpse into the "other side" of databases. Now, write a query that uses one or more other columns. Time the query, then, create a _new_ index to speed the query up. Time the query _after_ creating the index. Did it work well? - -Document the steps of this problem just like you did for question (4). 
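[TIP]
====
If the `CREATE INDEX` syntax itself is new to you, the following is a minimal sketch of the overall workflow -- the `Origin` column, the `'SMF'` value, and the index name `idx_flights_origin` are purely illustrative choices, so substitute whichever column(s) your own query actually filters on.

[source, sql]
----
-- with .timer on still enabled, check the plan and the timing first
EXPLAIN QUERY PLAN SELECT * FROM flights WHERE Origin = 'SMF';
SELECT * FROM flights WHERE Origin = 'SMF' LIMIT 1000;

-- create the index (the name is arbitrary)
CREATE INDEX idx_flights_origin ON flights (Origin);

-- re-run both; the plan should now mention the index, and the query should be faster
EXPLAIN QUERY PLAN SELECT * FROM flights WHERE Origin = 'SMF';
SELECT * FROM flights WHERE Origin = 'SMF' LIMIT 1000;
----

Keep in mind the tradeoffs discussed in question (4): the index will make the database file larger and future inserts slower.
====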
- -**Optional challenge:** Try to make your query utilize 2 columns and create an index on both columns to see if you can get a speedup. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-projects.adoc deleted file mode 100644 index f129981b1..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/29000/29000-f2021-projects.adoc +++ /dev/null @@ -1,59 +0,0 @@ -= STAT 29000 - -== Project links - -[NOTE] -==== -Only the best 10 of 13 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -* xref:fall2021/29000/29000-f2021-officehours.adoc[STAT 29000 Office Hours for Fall 2021] -* xref:fall2021/29000/29000-f2021-project01.adoc[Project 1: Review: How to use Jupyter Lab] -* xref:fall2021/29000/29000-f2021-project02.adoc[Project 2: Navigating UNIX: part I] -* xref:fall2021/29000/29000-f2021-project03.adoc[Project 3: Navigating UNIX: part II] -* xref:fall2021/29000/29000-f2021-project04.adoc[Project 4: Pattern matching in UNIX & R] -* xref:fall2021/29000/29000-f2021-project05.adoc[Project 5: `awk` & bash scripts: part I] -* xref:fall2021/29000/29000-f2021-project06.adoc[Project 6: `awk` & bash scripts: part II] -* xref:fall2021/29000/29000-f2021-project07.adoc[Project 7: `awk` & bash scripts: part III] -* xref:fall2021/29000/29000-f2021-project08.adoc[Project 8: `awk` & bash scripts: part IV] -* xref:fall2021/29000/29000-f2021-project09.adoc[Project 9: SQL: part I -- Introduction to SQL] -* xref:fall2021/29000/29000-f2021-project10.adoc[Project 10: SQL: part II -- SQL in R] -* xref:fall2021/29000/29000-f2021-project11.adoc[Project 11: SQL: part III -- SQL comparison] -* xref:fall2021/29000/29000-f2021-project12.adoc[Project 12: SQL: part IV -- Joins] -* xref:fall2021/29000/29000-f2021-project13.adoc[Project 13: SQL: part V -- Review] - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:55pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. After submitting your project in Gradescope, you will be able to download the project to verify that the content you submitted is what the graders will see. You will **not** get credit for or be able to re-submit your work if you accidentally uploaded the wrong project, or anything else. It is your responsibility to ensure that you are uploading the correct content. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. 
**Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -=== Sign up - -https://piazza.com/purdue/fall2021/stat29000 - -=== Link - -https://piazza.com/purdue/fall2021/stat29000/home - -== Syllabus - -++++ -include::book:ROOT:partial$syllabus.adoc[] -++++ - -== Office hour schedule - -++++ -include::book:ROOT:partial$office-hour-schedule.adoc[] -++++ \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project01.adoc deleted file mode 100644 index 3514f4a64..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project01.adoc +++ /dev/null @@ -1,207 +0,0 @@ -= STAT 39000: Project 1 -- Fall 2021 - -== Mark~it~down, your first project back in The Data Mine - -**Motivation:** It's been a long summer! Last year, you got some exposure command line tools, SQL, Python, and other fun topics like web scraping. This semester, we will continue to work primarily using Python _with_ data. Topics will include things like: documentation using tools like sphinx, or pdoc, writing tests, sharing Python code using tools like pipenv, poetry, and git, interacting with and writing APIs, as well as containerization. Of course, like nearly every other project, we will be be wrestling with data the entire time. - -We will start slowly, however, by learning about Jupyter Lab. This year, instead of using RStudio Server, we will be using Jupyter Lab. In this project we will become familiar with the new environment, review some, and prepare for the rest of the semester. - -**Context:** This is the first project of the semester! We will start with some review, and set the "scene" to learn about a variety of useful and exciting topics. - -**Scope:** Jupyter Lab, R, Python, scholar, brown, markdown - -.Learning Objectives -**** -- Read about and understand computational resources available to you. -- Learn how to run R code in Jupyter Lab on Scholar and Brown. -- Review. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/` - -== Questions - -=== Question 1 - -In previous semesters, we've used a program called RStudio Server to run R code on Scholar and solve the projects. This year, we will be using Jupyter Lab almost exclusively. Let's being by launching your own private instance of Jupyter Lab using a small portion of the compute cluster. - -Navigate and login to https://ondemand.anvil.rcac.purdue.edu using 2-factor authentication (ACCESS login on Duo Mobile). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward. - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_dv46pmsw?wid=_983291"></iframe> -++++ - -[IMPORTANT] -==== -In the not-to-distant future, we will be using _both_ Scholar (https://gateway.scholar.rcac.purdue.edu) _and_ Brown (https://ondemand.brown.rcac.purdue.edu) to launch Jupyter Lab instances. For now, however, we will be using Brown. 
-==== - -Towards the middle of the top menu, there will be an item labeled btn:[My Interactive Sessions], click on btn:[My Interactive Sessions]. On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Datamine, Desktops, and GUIs. Under the Datamine section, you should see a button that says btn:[Jupyter Lab], click on btn:[Jupyter Lab]. - -If everything was successful, you should see a screen similar to the following. - -image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"] - -Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand uses SLURM to launch a job to run Jupyter Lab. This job has access to 1 CPU core and 3072 Mb of memory. It is OK to not understand what that means yet, we will learn more about this in STAT 39000. For the curious, however, if you were to open a terminal session in Scholar and/or Brown and run the following, you would see your job queued up. - -[source,bash] ----- -squeue -u username # replace 'username' with your username ----- - -After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on btn:[Connect to Jupyter] to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following. - -image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"] - -There are 2 primary options that you will need to know about. - -f2021-s2022:: -The course kernel where Python code is run without any extra work, and you have the ability to run R code or SQL queries in the same environment. - -[TIP] -==== -To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page]. -==== - -f2021-s2022-r:: -An alternative, native R kernel that you can use for projects with _just_ R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell. - -For now, let's focus on the f2021-s2022 kernel. Click on btn:[f2021-s2022], and a fresh notebook will be created for you. - -Test it out! Run the following code in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on. What is the name of the node you are running on? - -[source,python] ----- -import socket -print(socket.gethostname()) ----- - -[TIP] -==== -To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu. -==== - -.Items to submit -==== -- Code used to solve this problem in a "code" cell. -- Output from running the code (the name of the node you are running on). -==== - -=== Question 2 - -This year, the first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Scholar and Brown at `/depot/datamine/apps/templates/project_template.ipynb`). - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_5msf7x1z?wid=_983291"></iframe> -++++ - -Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`. 
- -There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain markdown text which you can render into nicely formatted text). How many cells of each type are there in this template by default? - -Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A. - -.Items to submit -==== -- How many of each types of cells are there in the default template? -==== - -=== Question 3 - -Last year, while using RStudio, you probably gained a certain amount of experience using RMarkdown -- a flavor of Markdown that allows you to embed and run code in Markdown. Jupyter Lab, while very different in many ways, still uses Markdown to add formatted text to a given notebook. It is well worth the small time investment to learn how to use Markdown, and create a neat and reproducible document. - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_r607ju5b?wid=_983291"></iframe> -++++ - -Create a Markdown cell in your notebook. Create both an _ordered_ and _unordered_ list. Create an unordered list with 3 of your favorite academic interests (some examples could include: machine learning, operating systems, forensic accounting, etc.). Create another _ordered_ list that ranks your academic interests in order of most-interested to least-interested. To practice markdown, **embolden** at least 1 item in you list, _italicize_ at least 1 item in your list, and make at least 1 item in your list formatted like `code`. - -[TIP] -==== -You can quickly get started with Markdown using this cheat sheet: https://www.markdownguide.org/cheat-sheet/ -==== - -[TIP] -==== -Don't forget to "run" your markdown cells by clicking the small "Play" button in the notebook menu. Running a markdown cell will render the text in the cell with all of the formatting you specified. Your unordered lists will be bulleted and your ordered lists will be numbered. -==== - -[TIP] -==== -If you are having trouble changing a cell due to the drop down menu behaving oddly, try changing browsers to Chrome or Safari. If you are a big Firefox fan, and don't want to do that, feel free to use the `%%markdown` magic to create a markdown cell without _really_ creating a markdown cell. Any cell that starts with `%%markdown` in the first line will generate markdown when run. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Browse https://www.linkedin.com and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown in a new Markdown cell. Include the following (at a minimum): - -- A header for this section (your choice of size) that says "About". -+ -[TIP] -==== -A Markdown header is a line of text at the top of a Markdown cell that begins with one or more `#`. -==== -+ -- The text of your personal "About" section that you would feel comfortable uploading to LinkedIn. -- In the about section, _for the sake of learning markdown_, include at least 1 link using Markdown's link syntax. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 5 - -Read xref:templates.adoc[the templates page] and learn how to run snippets of code in Jupyter Lab _other than_ Python. Run at least 1 example of Python, R, SQL, and bash. For SQL and bash, you can use the following snippets of code to make sure things are working properly. - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_crus3z0q?wid=_983291"></iframe> -++++ - -[source, sql] ----- --- Use the following sqlite database: /depot/datamine/data/movies_and_tv/imdb.db -SELECT * FROM titles LIMIT 5; ----- - -[source,bash] ----- -ls -la /depot/datamine/data/movies_and_tv/ ----- - -For your R and Python code, use this as an opportunity to review your skills. For each language, choose at least 1 dataset from `/depot/datamine/data`, and analyze it. Both solutions should include at least 1 custom function, and at least 1 graphic output. Make sure your code is complete, and well-commented. Include a markdown cell with your short analysis, for each language. - -[TIP] -==== -You could answer _any_ question you have about your dataset you want. This is an open question, just make sure you put in a good amount of effort. Low/no-effort solutions will not receive full credit. -==== - -[IMPORTANT] -==== -Once done, submit your projects just like last year. See the xref:submissions.adoc[submissions page] for more details. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentence analysis for each of your R and Python code examples. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project02.adoc deleted file mode 100644 index 7412774e4..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project02.adoc +++ /dev/null @@ -1,305 +0,0 @@ -= STAT 39000: Project 2 -- Fall 2021 - -== The (art?) of a docstring - -**Motivation:** Documentation is one of the most critical parts of a project. https://notion.so[There] https://guides.github.com/features/issues/[are] https://confluence.atlassian.com/alldoc/atlassian-documentation-32243719.html[so] https://docs.github.com/en/communities/documenting-your-project-with-wikis/about-wikis[many] https://www.gitbook.com/[tools] https://readthedocs.org/[that] https://bit.ai/[are] https://clickhelp.com[specifically] https://www.doxygen.nl/index.html[designed] https://www.sphinx-doc.org/en/master/[to] https://docs.python.org/3/library/pydoc.html[help] https://pdoc.dev[document] https://github.com/twisted/pydoctor[a] https://swagger.io/[project], and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can't go wrong with tools like https://www.sphinx-doc.org/en/master/[Sphinx], or https://pdoc.dev[pdoc]. - -**Context:** This is the first project in a 2-project series where we explore thoroughly documenting Python code, while solving data-driven problems. 
- -**Scope:** Python, documentation - -.Learning Objectives -**** -- Use Sphinx to document a set of Python code. -- Use pdoc to document a set of Python code. -- Write and use code that serializes and deserializes data. -- Learn the pros and cons of various serialization formats. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/apple/health/watch_dump.xml` - -== Questions - -The topics of this semester are outlined in the xref:book:projects:39000-f2021-projects.adoc[39000 home page]. In addition to those topics, there will be a slight emphasis on topics related to working with APIs. Each project this semester will continue to be data-driven, and will be based on the provided dataset(s). The dataset listed for this project will be one that is revisited throughout the semester, as we will be slowly building out functions, modules, tests, documentation, etc, that will come together towards the end of the semester. Of course, all projects that expect any sort of previous work will provide you with previous work in case you chose to skip any given project. - -In this project we will work with pdoc to build some simple documentation, review some Python skills that may be rusty, and learn about a serialization and deserialization of data -- a common component to many data science and computer science projects, and a key topics to understand when working with APIs. - -For the sake of clarity, this project will have more deliverables than the "standard" `.ipynb` notebook, `.py` file containing Python code, and PDF. In this project, we will ask you to submit an additional PDF showing the documentation webpage that you will have built by the end of the project. How to do this will be made clear in the given question. - -[WARNING] -==== -Make sure to select 4096 MB of RAM for this project. Otherwise you may get an issue reading the dataset in question 3. -==== - -=== Question 1 - -Let's start by navigating to https://ondemand.brown.rcac.purdue.edu, and launching a Jupyter Lab instance. In the previous project, you learned how to run various types of code in a Jupyter notebook (the `.ipynb` file). Jupyter Lab is actually _much_ more useful. You can open terminals on Brown (the cluster), as well as open a an editor for `.R` files, `.py` files, or any other text-based file. - -Give it a try. In the "Other" category in the Jupyter Lab home page, where you would normally select the "f2021-s2022" kernel, instead select the "Python File" option. Upon clicking the square, you will be presented with a file called `untitled.py`. Rename this file to `firstname-lastname-project02.py` (where `firstname` and `lastname` are your first and last name, respectively). - -[TIP] -==== -Make sure you are in your `$HOME` directory when clicking the "Python File" square. Otherwise you may get an error stating you do not have permissions to create the file. -==== - -Read the https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings["3.8.2 Modules" section] of Google's Python Style Guide. Each individual `.py` file is called a Python "module". It is good practice to include a module-level docstring at the top of each module. Create a module-level docstring for your new module. 
Rather than giving an explanation of the module, and usage examples, instead include a short description (in your own words, 3-4 sentences) of the terms "serialization" and "deserialization". In addition, list a few (at least 2) examples of different serialization formats, and include a brief description of the format, and some advantages and disadvantages of each. Lastly, if you could break all serialization formats into 2 broad categories, what would those categories be, and why? - -[TIP] -==== -Any good answer for the "2 broad categories" will be accepted. With that being said, a hint would be to think of what the **serialized** data _looks_ like (if you tried to open it in a text editor, for example), or how it is _read_. -==== - -Save your module. - -**Relevant topics:** xref:book:python:pdoc.adoc[pdoc], xref:book:python:sphinx.adoc[Sphinx], xref:book:python:docstrings-and-comments.adoc[Docstrings & Comments] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Now, in Jupyter Lab, open a new notebook using the "f2021-s2022" kernel (using the link:{attachmentsdir}/project_template.ipynb[course notebook template]). - -[TIP] -==== -You can have _both_ the Python file _and_ the notebook open in separate Jupyter Lab tabs for easier navigation. -==== - -Fill in a code cell for question 1 with a Python comment. - -[source,python] ----- -# See firstname-lastname-project02.py ----- - -For this question, read the xref:book:python:pdoc.adoc[pdoc section], and run a `bash` command to generate the documentation for your module that you created in the previous question, `firstname-lastname-project02.py`. So, everywhere in the example in the pdoc section where you see "mymodule.py" replace it with _your_ module's name -- `firstname-lastname-project02.py`. - -[TIP] -==== -Use the `-o` flag to specify the output directory -- I would _suggest_ making it somewhere in your `$HOME` directory to avoid permissions issues. -==== - -Once complete, on the left-hand side of the Jupyter Lab interface, navigate to your output directory. You should see something called `firstname-lastname-project02.html`. To view this file in your browser, right click on the file, and select btn:[Open in New Browser Tab]. A new browser tab should open with your freshly made documentation. Pretty cool! - -[IMPORTANT] -==== -Ignore the `index.html` file -- we are looking for the `firstname-lastname-project02.html` file. -==== - -[TIP] -==== -You _may_ have noticed that the docstrings are (partially) markdown-friendly. Try introducing some markdown formatting in your docstring for more appealing documentation. -==== - -[NOTE] -==== -At this stage, you have the ability to create a PDF based on the generated webpage (but you do not yet need to do so). To do so, click on menu:File[Print...> Destination > Save to PDF]. This may vary slightly from browser to browser, but it should be fairly straightforward. -==== - -**Relevant topics:** xref:book:python:pdoc.adoc[pdoc], xref:book:python:sphinx.adoc[Sphinx], xref:book:python:docstrings-and-comments.adoc[Docstrings & Comments] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -[NOTE] -==== -When I refer to "watch data" I just mean the dataset for this project. 
-==== - -Write a function to called `get_records_for_date` that accepts an `lxml` etree (of our watch data, via `etree.parse`), and a `datetime.date`, and returns a list of Record Elements, for a given date. Raise a `TypeError` if the date is not a `datetime.date`, or if the etree is not an `lxml.etree`. - -Use the https://google.github.io/styleguide/pyguide.html#383-functions-and-methods[Google Python Style Guide's "Functions and Methods" section] to write the docstring for this function. Be sure to include type annotations for the parameters and return value. - -Re-generate your documentation. How does the updated documentation look? You may notice that the formatting is pretty ugly and things like "Args" or "Returns" are not really formatted in a way that makes it easy to read. - -Use the `-d` flag to specify the format as "google", and re-generate your documentation. How does the updated documentation look? - -[TIP] -==== -The following code should help get you started. - -[source,python] ----- -import lxml.etree -from datetime import datetime, date - -# read in the watch data -tree = etree.parse('/depot/datamine/data/apple/health/watch_dump.xml') - -def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list[lxml.etree._Element]: - # docstring goes here - - # test if `tree` is an `lxml.etree._ElementTree`, and raise TypeError if not - - # test if `for_date` is a `datetime.date`, and raise TypeError if not - - # loop through the records in the watch data using the xpath expression `/HealthData/Record` - # how to see a record, in case you want to - print(lxml.etree.tostring(record)) - - # test if the record's `startDate` is the same as `for_date`, and append to a list if it is - - # return the list of records - -# how to test this function -chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date() -my_records = get_records_for_date(tree, chosen_date) ----- -==== - -[TIP] -==== -The following is some code that will be helpful to test the types. - -[source,python] ----- -from datetime import datetime, date - -isinstance(some_date_object, date) # test if some_date_object is a date -isinstance(some_xml_tree_object, lxml.etree._ElementTree) # test if some_xml_tree_object is an lxml.etree._ElementTree ----- -==== - -[TIP] -==== -To loop through records, you can use the `xpath` method. - -[source,python] ----- -for record in tree.xpath('/HealthData/Record'): - # do something with record ----- -==== - -Add this function to your `firstname-lastname-project02.py` file, and if you want, regenerate your new documentation that includes your new function. - -**Relevant topics:** xref:book:python:pdoc.adoc[pdoc], xref:book:python:sphinx.adoc[Sphinx], xref:book:python:docstrings-and-comments.adoc[Docstrings & Comments] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Great! Now, write a function called `to_msgpack`, that accepts an `lxml` Element, and an absolute path to the desired output file, checks to make sure it contains the following keys: `type`, `sourceVersion`, `unit`, and `value`, and encodes/serializes, then saves the result to the specified file. - -[TIP] -==== -The following code should help get you started. 
- -[source,python] ----- -import msgpack - -def to_msgpack(element: lxml.etree._Element, file: str) -> None: - # docstring goes here - - # test if `file` is a `str`, and raise TypeError if not - - # test if `element` is a `lxml.etree._Element`, and raise TypeError if not - - # convert `element.attrib` into a dict - - # test if the dict contains the keys `type`, `sourceVersion`, `unit`, and `value`, and raise ValueError if not - - # remove "other" non-type/sourceVersion/unit/value keys from the dict - - # use msgpack library to serialize the dict to a msgpack file - -# how to use this function -chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date() -my_records = get_records_for_date(tree, chosen_date) -to_msgpack(my_records[0], '$HOME/my_records.msgpack') ----- -==== - -[IMPORTANT] -==== -`to_msgpack(my_records[0], '$HOME/my_records.msgpack')` may not work, depending on how you set up your function, you may need to use an absolute path like `to_msgpack(my_records[0], '/home/kamstut/my_records.msgpack')`. -==== - -Then, write a function called `from_msgpack`, that accepts an absolute path to a serialized file, and returns an `lxml` Element. - -[TIP] -==== -The following code should help get you started. - -[source,python] ----- -def from_msgpack(file: str) -> lxml.etree._Element: - # docstring goes here - - # test if `file` is a `str`, and raise TypeError if not - - # deserialize the msgpack file into a dict - - # create new "Record" element - e = etree.Element('Record') - - # loop through keys and values in the dict - # and set the attributes of the new "Record" element - # NOTE: This assumed the dict is called "d" - for key, value in d.items(): - e.attrib[key] = str(value) - - # return the new "Record" element - -# how to use this function -print(lxml.etree.tostring(from_msgpack('$HOME/my_records.msgpack'))) ----- -==== - -[IMPORTANT] -==== -`print(lxml.etree.tostring(from_msgpack('$HOME/my_records.msgpack')))` may not work, depending on how you set up your function, you may need to use an absolute path like `print(lxml.etree.tostring(from_msgpack('/home/kamstut/my_records.msgpack')))`. -==== - -Add these functions to your `firstname-lastname-project02.py` file, and regenerate your documentation. You should see some great looking documentation with your new functions. - -**Relevant topics:** xref:book:python:pdoc.adoc[pdoc], xref:book:python:sphinx.adoc[Sphinx], xref:book:python:docstrings-and-comments.adoc[Docstrings & Comments] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -This was _hopefully_ a not-too-difficult project that gave you some exposure to tools in the Python ecosystem, as well as chipped away at any rust you may have had with writing Python code. - -Finally, investigate the https://pdoc.dev/docs/pdoc.html[official pdoc documentation], and make at least 2 changes/customizations to your module. Some examples are below -- feel free to get creative and do something with pdoc outside of this list of options: - -- Modify the module so you do not need to pass the `-d` flag in order to let pdoc know that you are using Google-style docstrings. -- Change the logo of the documentation to your own logo (or any logo you'd like). -- Add some math formulas and change the output accordingly. -- Edit and customize pdoc's jinja2 template (or CSS). 
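[TIP]
====
As a sketch of the first option in the list above: the pdoc documentation describes a module-level `__docformat__` attribute that declares the docstring style for a single module, which removes the need for the `-d` flag on the command line. Treat this as a starting point and double check it against the pdoc version installed on Brown.

[source,python]
----
"""Module docstring for firstname-lastname-project02.py goes here."""

# Tell pdoc that the docstrings in this module are written in the
# Google style, so the documentation can be generated without `-d google`.
__docformat__ = "google"
----
====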
- -**Relevant topics:** xref:book:python:pdoc.adoc[pdoc], xref:book:python:sphinx.adoc[Sphinx], xref:book:python:docstrings-and-comments.adoc[Docstrings & Comments] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project03.adoc deleted file mode 100644 index 61e717f26..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project03.adoc +++ /dev/null @@ -1,617 +0,0 @@ -= STAT 39000: Project 3 -- Fall 2021 - -== Thank yourself later and document now - -**Motivation:** Documentation is one of the most critical parts of a project. https://notion.so[There] https://guides.github.com/features/issues/[are] https://confluence.atlassian.com/alldoc/atlassian-documentation-32243719.html[so] https://docs.github.com/en/communities/documenting-your-project-with-wikis/about-wikis[many] https://www.gitbook.com/[tools] https://readthedocs.org/[that] https://bit.ai/[are] https://clickhelp.com[specifically] https://www.doxygen.nl/index.html[designed] https://www.sphinx-doc.org/en/master/[to] https://docs.python.org/3/library/pydoc.html[help] https://pdoc.dev[document] https://github.com/twisted/pydoctor[a] https://swagger.io/[project], and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can't go wrong with tools like https://www.sphinx-doc.org/en/master/[Sphinx], or https://pdoc.dev[pdoc]. - -**Context:** This is the second project in a 2-project series where we explore thoroughly documenting Python code, while solving data-driven problems. - -**Scope:** Python, documentation - -.Learning Objectives -**** -- Use Sphinx to document a set of Python code. -- Use pdoc to document a set of Python code. -- Write and use code that serializes and deserializes data. -- Learn the pros and cons of various serialization formats. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/apple/health/watch_dump.xml` - -== Questions - -In this project, we are going to use the most popular Python documentation generation tool, Sphinx, to generate documentation for the module we created in project (2). If you chose to skip project (2), the module, in its entirety, will be posted at the latest, this upcoming Monday. You do _not_ need that module to complete this project. Your module from project (2) does not need to be perfect for this project. - -Last project was more challenging than intended. This project will provide a bit of a reprieve, and _should_ (hopefully) be fun to mess around with. - -project_02_module.py -[source,python] ----- -"""This module is for project 2 for STAT 39000. 
- -**Serialization:** Serialization is the process of taking a set or subset of data and transforming it into a specific file format that is designed for transmission over a network, storage, or some other specific use-case. - -**Deserialization:** Deserialization is the opposite process from serialization where the serialized data is reverted back into its original form. - -The following are some common serialization formats: - -- JSON -- Bincode -- MessagePack -- YAML -- TOML -- Pickle -- BSON -- CBOR -- Parquet -- XML -- Protobuf - -**JSON:** One of the more wide-spread serialization formats, JSON has the advantages that it is human readable, and has a excellent set of optimized tools written to serialize and deserialize. In addition, it has first-rate support in browsers. A disadvantage is that it is not a fantastic format storage-wise (it takes up lots of space), and parsing large JSON files can use a lot of memory. - -**MessagePack:** MessagePack is a non-human-readable file format (binary) that is extremely fast to serialize and deserialize, and is extremely efficient space-wise. It has excellent tooling in many different languages. It is still not the *most* space efficient, or *fastest* to serialize/deserialize, and remains impossible to work with in its serialized form. - -Generally, each format is either *human-readable* or *not*. Human readable formats are able to be read by a human when opened up in a text editor, for example. Non human-readable formats are typically in some binary format and will look like random nonsense when opened in a text editor. -""" - -import lxml -import lxml.etree -from datetime import datetime, date - - -def my_function(a, b): - """ - >>> my_function(2, 3) - 6 - >>> my_function('a', 3) - 'aaa' - >>> my_function(1, 3) - 4 - """ - return a * b - - -def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list: - """ - Given an `lxml.etree` object and a `datetime.date` object, return a list of records - with the startDate equal to `for_date`. - - Args: - tree (lxml.etree): The watch_dump.xml file as an `lxml.etree` object. - for_date (datetime.date): The date for which returned records should have a startDate equal to. - - Raises: - TypeError: If `tree` is not an `lxml.etree` object. - TypeError: If `for_date` is not a `datetime.date` object. - - Returns: - list: A list of records with the startDate equal to `for_date`. - """ - - if not isinstance(tree, lxml.etree._ElementTree): - raise TypeError('tree must be an lxml.etree') - - if not isinstance(for_date, date): - raise TypeError('for_date must be a datetime.date') - - results = [] - for record in tree.xpath('/HealthData/Record'): - if for_date == datetime.strptime(record.attrib.get('startDate'), '%Y-%m-%d %X %z').date(): - results.append(record) - - return results - - -def from_msgpack(file: str) -> lxml.etree._Element: - """ - Given the absolute path a msgpack file, return the deserialized `lxml.Element` object. - - Args: - file (str): The absolute path of the msgpack file to deserialize. - - Raises: - TypeError: If `file` is not a `str`. - - Returns: - lxml.Element: The deserialized `lxml.Element` object. 
- """ - - if not isinstance(file, str): - raise TypeError('file must be a str') - - with open(file, 'rb') as f: - d = msgpack.load(f) - - e = etree.Element('Record') - for key, value in d.items(): - e.attrib[key] = str(value) - - return e - - -def to_msgpack(element: lxml.etree._Element, file: str) -> None: - """ - Given an `lxml.Element` object and a file path, serialize the `lxml.Element` object to - a msgpack file at the given file path. - - Args: - element (lxml.Element): The element to serialize. - file (str): The absolute path of the msgpack file to and save. - - Raises: - TypeError: If `file` is not a `str`. - TypeError: If `element` is not an `lxml.Element`. - - Returns: - None: None - """ - - if not isinstance(file, str): - raise TypeError('file must be a str') - - if not isinstance(element, lxml.etree._Element): - raise TypeError('element must be an lxml.Element') - - # Test if `type`, `sourceVersion`, `unit`, and `value` are present in the element. - d = dict(element.attrib) - if not d.get('type') or not d.get('sourceVersion') or not d.get('unit') or not d.get('value'): - raise ValueError('element must have all of the following keys: type, sourceVersion, unit, and value') - - # Remove "other" keys from the dict - keys_to_remove = [] - for key in d.keys(): - if key not in ['type', 'sourceVersion', 'unit', 'value']: - keys_to_remove.append(key) - - for key in keys_to_remove: - del d[key] - - with open(file, 'wb') as f: - msgpack.dump(d, f) - -if __name__ == '__main__': - import doctest - doctest.testmod() ----- - -=== Question 1 - -[IMPORTANT] -==== -Please use Firefox for this project. If you choose to use Chrome, the appearance of the documentation will be horrible. If you choose to use Chrome anyway, it is recommended that you change a setting in Chrome, temporarily, for this project, by typing (where you would normally put the URL): - ----- -chrome://flags ----- - -Then, search for "samesite". For "SameSite by default cookies", change from "Default" to "Disabled", and restart the browser. -==== - -- Create a new folder in your `$HOME` directory called `project3`. -- Create a new Jupyter notebook in that folder called `project3.ipynb`, based on the normal project template. -+ -[NOTE] -==== -The majority of this notebook will just contain a single `bash` cell with the commands used to re-generate the documentation. This is okay, and by design. The main deliverable for this project will end up being the PDF of the documentation's HTML page. -==== -+ -- Copy and paste the code from project (2)'s `firstname-lastname-project02.py` module into the `$HOME/project3` directory, you can rename this to be `firstname_lastname_project03.py`. -- In a `bash` cell in your Jupyter notebook, make sure you `cd` the `project3` folder, and run the following command: -+ -[source,bash] ----- -python -m sphinx.cmd.quickstart ./docs -q -p project3 -a "Kevin Amstutz" -v 1.0.0 --sep ----- -+ -[IMPORTANT] -==== -Please replace "Kevin Amstutz" with your own name. -==== -+ -[NOTE] -==== -What do each of these arguments do? Check out https://www.sphinx-doc.org/en/master/man/sphinx-quickstart.html[this page of the official documentation]. -==== - -You should be left with a newly created `docs` folder within your `project3` folder. Your structure should look something like the following. 
- -.project03 folder contents ----- -project03<1> -├── 39000_f2021_project03_solutions.ipynb<2> -├── docs<3> -│   ├── build <4> -│   ├── make.bat -│   ├── Makefile <5> -│   └── source <6> -│   ├── conf.py <7> -│   ├── index.rst <8> -│   ├── _static -│   └── _templates -└── kevin_amstutz_project03.py<9> - -5 directories, 6 files ----- - -<1> Our module (named `project03`) folder -<2> Your project notebook (probably named something like `firstname_lastname_project03.ipynb`) -<3> Your documentation folder -<4> Your empty build folder where generated documentation will be stored -<5> The Makefile used to run the commands that generate your documentation. Make the following changes: -+ -[source,bash] ----- -# replace -SPHINXOPTS ?= -SPHINXBUILD ?= sphinx-build -SOURCEDIR = source -BUILDDIR = build - -# with the following -SPHINXOPTS ?= -SPHINXBUILD ?= python -m sphinx.cmd.build -SOURCEDIR = source -BUILDDIR = build ----- -+ -<6> Your source folder. This folder contains all hand-typed documentation. -<7> Your conf.py file. This file contains the configuration for your documentation. Make the following changes: -+ -[source,python] ----- -# CHANGE THE FOLLOWING CONTENT FROM: - -# -- Path setup -------------------------------------------------------------- - -# If extensions (or modules to document with autodoc) are in another directory, -# add these directories to sys.path here. If the directory is relative to the -# documentation root, use os.path.abspath to make it absolute, like shown here. -# -# import os -# import sys -# sys.path.insert(0, os.path.abspath('.') - -# TO: - -# -- Path setup -------------------------------------------------------------- - -# If extensions (or modules to document with autodoc) are in another directory, -# add these directories to sys.path here. If the directory is relative to the -# documentation root, use os.path.abspath to make it absolute, like shown here. -# -import os -import sys -sys.path.insert(0, os.path.abspath('../..')) ----- -+ -<8> Your index.rst file. This file (and all files ending in `.rst`) is written in https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html[reStructuredText] -- a Markdown-like syntax. -<9> Your module. This is the module containing the code from the previous project, with nice, clean docstrings. - -Finally, with the modifications above having been made, run the following command in a `bash` cell in Jupyter notebook to generate your documentation. - -[source,bash] ----- -cd $HOME/project3/docs -make html ----- - -After complete, your module folders structure should look something like the following. 
- -.project03 folder contents ----- -project03 -├── 39000_f2021_project03_solutions.ipynb -├── docs -│   ├── build -│   │   ├── doctrees -│   │   │   ├── environment.pickle -│   │   │   └── index.doctree -│   │   └── html -│   │   ├── genindex.html -│   │   ├── index.html -│   │   ├── objects.inv -│   │   ├── search.html -│   │   ├── searchindex.js -│   │   ├── _sources -│   │   │   └── index.rst.txt -│   │   └── _static -│   │   ├── alabaster.css -│   │   ├── basic.css -│   │   ├── custom.css -│   │   ├── doctools.js -│   │   ├── documentation_options.js -│   │   ├── file.png -│   │   ├── jquery-3.5.1.js -│   │   ├── jquery.js -│   │   ├── language_data.js -│   │   ├── minus.png -│   │   ├── plus.png -│   │   ├── pygments.css -│   │   ├── searchtools.js -│   │   ├── underscore-1.13.1.js -│   │   └── underscore.js -│   ├── make.bat -│   ├── Makefile -│   └── source -│   ├── conf.py -│   ├── index.rst -│   ├── _static -│   └── _templates -└── kevin_amstutz_project03.py - -9 directories, 29 files ----- - -In the left-hand pane in the Jupyter Lab interface, navigate to `$HOME/project3/docs/build/html/`, and right click on the `index.html` file and choose btn:[Open in New Browser Tab]. You should now be able to see your documentation in a new tab. - -[IMPORTANT] -==== -Make sure you are able to generate the documentation before you proceed, otherwise, you will not be able to continue to modify, regenerate, and view your documentation. -==== - -.Items to submit -==== -- Code used to solve this problem (in 2 Jupyter `bash` cells). -==== - -=== Question 2 - -One of the most important documents in any package or project is the README.md file. This file is so important that version control companies like GitHub and GitLab will automatically display it below the repositories contents. This file contains things like instructions on how to install the packages, usage examples, lists of dependencies, license links, etc. Check out some popular GitHub repositories for projects like `numpy`, `pytorch`, or any other repository you've come across that you believe does a good job explaining the project. - -In the `docs/source` folder, create a new file called `README.rst`. Choose 3-5 of the following "types" of reStruturedText from the https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html[this webpage], and create a fake README. The content can be https://www.lipsum.com/[Lorem Ipsum] type of content as long as it demonstrates 3-5 of the types of reStruturedText. - -- Inline markup -- Lists and quote-like blocks -- Literal blocks -- Doctest blocks -- Tables -- Hyperlinks -- Sections -- Field lists -- Roles -- Images -- Footnotes -- Citations -- Etc. - -[IMPORTANT] -==== -Make sure to include at least 1 https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#sections[section]. This counts as 1 of your 3-5. -==== - -Once complete, add a reference to your README to the `index.rst` file. To add a reference to your `README.rst` file, open the `index.rst` file in an editor and add "README" as follows. - -.index.rst -[source,rst] ----- -.. project3 documentation master file, created by - sphinx-quickstart on Wed Sep 1 09:38:12 2021. - You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. - -Welcome to project3's documentation! -==================================== - -.. 
toctree:: - :maxdepth: 2 - :caption: Contents: - - README - -Indices and tables -================== - -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` ----- - -[IMPORTANT] -==== -Make sure "README" is aligned with ":caption:" -- it should be 3 spaces from the left before the "R" in "README". -==== - -In a new `bash` cell in your notebook, regenerate your documentation. Check out the resulting `index.html` page, and click on the links. Pretty great! - -.Items to submit -==== -- Code used to solve this problem. -- Screenshot or PDF labeled "question02_results". -==== - -=== Question 3 - -The `pdoc` package was specifically designed to generate documentation for Python modules using the docstrings _in_ the module. As you may have noticed, this is not "native" to Sphinx. - -Sphinx has https://www.sphinx-doc.org/en/master/usage/extensions/index.html[extensions]. One such extension is the https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html[autodoc] extension. This extension provides the same sort of functionality that `pdoc` provides natively. - -To use this extension, modify the `conf.py` file in the `docs/source` folder. - -[source,python] ----- -# -- General configuration --------------------------------------------------- - -# Add any Sphinx extension module names here, as strings. They can be -# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom -# ones. -extensions = [ - 'sphinx.ext.autodoc' -] ----- - -Next, update your `index.rst` file so autodoc knows which modules to extract data from. - -[source,rst] ----- -.. project3 documentation master file, created by - sphinx-quickstart on Wed Sep 1 09:38:12 2021. - You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. - -Welcome to project3's documentation! -==================================== - -.. automodule:: firstname_lastname_project03 - :members: - -.. toctree:: - :maxdepth: 2 - :caption: Contents: - - README - -Indices and tables -================== - -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` ----- - -In a new `bash` cell in your notebook, regenerate your documentation. Check out the resulting `index.html` page, and click on the links. Not too bad! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Okay, while the documentation looks pretty good, clearly, Sphinx does _not_ recognize Google style docstrings. As you may have guessed, there is an extension for that. - -Add the `napoleon` extension to your `conf.py` file. - -[source,python] ----- -# -- General configuration --------------------------------------------------- - -# Add any Sphinx extension module names here, as strings. They can be -# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom -# ones. -extensions = [ - 'sphinx.ext.autodoc', - 'sphinx.ext.napoleon' -] ----- - -In a new `bash` cell in your notebook, regenerate your documentation. Check out the resulting `index.html` page, and click on the links. Much better! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 5 - -[WARNING] -==== -To make it explicitly clear what files to submit for this project: - -- `firstname_lastname_project03.py` -- `firstname_lastname_project03.ipynb` -- `firstname_lastname_project03.pdf` (result of exporting .ipynb to PDF) -- `firstname_lastname_project03_webpage.pdf` (result of printing documentation webpage to PDF) -==== - -At this stage, you should have a pretty nice set of documentation, with really nice in-code documentation in the form of docstrings. However, there is still another "thing" to add to your docstrings that can take them to the next level. - -`doctest` is a standard library tool that allows you to include code, with expected output _inside_ your docstring. Not only can this be nice for the user to see, but both `pdoc` and Sphinx applies special formatting to such additions to a docstring. - -Write a super simple function, it could be as simple as adding a couple of digits and returning a value. The following is an example. Come up with your own function with at least 1 passing test and 1 failing test (like the example). - -[source,python] ----- -def add(value1, value2): - """Function to add two values. - - The first example below will pass (because 1+1 is 2), the second will fail (because 1+2 is not 5) - - >>> add(1, 1) - 2 - - >>> add(1, 2) - 5 - """ - return value1 + value2 ----- - -Where ">>>" represents the Python REPL and code demonstrating how you would use the function, and the line immediately following is the expected output. - -[IMPORTANT] -==== -Make sure your function actually does something so you can test to see if it is working as intended or not. -==== - -To use doctest, add the following to the bottom of your `firstname_lastname_project03.py` file. - -[source,python] ----- -if __name__ == '__main__': - import doctest - doctest.testmod() ----- - -Now, in a new `bash` cell in your notebook, run the following command. - -[source,bash] ----- -python kevin_amstutz_project03.py -v ----- - -This will actually run your example code in the docstring and compare the output to the expected result! Very cool. We will learn more about this in the next couple of projects. - -[NOTE] -==== -When including the `-v` option, both passing _and_ failing tests will be printed. Without the `-v` option, only failling tests will be printed. -==== - -Now, regenerate your documentation again and check it out. Notice how the lines in the docstring are neatly formatted? Pretty great. - -Okay, last but not least, check out the themes https://sphinx-themes.org/[here], and choose one of the themes listed, regenerate your documentation, and save the webpage to a PDF for submission. Note that each theme may have slightly different requirements on how to "activate" it. For example, to use the "Readable" theme, you must add the following to your `conf.py` file. - -[source,python] ----- -import sphinx_readable_theme -html_theme = 'readable' -html_theme_path = [sphinx_readable_theme.get_html_theme_path()] ----- - -[TIP] -==== -You can change a theme by changing the value of `html_theme` in the `conf.py` file. -==== - -[TIP] -==== -If a theme doesn't work, just select a different theme. -==== - -[TIP] -==== -Unlike `pdoc` which only supports HTML output, Sphinx supports _many_ output formats, including PDF. If interested, feel free to use the following code to generate a PDF of your documentation. 
- -[source,bash] ----- -module load texlive/20200406 -python -m sphinx.cmd.build -M latexpdf $HOME/project3/docs/source $HOME/project3/docs/build ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project04.adoc deleted file mode 100644 index 5cc1b771b..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project04.adoc +++ /dev/null @@ -1,334 +0,0 @@ -= STAT 39000: Project 4 -- Fall 2021 - -== Write it. Test it. Change it. https://www.youtube.com/watch?v=7hPX_SresUM[Bop it?] - -**Motivation:** Code, especially newly written code, is refactored, updated, and improved frequently. It is for these reasons that testing code is imperative. Testing code is a good way to ensure that code is working as intended. When a change is made to code, you can run a suite a tests, and feel confident (or at least more confident) that the changes you made are not introducing new bugs. While methods of programming like TDD (test-driven development) are popular in some circles, and unpopular in others, what is agreed upon is that writing good tests is a useful skill and a good habit to have. - -**Context:** This is the first of a series of two projects that explore writing unit tests, and doc tests. In The Data Mine, we will focus on using `pytest`, doc tests, and `mypy`, while writing code to manipulate and work with data. - -**Scope:** Python, testing, pytest, mypy, doc tests - -.Learning Objectives -**** -- Write and run unit tests using `pytest`. -- Include and run doc tests in your docstrings, using `doctest`. -- Gain familiarity with `mypy`, and explain why static type checking can be useful. -- Comprehend what a function is, and the components of a function in Python. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/apple/health/2021/*` - -== Questions - -[WARNING] -==== -At the end of this project, you will need to submit the following: - -- A notebook (`.ipynb`) with the output of running your tests, and any other code. -- An updated `watch_data.py` file with the doctests you wrote, fixed function, and custom function. -- A `test_watch_data.py` file with the `pytest` tests you wrote. -==== - -=== Question 1 - -XPath expressions, while useful, have a very big limitation: the entire XML document must be read into memory. The is a problem for large XML documents. For example, to parse the `export.xml` file in the Apple Health data, takes nearly 7GB of memory when the file is only 980MB. 
- -.prof.py -[source,python] ----- -from memory_profiler import profile - -@profile -def main(): - import lxml.etree - - tree = lxml.etree.parse("/home/kamstut/apple_health_export/export.xml") - -if __name__ == '__main__': - main() ----- - -[source,bash] ----- -python -m memory_profiler prof.py ----- - -.Output ----- -Filename: prof.py - -Line # Mem usage Increment Occurences Line Contents -============================================================ - 3 36.5 MiB 36.5 MiB 1 @profile - 4 def main(): - 5 38.5 MiB 2.0 MiB 1 import lxml.etree - 6 - 7 6975.3 MiB 6936.8 MiB 1 tree = lxml.etree.parse("/home/kamstut/apple_health_export/export.xml") ----- - -This is a _very_ common problem, not just for reading XML files, but for dealing with larger dataset in general. You will not always have an abundance of memory to work with. - -To get around this issue, you will notice we take a _streaming_ approach, where only parts of a file are read into memory at a time, processed, and then freed. - -Copy our library from `/depot/datamine/data/apple/health/apple_watch_parser` and import it into your code cell for question 1. Examine the code and test out at least 2 of the methods or functions. - -[TIP] -==== -To copy the library run the following in a new cell. - -[source,ipython] ----- -%%bash - -cp -r /depot/datamine/data/apple/health/apple_watch_parser $HOME ----- - -To import and use the library, make sure your notebook (let's say `my_notebook.ipynb`) is in the same directory (the `$HOME` directory) as the `apple_watch_parser` directory. Then, you can import and use the library as follows. - -[source,python] ----- -from apple_watch_parser import watch_data - -dat = watch_data.WatchData("/depot/datamine/data/apple/health/2021") -print(dat) ----- -==== - -[TIP] -==== -You may be asking yourself "well, what does that `dat = watch_data.WatchData("/depot/datamine/data/apple/health/2021")` line do, other than let us print the `WatchData` object?" The answer is, access and utilize the methods _within_ the `WatchData` object using the dot notation. Any function _inside_ the `WatchData` class is called a _method_ and can be access using the dot notation. - -For example, if we had a function called `my_function` that was declared _inside_ the `WatchData` class, we would call it as follows: - -[source,python] ----- -from apple_watch_parser import watch_data - -dat = watch_data.WatchData("/depot/datamine/data/apple/health/2021") -dat.my_function(argument1, argument2) ----- - -Hopefully this is a good hint on how to use the dot notation to call methods in the `WatchData` class! -==== - -[TIP] -==== -If you run `help(watch_data.time_difference)`, you will get some nice info about the function including a note "Given two strings in the format matching the format in Apple Watch data: YYYY-MM-DD HH:MM:SS -XXXX". What does this mean? These are date/time format code (see https://strftime.org/[here]). - -Let's say you have a string `2018-05-21 04:35:49 -0500`, and you want to convert it to a datetime object. To do so you would run the following. - -[source,python] ----- -import datetime - -my_datetime_string = '2018-05-21 04:35:49 -0500' -my_datetime = datetime.datetime.strptime(my_datetime_string, '%Y-%m-%d %H:%M:%S %z') ----- - -The string '%Y-%m-%d %H:%M:%S %z' are format codes (see https://strftime.org/[here]). In order to convert from a string to a datetime object, you need to use a combination of format codes that _match_ the format of the string. In this case, the string is '2018-05-21 04:35:49 -0500'. 
The "2018" part matches "%Y" from the format codes. The "05" part matches "%m" from the format codes. The "21" part matches "%d" from the format codes. The "04" part matches "%H" from the format codes. The "35" part matches "%M" from the format codes. The "49" part matches "%S" from the format codes. The " -0500" part matches "%z" from the format codes. If your datetime string follows a different format, you would need to modify the combination of format codes to use so it matches your datetime string. - -Then, once you have a datetime object, you can do all sorts of fun things. The most obvious of which is converting the date back into a string, but formatting it exactly how you want. For example, lets say we dont want a string to have all the details '2018-05-21 04:35:49 -0500' has, and instead just want the month, day, and year using forward slashes instead of hyphens. - -[source,python] ----- -my_datetime.strftime('%m/%d/%Y') # '05/21/2018' ----- -==== - -.Items to submit -==== -- Code used to solve this problem -- code that imports and uses our library and at least 2 of the methods or functions. -- Output from running the code that uses 2 of the methods. -==== - -=== Question 2 - -As you may have noticed, the code contains fairly thorough docstrings. This is a good thing, and it is a good goal to aim for when writing your own Python functions, classes, modules, etc. - -In the previous project, you got a small taste of using `doctest` to test your code using in-comment code. This is a great way to test parts of your code that are simple, straightforward, and don't involve extra data or _fixtures_ in order to test. - -Examine the code, and determine which functions and/or methods are good candidates for doctests. Modify the docstrings to include at least 3 doctests each, and run the following to test them out! - -Include the following doctest in the `calculate_speed` function. This does _not_ count as 1 of your 3 doctests for this function. It _will_ fail for this question -- that is okay! - -[source,python] ----- ->>> calculate_speed(5.0, .55, output_distance_unit = 'm') -Traceback (most recent call last): - ... -ValueError: output_distance_unit must be 'mi' or 'km' ----- - -[IMPORTANT] -==== -Make sure to include the expected output of each doctest below each line starting with `>>>`. This means in the code chunk shown above, you should include the "Traceback", "...", and "ValueError" lines as the expected output. Literally just copy and paste that entire code chunk into the `calculate_speed` docstring. -==== - -[source,ipython] ----- -%%bash - -python $HOME/apple_watch_parser/watch_data.py -v ----- - -[TIP] -==== -If you need to read in data or type a lot in order to use a function or method, a doctest is probably not the right approach. Hint, hint, try the functions rather than methods. -==== - -[TIP] -==== -There are 2 _functions_ that are good candidates for doctests. -==== - -[TIP] -==== -Don't forget to add the following code to the bottom of `watch_data.py` so doctests will run properly. - -[source,python] ----- -if __name__ == '__main__': - import doctest - doctest.testmod() ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -In question 2, we wrote a doctest for the `calculate_speed` function. Figure out why the doctest fails, and make modifications to the function so it passes the doctest. Do _not_ modify the doctest. 
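[TIP]
====
The fix boils down to ordinary argument validation: reject any `output_distance_unit` that is not one of the two supported values before doing any conversion. Below is a generic sketch of that pattern -- the function and parameter names are illustrative only, not the actual `watch_data.py` code, and in your fix the error message must match the doctest exactly.

[source,python]
----
def _convert(distance_in_miles: float, unit: str = 'mi') -> float:
    """Convert a distance given in miles to the requested unit."""
    if unit not in ('mi', 'km'):
        # anything other than the two supported units is rejected up front
        raise ValueError("unit must be 'mi' or 'km'")
    if unit == 'km':
        return distance_in_miles * 1.609344  # 1 mile is 1.609344 kilometers
    return distance_in_miles
----
====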
- -[CAUTION] -==== -When you update the `calculate_speed` function, be sure to first save the `watch_data.py` file and then re-import the package so that way your modifications take effect. -==== - -[TIP] -==== -Remember we want you to change the `calculate_speed` function to pass the doctest -- not change the doctest to make it pass. -==== - -[TIP] -==== -The output of `calculate_speed(5.0, .55, output_distance_unit = 'm')` is `9.09090909090909`, but we _want_ it to be `ValueError: output_distance_unit must be 'mi' or 'km'` because 'm' isn't one of the two valid values, 'mi' or 'km'. Modify the `calculate_speed` function so it raises that error when the `output_distance_unit` parameter is not one of the two valid values. -==== - -[TIP] -==== -Look carefully at the `_convert_distance` helper function -- that is where you will want to make modifications. Your logic within each `distance_unit` if statement should be along the lines of: "Is the `output_distance_unit` parameter 'mi'? If so, convert and/or return this distance. Is it 'km'? If so, convert and/or return this distance. Otherwise, raise an error because `output_distance_unit` should only be 'mi' or 'km'." -==== - -To run the doctest: - -[source,ipython] ----- -%%bash - -python $HOME/apple_watch_parser/watch_data.py -v ----- - -This is what doctests are for! This helps you easily identify that something fundamental has changed and the code isn't ready for production. You can imagine a scenario where you automatically run all doctests automatically before releasing a new product, and having that system notify you when a test fails -- very cool! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -While doctests are good for simple testing, a package like `pytest` is better. For the stand alone functions, write at least 2 tests each using `pytest`. Make sure these tests test _different_ inputs than your doctests did -- its not hard to come up with lots of tests! - -[NOTE] -==== -This could end up being just 2 functions that run a total of 4 tests -- that is okay! As long as each function has at least 2 assert statements. -==== - -Start by adding a new file called `test_watch_data.py` to your `$HOME/apple_watch_parser` directory. Then, fill the file with your tests. When ready to test, run the following in a new cell. - -[source,ipython] ----- -%%bash - -cd $HOME/apple_watch_parser -python -m pytest ----- - -[NOTE] -==== -You may have noticed that we arbitrarily chose to place some functions _outside_ of our `WatchData` class, and others inside. There is no hard and fast rule to determine if a function belongs inside or outside of a class. In general, however, if a function is related to the class, and works with the attributes/data of the class, it should be inside the class. If the function has no relationship to the class, or could be useful using other types of data, it should be outside of the class. - -Of course, there are exceptions to this rule, and it is possible to write _static_ methods for a class, which operate independently of the class and its attributes. We chose to write the functions outside of the class, more for demonstration purposes than anything else. They are functions that would most likely not be useful in any other context, but sort of demonstrate the concept and allow us to have good functions to practice writing doctests and `pytest` tests _without_ fixtures. 
-==== - -In the following project, we will continue to learn about `pytest`, including some more advanced features, like fixtures. - -**Relevant topics:** xref:book:python:pytest.adoc[pytest] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Explore the data -- there is a lot! Think of a function that could be useful for this module that would live _outside_ of the `WatchData` class. Write the function. Include Google style docstrings, doctests (at least 2), and `pytest` tests (at least 2, _different_ from your doctests). Re-run both your `doctest` tests and `pytest` tests. - -[NOTE] -==== -You can simply add this function to your `watch_data.py` module, and run the tests just like you did for the previous questions! -==== - -[NOTE] -==== -Your function doesn't _need_ to be useful for data outside the `WatchData` class (you won't lose credit if it isn't really), but make an attempt! There are more types of elements and data that you can look at too other than just the `Workout` tags in the `export.xml` file. There is GPX data (xml data that can be used to map a workout route) in the `/depot/datamine/data/apple/health/2021/workout-routes/` directory. Lots of options! -==== - -[TIP] -==== -One way to peek around at the data (without having your notebook/kernel crash due to out of memory (OOM) errors) is something like the following: - -[source,python] ----- -from lxml import etree - -tree = etree.iterparse("/depot/datamine/data/apple/health/2021/export.xml") -ct = 0 -for event, element in tree: - if element.tag == 'Workout': - print(etree.tostring(element)) - ct += 1 - if ct > 100: - break - else: - element.clear() - -# to extract an element's attributes -element.attrib # dict-like object ----- -==== - -**Relevant topics:** xref:book:python:pytest.adoc[pytest], xref:book:data:html.adoc[html], xref:book:data:xml.adoc[xml] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project05.adoc deleted file mode 100644 index aa4f5275e..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project05.adoc +++ /dev/null @@ -1,362 +0,0 @@ -= STAT 39000: Project 5 -- Fall 2021 - -== Testing Python: part II - -**Motivation:** Code, especially newly written code, is refactored, updated, and improved frequently. It is for these reasons that testing code is imperative. Testing code is a good way to ensure that code is working as intended. When a change is made to code, you can run a suite a tests, and feel confident (or at least more confident) that the changes you made are not introducing new bugs. While methods of programming like TDD (test-driven development) are popular in some circles, and unpopular in others, what is agreed upon is that writing good tests is a useful skill and a good habit to have. - -**Context:** This is the second in a series of two projects that explore writing unit tests, and doc tests. 
In The Data Mine, we will focus on using `pytest`, and `mypy`, while writing code to manipulate and work with data. - -**Scope:** Python, testing, pytest, mypy - -.Learning Objectives -**** -- Write and run unit tests using `pytest`. -- Include and run doc tests in your docstrings, using `doctest`. -- Gain familiarity with `mypy`, and explain why static type checking can be useful. -- Comprehend what a function is, and the components of a function in Python. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/apple/health/2021/*` - -== Questions - -[WARNING] -==== -At this end of this project you will have only 3 files to submit: - -- `watch_data.py` -- `test_watch_data.py` -- `firstname-lastname-project05.ipynb` - -Make sure that the output from running the cells is displayed and saved in your notebook before submitting. -==== - -=== Question 1 - -First, setup your workspace in Brown. Create a new folder called `project05` in your `$HOME` directory. Make sure your `apple_watch_parser` package from the previous project is in your `$HOME/project05` directory. In addition, create your new notebook, `firstname-lastname-project05.ipynb` in your `$HOME/project05` directory. Great! - -[NOTE] -==== -An updated `apple_watch_parser` package will be made available for this project on Saturday, September 25, in `/depot/datamine/data/apple/health/apple_watch_parser02`. To copy this over to your `$HOME/project05` directory, run the following in a terminal on Brown. - -[source,bash] ----- -cp -R /depot/datamine/data/apple/health/apple_watch_parser02 $HOME/project05/apple_watch_parser ----- - -Note that this updated package is not required by any means to complete this project. -==== - -Now, in `$HOME/project05/apple_watch_parser/test_watch_data.py`, modify the `test_time_difference` function to use parametrizing to test the time difference between 100 sets of times. - -[TIP] -==== -The following is an example of how to generate a list of 100 datetimes 1 day and 1 hour apart. - -[source,python] ----- -import pytz -import datetime - -start_time = datetime.datetime.now(pytz.utc) -one_day = datetime.timedelta(days=1, hours=1) - -list_of_datetimes = [start_time+one_day*i for i in range(100)] ----- -==== - -[TIP] -==== -An example of how to convert a datetime to a string in the same format our `time_difference` function expects is below. - -[source,python] ----- -import pytz -import datetime - -my_datetime = datetime.datetime.now(pytz.utc) -my_string = my_datetime.strftime('%Y-%m-%d %H:%M:%S %z') ----- -==== - -[TIP] -==== -See the very first example https://docs.pytest.org/en/6.2.x/parametrize.html[here] for how to parametrize a test for a function accepting 2 arguments instead of 1. -==== - -[TIP] -==== -The `zip` function in Python will be particularly useful. Note in first example https://docs.pytest.org/en/6.2.x/parametrize.html[here], that the second argument to the `@pytest.mark.parametrize()` decorator is a list of tuples. `zip` accepts _n_ lists of _m_ elements, and returns a list of _m_ tuples, where each tuple contains the elements of the lists in the same order. 
- -[source,python] ----- -zip([1,2,3], [5,5,5], [9 for i in range(3)]) ----- -==== - -[TIP] -==== -You do _not_ need to manually calculate the expected result for each combination of datetime's that you will pass to the `time_difference` function. Since you know exactly how many seconds you put between the datetime's you automatically generated, you can just use those values for the expected result. For example, if you generated 100 datetimes that are each 1 day apart, you will know that the expected difference is 86400 seconds, and could pass a list of the value 86400 repeated 100 times to the `test_time_difference` function's third argument (the expected results). -==== - -Run the `pytest` tests from a bash cell in your notebook. - -[source,ipython] ----- -%%bash - -cd $HOME/project05 -python -m pytest ----- - -**Relevant topics:** xref:book:python:pytest.adoc#parametrizing-tests[parametrizing tests] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Read and understand the `filter_elements` method in the `WatchData` class. This is an example of a function that accepts another function as an argument. A term for functions that accept other functions as arguments are _higher-order functions_. Think about the `filter_elements` method, and list at least 2 reasons _why_ this may be a good idea. - -The docstring contains an explanation of the `filter_elements` method. In addition, the module provides a function called `example_filter` that can be used with the `filter_elements` method. - -Use the `example_filter` function to filter the `WatchData` class, and print the first 5 results. - -[TIP] -==== -Remember to import and use the package, make sure that the notebook is in the same directory as the `apple_watch_parser` package. - -[source,python] ----- -from apple_watch_parser import watch_data - -dat = watch_data.WatchData('/depot/datamine/data/apple/health/2021/') -print(dat) ----- -==== - -[TIP] -==== -When passing a function as an argument to another function, you should _not_ include the opening and closing parentheses in the argument. For example, the following is _not_ correct. - -[source,python] ----- -dat.filter_elements(example_filter()) ----- - -Why? Because the `example_filter()` part will try to _evaluate_ the function and will essentially be translated into the output of running `example_filter()`, and we don't want it to. We want to pass the function itself, so that the `filter_elements` method can _use_ the `example_filter` function internally. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Write your own `*_filter` function in a Python code cell in your notebook (like `example_filter`) that can be used with the `filter_elements` method. Be sure to include a Google style docstring (no doctests are needed). - -Does it work as intended? Print the first 5 results when using your filter. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -In the previous project, we did _not_ test out the `filter_elements` method in our `WatchData` class. Testing this method is complicated for two main reasons. - -. The method accepts _any_ function following a set of rules (described in our docstring) as an argument. 
This (a `*_filter` function) may not be something that is available after immediately importing the `WatchData` class -- normally there wouldn't be an `example_filter` function in the module for you to use, as this would be a function that a user of the library would create for their own purposes. -. In order to be able to test the `filter_elements` method, we would need a dataset that is similarly structured as the intended dataset (Apple Watch exports), that we _know_ the expected output for, so we can test. - -`pytest` supports writing fixtures that can be used to solve these problems. - -To address problem (1): - -- Remove the `example_filter` function from the `watch_data.py` module, and instead, modify the `test_watch_data.py` file and add the `example_filter` function to the `test_watch_data.py` module as a `pytest` fixture. Read https://docs.pytest.org/en/6.2.x/fixture.html#what-fixtures-are[this] and 2 or 3 following sections. In addition, see https://stackoverflow.com/a/44701916[this] stackoverflow answer to better understand how to create a fixture that is a function that can accept arguments. - -[CAUTION] -==== -You may need to import lxml and other libraries in your `test_watch_data.py` file. For safety, you can just add the following. - -[source,python] ----- -import watch_data -import pytest -from pathlib import Path -import os -import lxml.etree -import pytz -import datetime ----- -==== - -[NOTE] -==== -Why do we need to do something like the stackoverflow post describes? The reason is, by default, `pytest` will assume that the argument, `element`, to the `example_filter` function is a fixture itself, and won't work! This is the workaround. -==== - -[TIP] -==== -In the example in the https://stackoverflow.com/a/44701916[stackoverflow post], the `_method(a,b)` function is the equivalent of the `example_filter` function. - -As a side note, sometimes helper functions (functions defined and used inside of another function) are called helper functions, and it is good practice to name them starting with an underscore -- just like the `_method(a,b)` function in the stackoverflow post. -==== - -[TIP] -==== -You can start by cutting the `example_filter` function from `watch_data.py` and paste it in `test_watch_data.py`. Then, to make it a _fixture_, wrap it in another function just like in the https://stackoverflow.com/a/44701916[stackoverflow post]. -==== - -To address problem (2): - -- Create a new `test_data` directory in your `apple_watch_parser` package. So, `$HOME/project05/apple_watch_parser/test_data` should now exist. Add `/depot/datamine/data/apple/health/2021/sample.xml` to this directory, and rename it to `export.xml`. So, `$HOME/project05/apple_watch_parser/test_data/export.xml` should now exist. -+ -[NOTE] -==== -`sample.xml` is a small sample of the the watch data that we can use for out tests. It is small enough to be portable, yet is similar enough to the intended types of datasets that it will be a good way to test our `WatchData` class and its methods. Since we renamed it to `export.xml`, it will work with our `WatchData` class. -==== -+ -- Create a `test_filter_elements` function in your `test_watch_data.py` module. Use https://pypi.org/project/pytest-datafiles/[this] library (already installed), to handle properly copying the `test_data/export.xml` file to a temporary directory for the test. Examples 2 and 3 https://pypi.org/project/pytest-datafiles/[here] will be particularly helpful. 
-+ -[NOTE] -==== -You may be wondering _why_ we would want to use this library for our test rather than just hard-coding the path to our test files in our test function(s). The reason is the following. What if one of your functions had a side-effect that _modified_ your test data? Then, any other tests you run using the same data would be tainted and potentially fail! Bad news. This package allows for a systematic way to first copy our test data to a temporary location, and _then_ run our test using the data in that temporary location. - -In addition, if you have many test functions that work on the _same_ dataset, you can do something like the following to re-use the code over and over again. - -[source,python] ----- -export_xml_decorator = pytest.mark.datafiles(...) - -@export_xml_decorator -def test_1(datafiles): - pass - -@export_xml_decorator -def test_2(datafiles): - pass ----- - -Each of the tests, `test_1` and `test_2`, will work on the same example dataset, but will have a fresh copy of the dataset each time. Very cool! -==== -+ -[TIP] -==== -The decorator, `@pytest.mark.datafiles()`, expects a path to the test data, `export.xml`. To get the absolute path to the test data, `$HOME/project05/apple_watch_parser/test_data/export.xml`, you can use the `pathlib` library. - -.test_watch_data.py -[source,python] ----- -import watch_data # since watch_data.py is in the same directory as test_watch_data.py, we can import it directly -from pathlib import Path - -# To get the path of the watch_data Python module -this_module_path = Path(watch_data.__file__).resolve().parent -print(this_module_path) # $HOME/project05/apple_watch_parser - -# To get the test_data folder's absolute path, we could then do -print(this_module_path / 'test_data') # $HOME/project05/apple_watch_parser/test_data - -# To get the test_data/export.xml absolute path, we could then do ...? -# HINT: The answer to this question is _exactly_ what should be passed to the `@pytest.mark.datafiles()` decorator. -@pytest.mark.datafiles(answer_here) -def test_filter_elements(datafiles, example_filter_fixture): # replace example_filter_fixture with the name of your fixture function - pass ----- -==== - -Okay, great! Your `test_watch_data.py` module should now have 2 additional functions, "symbolically" something like this: - -[source,python] ----- -# from https://stackoverflow.com/questions/44677426/can-i-pass-arguments-to-pytest-fixtures -@pytest.fixture -def my_fixture(): - - def _method(a, b): - return a*b - - return _method - -@pytest.mark.datafiles(answer_here) -def test_filter_elements(datafiles, my_fixture): - pass ----- - -Fill in the `test_filter_elements` function with at least 1 `assert` statement that tests the `filter_elements` function. It could be as simple as comparing the length of the output when using the `example_filter` function as our filter. `test_data/export.xml` should return 2 elements using our `example_filter` function as the filter. - -[TIP] -==== -As a reminder, to run `pytest` from a bash cell in your notebook (which should be in the same directory as your `apple_watch_parser` directory, or `$HOME/project05/apple_watch_parser/firstname-lastname-project05.ipynb`), you can run the following. - -[source,ipython] ----- -%%bash - -cd $HOME/project05 -python -m pytest ----- -==== - -[NOTE] -==== -If you get an error that says pytest.mark.datafiles isn't defined, or something similar, do not worry, this can be ignored.
Alternatively, if you add a file called `pytest.ini` to your `$HOME/project05` directory, with the following contents, this warning will go away. - -.pytest.ini ----- -[pytest] -markers = - datafiles: mark a test as a datafiles. ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Create an additional method in the `WatchData` class in the `watch_data.py` module that does something interesting or useful with the data. Be sure to include a Google style docstring (no doctests are needed). In addition, write 1 or more `pytest` tests for your new method that uses fixtures. Make sure your test passes (you can run your `pytest` tests from a `bash` cell in your notebook). - -If you are up for a bigger challenge, design your new method to be similar to `filter_elements` in that a user can write their own functions or classes that can be passed to it (as arguments) in order to accomplish something useful that they _may_ want to be customized. - -[IMPORTANT] -==== -We will count the use of the `@pytest.mark.datafiles()` decorator as a fixture, if you decide to not complete the "bigger challenge". -==== - -Make sure to run the `pytest` tests from a bash cell in your notebook. - -[source,ipython] ----- -%%bash - -cd $HOME/project05 -python -m pytest ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project06.adoc deleted file mode 100644 index bd34b2178..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project06.adoc +++ /dev/null @@ -1,374 +0,0 @@ -= STAT 39000: Project 6 -- Fall 2021 - -== Sharing Python code: Virtual environments & git part I - -**Motivation:** Python is an incredible language that enables users to write code quickly and efficiently. When using Python to write code, you are _typically_ making a tradeoff between developer speed (the time in which it takes to write a functioning program or scripts) and program speed (how fast your code runs). This is often the best choice depending on your staff and how much your software developers or data scientists earn. However, Python code does _not_ have the advantage of being able to be compiled to machine code for a certain architecture (x86_64, ARM, Darwin, etc.), and easily shared. In Python you need to learn how to use virtual environments (and git) to share your code. - -**Context:** This is the first in a series of 3 projects that explores how to setup and use virtual environments, as well as some `git` basics. This series is not intended to teach you everything you need to know, but rather to give you some exposure so the terminology and general ideas are not foreign to you. - -**Scope:** Python, virtual environments, git - -.Learning Objectives -**** -- Explain what a virtual environment is and why it is important. -- Create, update, and use a virtual environment to run somebody else's Python code. 
-- Use git to create a repository and commit changes to it. -- Understand and utilize common `git` commands and workflows. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -[NOTE] -==== -While this project may _look_ like it is a lot of work, it is probably one of the easiest projects you will get this semester. The question text is long, but it is mostly just instructional content and directions. If you just carefully read through it, it will probably take you well under 1 hour to complete! -==== - -=== Question 1 - -Sign up for a free GitHub account at https://github.com[https://github.com]. If you already have a GitHub account, perfect! - -Once complete, type your GitHub username into a markdown cell. - -.Items to submit -==== -- Your GitHub username in a markdown cell. -==== - -=== Question 2 - -We've created a repository for this project at https://github.com/TheDataMine/f2021-stat39000-project6. You'll quickly see that the code will be ultra familiar to you. The goal of this question, is to xref:book:git:git.adoc#clone[clone] the repository to your `$HOME` directory. Some of you may already be rushing off to your Jupyter Notebook to run the following. - -[source,ipython] ----- -%%bash - -git clone https://github.com/TheDataMine/f2021-stat39000-project6 ----- - -Don't! Instead, we are going to take the time to setup authentication with GitHub using SSH keys. Don't worry, it's _way_ easier than it sounds! - -[NOTE] -==== -P.S. As usual, you should have a notebook called `firstname-lastname-project06.ipynb` (or something similar) in your `$HOME` directory, and you should be using `bash` cells to run and track your `bash` code. -==== - -The first step is to create a new SSH key pair on Brown, in your `$HOME` directory. To do that, simply run the following in a bash cell. - -[IMPORTANT] -==== -If you know what an SSH key pair is, and already have one setup on Brown, you can skip this step. -==== - -[source,ipython] ----- -%%bash - -ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519 -C "lastname_brown_key" ----- - -When prompted for a passphrase, just press enter twice _without_ entering a passphrase. If it doesn't prompt you, it probably already generated your keys! Congratulations! You have your new key pair! - -So, what is a key pair, and what does it look like? A key pair is two files on your computer (or in this case, Brown). These files live inside the following directory `~/.ssh`. Take a look by running the following in a bash cell. - -[source,bash] ----- -ls -la ~/.ssh ----- - -.Output ----- -... -id_ed25519 -id_ed25519.pub -... ----- - -The first file, `id_ed25519` is your _private_ key. It is critical that you do not share this key with anybody, ever. Anybody in possession of this key can login to any system with an associated _public_ key, as _you_. As such, on a shared system (with lots of users, like Brown), it is critical to assign the correct permissions to this file. Run the following in a bash cell. - -[source,bash] ----- -chmod 600 ~/.ssh/id_ed25519 ----- - -This will ensure that you, as the _owner_ of the file, have the ability to both read and write to this file. At the same time, this prevents any other user from being able to read, write, or execute this file (with the exception of a superuser). It is also important get the permissions of files within `~/.ssh` correct, as `openssh` will not work properly otherwise (for safety). 
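If you would like to double-check that the permissions took effect, one quick (and entirely optional) way is to print them in both human-readable and octal form. This assumes GNU `stat`, which is what you will find on a Linux system like Brown.

[source,bash]
----
# %A prints the permissions as a string (-rw-------), %a prints the octal value (600), %n prints the file name
stat -c "%A %a %n" ~/.ssh/id_ed25519
----

For the private key, you should see `-rw------- 600`, matching the `chmod 600` command you just ran.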
- -Great! The other file, `id_ed25519.pub` is your _public_ key. This is the key that is shareable, and that allows a third party to verify that "the user trying to access resource X has the associated _private_ key." First, lets set the correct permissions by running the following in a bash cell. - -[source,bash] ----- -chmod 644 ~/.ssh/id_ed25519.pub ----- - -This will ensure that you, as the _owner_ of the file, have the ability to both read and write to this file. At the same time, everybody else on the system will have read and execute permissions. - -Last, but not least run the following to correctly set the permission of the `~/.ssh` directory. - -[source,ipython] ----- -%%bash - -chmod 700 ~/.ssh ----- - -Now, take a look at the contents of your _public_ key by running the following in a bash cell. - -[source,ipython] ----- -%%bash - -cat ~/.ssh/id_ed25519.pub ----- - -Not a whole lot to it, right? Great. Copy this file to your clipboard. Now, navigate and login to https://github.com if you haven't already. Click on your profile in the upper-right-hand corner of the screen, and then click btn:[Settings]. - -[NOTE] -==== -If you haven't already, this is a fine time to explore the various GitHub settings, set a profile picture, add a bio, etc. -==== - -In the left-hand menu, click on btn:[SSH and GPG keys]. - -In the next screen, click on the green button that says btn:[New SSH key]. Fill in the "Title" field with anything memorable. I like to put a description that tells me where I generated the key (on what computer), for example, "brown.rcac.purdue.edu". That way, I can know if I can delete that key down the road when cleaning things out. In the "Key" field, paste your public key (the output from running the `cat` command in the previous code block). Finally, click the button that says btn:[Add SSH key]. - -Congratulations! You should now be able to easily authenticate with GitHub from Brown, how cool! To test the connection, run the following in a cell. - -[source,ipython] ----- -!ssh -o "StrictHostKeyChecking no" -T git@github.com ----- - -[NOTE] -==== -If you use the following -- you will get an error, but as long as it says "Hi username! ..." at the top, you are good to go! - -[source,ipython] ----- -%%bash - -ssh -T git@github.com ----- -==== - -If you were successful, it should reply with something like: - ----- -Hi username! You've successfully authenticated, but GitHub does not provide shell access. ----- - -[NOTE] -==== -If it asks you something like "Are you sure you want to continue connecting (yes/no)?", type "yes" and press enter. -==== - -Okay, FINALLY, let's get to the actual task! Clone the repository to your `$HOME` directory, using SSH rather than HTTPS. - -[TIP] -==== -If you navigate to the repository in the browser, click on the green "<> Code" button, you will get a dropdown menu that allows you to select "SSH", which will then present you with the string you can use in combination with the `git clone` command to clone the repository. -==== - -Upon success, you should see a new folder in your `$HOME` directory, `f2021-stat39000-project6`. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Take a peek into your freshly cloned repository. You'll notice a couple of files that you may not recognize. Focus on the `pyproject.toml` file, and `cat` it to see the contents. - -The `pyproject.toml` file contains the build system requirements of a given Python project. 
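If you have never seen one before, a poetry-managed `pyproject.toml` is just a small TOML file with a few sections. The sketch below is purely illustrative -- the package names and versions are made up and are _not_ the actual contents of this repository's file (use `cat` to see the real thing).

.pyproject.toml (illustrative sketch only)
----
[tool.poetry]
name = "example-project"
version = "0.1.0"
description = "An illustrative example"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.9"
pandas = "^1.3"

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
----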
It can be used with `pip` or some other package installer to download the _exact_ versions of the _exact_ packages (like `pandas`, for example) required in order to build and/or run the project! - -Typically, when you are working on a project, and you've cloned the project, you want to build the exact environment that the developer had set up when developing the project. This way you ensure that you are using the exact same versions of the same packages, so you can expect things to function the same way. This is _critical_, as the _last_ thing you want to have to deal with is figuring out _why_ your code is not working but the developers or project maintainers _is_. - -There are a variety of popular tools that can be used for dependency management and/or virtual environment management in Python. The most popular are: https://docs.conda.io/en/latest/[conda], https://pipenv.pypa.io/en/latest/[pipenv], and https://python-poetry.org/[poetry]. - -[NOTE] -==== -What is a "virtual environment"? In a nutshell, a virtual environment is a Python installation such that the interpreter, libraries, and scripts that are available in the virtual environment are distinct and separate from those in _other_ virtual environments or the _system_ Python installation. - -We will dig into this more. -==== - -There are pros and cons to each of these tools, and you are free to explore and use what you like. Having used each of these tools exclusively for at least 1 year or more, I have had the fewest issues with poetry. - -[NOTE] -==== -When I say "issues" here, I mean unresolved bugs with open tickets on the project's GitHub page. For that reason, we will be using poetry for this project. -==== - -Poetry was used to create the `pyproject.toml` file you see in the repository. Poetry is already installed in Brown. See where by running the following in a bash cell. - -[source,bash] ----- -which poetry ----- - -By default, when creating a virtual environment using poetry, each virtual environment will be saved to `$HOME/.cache/pypoetry`, while this is not particularly bad, there is a configuration option we can set that will instead store the virtual environment in a projects own directory. This is a nice feature if you are working on a shared compute space as it is explicitly clear where the environment is located, and theoretically, you will have access (as it is a shared space). Let's set this up. Run the following command. - -[source,ipython] ----- -%%bash - -poetry config virtualenvs.in-project true -poetry config cache-dir "$HOME/.cache/pypoetry" -poetry config --list ----- - -This will create a `config.toml` file in `$HOME/.config/pypoetry/config.toml` that is where your settings are saved. - -Finally, let's setup your _own_ virtual environment to use with your cloned `f2021-stat39000-project6` repository. Run the following commands. - -[source,bash] ----- -module unload python/f2021-s2022-py3.9.6 -cd $HOME/f2021-stat39000-project6 -poetry install ----- - -[IMPORTANT] -==== -This may take a minute or two to run. -==== - -[NOTE] -==== -Normally, you'd be able to skip the `module unload` part of the command, however, this is required since we are already _in_ a virtual environment (f2021-s2022 kernel). Otherwise, poetry would not install the packages into the correct location. -==== - -This should install all of the dependencies and the virtual environment in `$HOME/f2021-stat39000-project6/.venv`. To check run the following. 
- -[source,bash] ----- -ls -la $HOME/f2021-stat39000-project6/ ----- - -To actually _use_ this virtual environment (rather than our kernel's Python environment, or the _system_ Python installation), preface `python` commands with `poetry run`. For example, let's say we want to run a script in the package. Instead of running `python script.py`, we can run `poetry run python script.py`. Test it out! - -[WARNING] -==== -For each bash cell when running poetry commands -- it is critical the cells begin as follows: - -[source,ipython] ----- -%%bash - -module unload python/f2021-s2022-py3.9.6 ----- - -Otherwise, poetry will not use the correct Python environment. This is a side effect of the way we have our installation, normally, poetry will know to use the correct Python environment for the project. -==== - -We have a file called `runme.py` in the `scripts` directory (`$HOME/f2021-stat39000-project6/scripts/runme.py`). This script just quickly uses our package and prints some info -- nothing special. Run the script using the virtual environment. - -[IMPORTANT] -==== -You may need to provide execute permissions to the runme files. - -[source,bash] ----- -chmod 700 $HOME/f2021-stat39000-project6/scripts/runme.py -chmod 700 $HOME/f2021-stat39000-project6/scripts/runme2.py ----- -==== - -[source,ipython] ----- -%%bash - -module unload python/f2021-s2022-py3.9.6 -chmod 700 $HOME/f2021-stat39000-project6/scripts/runme.py -chmod 700 $HOME/f2021-stat39000-project6/scripts/runme2.py -cd $HOME/f2021-stat39000-project6 -poetry run python scripts/runme.py ----- - -[TIP] -==== -The script will print the location of the `pandas` package as well -- if it starts with `$HOME/f2021-stat39000-project6/.venv/` then you are correctly running the script using our environment! Otherwise, you are not and need to remember to use poetry. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Now, try to run the following script using our virtual environment: `$HOME/f2021-stat39000-project6/scripts/runme2.py`. What happens? - -[IMPORTANT] -==== -Make sure to run the script from the project folder and _not_ from the `$HOME` directory. `poetry` looks for a `pyproject.toml` file in the current directory, and if it doesn't find it, it will throw an error, but this error will not show you what package is missing. So, to be clear. Don't do: - -[source,ipython] ----- -%%bash - -module unload python/f2021-s2022-py3.9.6 -poetry run python $HOME/f2021-stat39000-project6/scripts/runme2.py ----- - -But _do_ run: - -[source,ipython] ----- -%%bash - -module unload python/f2021-s2022-py3.9. -cd $HOME/f2021-stat39000-project6 -poetry run python scripts/runme2.py ----- -==== - -It looks like a package wasn't found, and should be added to our environment (and therefore our `pyproject.toml` file). Run the following command to install the package to your virtual environment. - -[source,bash] ----- -module unload python/f2021-s2022-py3.9.6 -cd $HOME/f2021-stat39000-project6 -poetry add packagename # where packagename is the name of the package/module you want to install (that was found to be missing) ----- - -Does the `pyproject.toml` reflect this change? Now try and run the script again -- voila! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Read about at least 1 of the 2 git workflows listed xref:book:git:workflows.adoc[here] (if you have to choose 1, I prefer the "GitHub flow" style). 
Describe in words the process you would use to add a function or method to our repo, step by step, in as much detail as you can. I will start for you, with the "GitHub flow" style. - -. Add the function or method to the `watch_data.py` module in `$HOME/f2021-stat39000-project6/`. -. ... -. Deploy the the branch (this could be a website, or package being used somewhere) for final testing, before merging into the `main` branch where code should be pristine and able to be immediately deployed at any time and function as intended. -. ... - -[TIP] -==== -The goal of this question is to try as hard as you can to understand at a high level what a work flow like this enables, the steps involved, and think about it from a perspective of working with 100 other data scientists and/or software engineers. Any details, logic, or explanation you want to provide in the steps would be excellent! -==== - -[TIP] -==== -You do _not_ need to specify actual `git` commands if you do not feel comfortable doing so, however, it may come in handy in the next project (_hint hint_). -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project07.adoc deleted file mode 100644 index 42c7cdb91..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project07.adoc +++ /dev/null @@ -1,430 +0,0 @@ -= STAT 39000: Project 7 -- Fall 2021 - -**Motivation:** Python is an incredible language that enables users to write code quickly and efficiently. When using Python to write code, you are _typically_ making a tradeoff between developer speed (the time in which it takes to write a functioning program or scripts) and program speed (how fast your code runs). This is often the best choice depending on your staff and how much your software developers or data scientists earn. However, Python code does _not_ have the advantage of being able to be compiled to machine code for a certain architecture (x86_64, ARM, Darwin, etc.), and easily shared. In Python you need to learn how to use virtual environments (and git) to share your code. - -**Context:** This is the second in a series of 3 projects that explores how to setup and use virtual environments, as well as some `git` basics. This series is not intended to teach you everything you need to know, but rather to give you some exposure so the terminology and general ideas are not foreign to you. - -**Scope:** Python, virtual environments, git - -.Learning Objectives -**** -- Explain what a virtual environment is and why it is important. -- Create, update, and use a virtual environment to run somebody else's Python code. -- Use git to create a repository and commit changes to it. -- Understand and utilize common `git` commands and workflows. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
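Since the learning objectives above lean heavily on a handful of `git` commands, here is a bare-bones sketch of the branch-commit-push cycle you will walk through step by step later in this project. The branch name, file name, and commit message below are placeholders, not the ones you will actually use.

[source,bash]
----
# create a new branch and switch to it
git checkout -b my-feature-branch

# ...edit some files...

# stage and commit the changes with a descriptive message
git add path/to/changed_file.py
git commit -m "Add my new feature"

# push the branch to GitHub, then open a pull request in the browser
git push --set-upstream origin my-feature-branch
----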
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/movies_and_tv/imdb.db` - -In addition, the following is an illustration of the database to help you understand the data. - -image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"] - -== Questions - -[NOTE] -==== -This will be another project light on code and data, which we will reintroduce in the next project. While the watch data is a pretty great dataset, I realize that perhaps its format is a distraction from the goal of the project, and not something you want to be fighting with as we being a series on using and writing APIs. We will begin to transition away from the watch data in this project, and instead use movie/tv-related data, which will be a _lot_ more fun to write an API for (hopefully). -==== - -=== Question 1 - -[CAUTION] -==== -If you did not complete project (6), you should go back and complete question (1) and question (2) before continuing. Don't worry, you just need to follow the instructions, there is no critical thinking for those 2 questions. If you get stuck, just write in Piazza. -==== - -As alluded to in question (5) from the previous project, in this project, we will put to work what you learned from the previous project! - -First, if you haven't already, create a `firstname-lastname-project07.ipynb` file in your `$HOME` directory. - -Review and read the content https://guides.github.com/introduction/flow/[here] on GitHub flow. GitHub flow is a workflow or pattern that you can follow that will help you work on the same codebase, with many others, at the same time. You may see where this is going, and it is a little crazy, but let's give this a try and see what happens. - -In this project, we will be "collaborating" with every other 39000 student, but mostly with me. Normally, you would all have explicit permissions in GitHub to work on and collaborate on the repositories in a given organization. For example, if I added you all to TheDataMine GitHub organization, you could simply clone the repository, create your own branch, make modifications, and push the branch up to GitHub. Unfortunately, since _technically_ you aren't in TheDataMine GitHub organization, you can't do that. Instead, you need to _fork_ our repository, clone your fork of the repository, create your own branch, make modifications, and push the branch up to GitHub. Just follow the instructions provided and it will be fine! - -Start by forking our repository. In a browser, navigate to https://github.com/TheDataMine/f2021-stat39000-project7, and in the upper right-hand corner, click the "Fork" button. - -[IMPORTANT] -==== -Make sure you are logged in to GitHub before you fork the repository! -==== - -image::figure15.webp[Fork the repository, width=792, height=500, loading=lazy, title="Fork the repository"] - -This will create a _fork_ of our original repository in _your_ GitHub account. Now, we want to clone _your_ fork of the repo! - -Clone your fork into your `$HOME` directory: - -- YourUserName/f2021-stat39000-project7 - -[IMPORTANT] -==== -Replace "YourUserName" with your GitHub username. -==== - -[NOTE] -==== -Sometimes, repositories will be shown as GitHubOrgName/RepositoryName or GitHubUserName/RepositoryName. The repos will be located at https://github.com/GitHubOrgName/RepositoryName and https://github.com/GitHubUserName/RepositoryName, respectively. 
When using SSH (which we are) to clone those repos, the strings would be git@github.com:GitHubOrgName/RepositoryName.git and git@github.com:GitHubUserName/RepositoryName.git, respectively. - -What does SSH vs HTTPS mean? Read https://docs.github.com/en/get-started/getting-started-with-git/about-remote-repositories[here] for more information. When cloning a repo using HTTPS, it will look something like: - -[source,bash] ----- -git clone https://github.com/user/repo.git ----- - -When cloning a repo using SSH, it will look something like: - -[source,bash] ----- -git clone git@github.com:user/repo.git ----- - -Both work fine, but I've had fewer issues with the latter, so that is what we will stick to for now. -==== - -[IMPORTANT] -==== -Make sure to run the clone command in a bash cell in your `firstname-lastname-project07.ipynb` file. -==== - -[NOTE] -==== -The result of cloning the repository will be a directory called `f2021-stat39000-project7` in your `$HOME` directory. Due to the nature of this project, your cloned repo may contain other students' code, if their code has been merged into the `main` branch -- cool! -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Let's test things out to make sure they are working the way we intended. First, we can see that there is a `pyproject.toml` file and a `poetry.lock` file. Let's use poetry to build our virtual environment to run and test our code. - -In a bash cell in your notebook, run the following: - -[source,ipython] ----- -%%bash - -module unload python/f2021-s2022-py3.9.6 -cd $HOME/f2021-stat39000-project7 -poetry install ----- - -[NOTE] -==== -Recall that the `module unload` command is only needed due to the way we have things configured on Brown -- _typically_ it would be much more straightforward, and we would just run `poetry install`. -==== - -Great! Now, in the next bash cell, test out things by running the `runme.py` script. - -[source,ipython] ----- -%%bash - -# unload the module -module unload python/f2021-s2022-py3.9.6 - -# give execute permissions to the runme.py script -chmod 700 $HOME/f2021-stat39000-project7/scripts/runme.py - -# navigate to inside the project directory (this is needed because your notebook is in your $HOME directory) -cd $HOME/f2021-stat39000-project7 - -# run the runme.py script using our environment -poetry run python scripts/runme.py ----- - -If all went well, you should see something **similar** to the following output. - -.Output ----- -Pandas is here!: /home/kamstut/f2021-stat39000-project7/.venv/lib/python3.9/site-packages/pandas/__init__.py -^^^^^^^ -If that doesnt start with something like "$HOME/f2021-stat39000-project7/.venv/..., you did something wrong -IMDB data from: /depot/datamine/data/movies_and_tv/imdb.db -8.2 ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Okay, great! So far, so good. - -As a very important contributor to our new package, you will be adding a method to our `IMDB` class. This method should use the `aiosql` package to run a query (or more than one query) against the `imdb.db` database, and return some data or do something cool. As an alternative, your method could also do some sort of web scraping for IMDB. Your new method _must_ include a Google style docstring, and _must_ be non-trivial -- for example a method that returns the rating of a title or the name of a title is too simple. 
Any valid effort will be awarded full credit. - -[WARNING] -==== -Before continuing, let's follow the https://guides.github.com/introduction/flow/[first step] of the GitHub flow, and create our own branch to work on and commit changes to. Create a new branch called `firstname-lastname` from the `main` branch. Once created, _checkout_ the branch so it is your active branch. -==== - -[WARNING] -==== -Remember that the `git` commands should be run _inside_ the project folder, `$HOME/f2021-stat39000-project7`. Since our Jupyter notebook, `firstname-lastname-project07.ipynb`, is in the `$HOME` directory, we need to `cd` into the project directory before we can run the `git` commands, for **every** bash cell in our notebook (except for the bash cell where we are cloning the repository). To make it explicitly clear, every bash cell in your notebook that isn't cloning the repo should have: - -[source,bash] ----- -cd $HOME/f2021-stat39000-project7 ----- - -_Before_ you run the `git` commands. -==== - -Please take a look at the `get_rating` method in the `imdb.py` module for an example of a method. - -Please take a look at the `imdb_queries.sql` file, to see how a query is written using this package. https://nackjicholson.github.io/aiosql/defining-sql-queries.html[Here] is the official documentation for `aiosql`. - -[NOTE] -==== -Note that since we will _just_ be reading from the database, you will want to limit yourself to queries that are "Select One" (ending in a "^"), or "Select Value" (ending in a "$"), or "No Operator" (ending in no symbol). -==== - -Please take a look at `runme.py` to see how we used the `tdm_media` package. - -To make these additions to the package you will need to: - -. Modify the `imdb.py` module to add the new method. -+ -[WARNING] -==== -For simplicity, call your new method `firstname_lastname` in the `imdb.py` module. Where you would replace `firstname` and `lastname` with your first and last name, respectively. -==== -+ -[NOTE] -==== -If you want to have examples of `title_id` values and `person_id` values, look no further than https://imdb.com! For example, let's say I want Peter Dinklage's person_id -- to get this, all I have to do is search for him on the IMDB website. I will be sent to a link similar to the following. - -https://www.imdb.com/name/nm0227759 - -Here, you can see Peter Dinklage's person_id in the URL itself! It is "nm0227759". - -Same for title_ids -- simply search for the movie or tv show or tv show episode you are curious about, and the `title_id` will be right in the URL. -==== -. Modify the `imdb_queries.sql` file to add any new queries you need in order to get your `firstname_lastname` method working. -+ -[WARNING] -==== -For simplicity, call your new queries `firstname_lastname_XX` in the `imdb_queries.sql` file. Where you would replace `firstname` and `lastname` with your first and last name, respectively, and you would replace `XX` with a counter like `01`, `02`, etc. - -For example, if I had two queries my additions would look something like this: - -.imdb_queries.sql -[source,sql] ----- --- name: kevin_amstutz_01$ --- Get the rating of the movie/tv episode/short with the given id -SELECT rating FROM ratings WHERE title_id = :title_id; - --- name: kevin_amstutz_02$ --- Get the rating of the movie/tv episode/short with the given id -SELECT rating FROM ratings WHERE title_id = :title_id; ----- -==== -+ -. Create a new script in the scripts directory called `firstname_lastname.py`. 
-+ -[TIP] -==== -The following is some boilerplate code for your `firstname_lastname.py` script. - -[source,python] ----- -import sys -from pathlib import Path -sys.path.insert(0, str(Path(__file__).resolve().parents[1])) - -from tdm_media.imdb import IMDB -import pandas as pd - -def main(): - - dat = IMDB("/depot/datamine/data/movies_and_tv/imdb.db") - - # code to use your method here, for example: - print(dat.get_rating("tt5180504")) - -if __name__ == '__main__': - main() ----- -==== -+ -. Finally, if your new method uses a library not already included in our environment, you will need to install it. -+ -[TIP] -==== -To add the library (if and only if it is needed): - -[source,ipython] ----- -%%bash - -module unload python/f2021-s2022-py3.9.6 -cd $HOME/f2021-stat39000-project7 -poetry add thedatamine ----- - -Replace "thedatamine" with the name of the package you need. -==== - -Great! Once you've made these modifications, in a bash cell, run your new script and see if the output is what you expect it to be! - -[source,ipython] ----- -%%bash - -# unload the module -module unload python/f2021-s2022-py3.9.6 - -# give execute permissions to the firstname_lastname.py script -chmod 700 $HOME/f2021-stat39000-project7/scripts/firstname_lastname.py - -# navigate to inside the project directory (this is needed because your notebook is in your $HOME directory) -cd $HOME/f2021-stat39000-project7 - -# run the firstname_lastname.py script using our environment -poetry run python scripts/firstname_lastname.py ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Fantastic! We have implemented our new things, and we are ready to continue with the GitHub flow! - -In a bash cell, navigate to the root of the project directory, `$HOME/f2021-stat39000-project7`, and stage any new files you've created that you would like to commit. - -[source,ipython] ----- -%%bash - -cd $HOME/f2021-stat39000-project7 -git add . ----- - -Excellent! Now, _commit_ the new files and changes. Be sure to include a commit message that describes what you've done. - -[TIP] -==== -Using `git commit` requires having a message with your commit! To add a message, simply use the `-m` flag. So for example. - -[source,bash] ----- -git commit -m "This is my fantastic new function." ----- -==== - -[NOTE] -==== -Normally, you'd add and commit files and changes as you are writing the code. However, since this is all so new, we set this up so you just add and commit all at once. -==== - -The next step in the GitHub flow would be to open a pull request. First, before we do that, we have to _push_ the changes we've made locally, on Brown, to our _remote_ (GitHub). To do this, in a bash cell, run the following command: - -[source,ipython] ----- -%%bash - -cd $HOME/f2021-stat39000-project7 -git push --set-upstream origin firstname-lastname ----- - -[IMPORTANT] -==== -Replace firstname-lastname with your first and last name, respectively. It is the name of your branch you created in question (3). -==== - -Once run, if you navigate to your fork's GitHub page, https://github.com/YourUserName/f2021-stat39000-project7, you should be able to refresh the webpage and see your new branch in the dropdown menu for branches. - -image::figure07.webp[Looking at the branches, width=792, height=500, loading=lazy, title="Looking at the branches"] - -Awesome! Okay, now you are ready to open a pull request. A pull request needs to be opened in the browser. 
Navigate to the project page https://github.com/YourUserName/f2021-stat39000-project7, click on the "Pull requests" tab, then click on "New pull request". - -We want to create a pull request that merges your branch, `firstname-lastname`, into the `main` branch. Select your branch from the menu on the right side of the left arrow, and click "Create pull request". - -image::figure08.webp[Selecting what to merge, width=792, height=500, loading=lazy, title="Selecting what to merge"] - -image::figure09.webp[Screen when selected, width=792, height=500, loading=lazy, title="Screen when selected"] - -Enter the important information in the boxes. Describe what your function does, and why you want to merge it into the main branch. Once satisfied, in a comment box, write something like "@kevinamstutz Could you please review this?". - -image::figure10.webp[Filling out the pull request, width=792, height=500, loading=lazy, title="Filling out the pull request"] - -Click "Create pull request", and you should see a screen similar to the following. - -image::figure11.webp[Resulting screen, width=792, height=500, loading=lazy, title="Resulting screen"] - -Write back and forth with me at least once, and when you are good to go, I will write back and merge the PR. - -Take a screenshot of the final result, after the PR is merged. - -image::figure12.webp[Final result, width=792, height=500, loading=lazy, title="Final result"] - -[IMPORTANT] -==== -If I do not respond back and merge fast enough, it is OK to take a screenshot of the non-merged pull request page -- you will receive full credit. Try to wait though! I'm usually pretty quick! -==== - -Upload the screenshot to your `$HOME` directory, and include them using a markdown cell. - -[TIP] -==== -To include the image in a markdown cell, do the following. The following assumes your image is called `myimage.png` and is located in your `$HOME` directory. It also assumes your notebook is in the `$HOME` directory. - -[source,ipython] ----- -![](./myimage.png) ----- - -Then, run the cell! Your image will appear. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -Okay, what files should you submit for this project? Please submit the following: - -- `firstname-lastname-project07.ipynb` (your notebook). -- Your modified `imdb.py`, with your `firstname_lastname` method. -- Your modified `imdb_queries.sql` file with your added query(s). -- Your script, `firstname_lastname.py`, that uses your `firstname_lastname` method. -==== - -=== Question 5 (optional, 0 pts) - -You've now worked through the entire GitHub flow! That is really great! It definitely can take some time getting used to. If you have the time, and are feeling adventurous, and _excellent_ test of your skills would be to add something to this book! Clone this repository (git@github.com:TheDataMine/the-examples-book.git), add some content, and create a pull request! - -You can add a UNIX, R, Python, or SQL example, no problem! At some point in time, I'll review your addition and you will be an official contributor to the book! Why not? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project08.adoc deleted file mode 100644 index 03d0bd994..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project08.adoc +++ /dev/null @@ -1,411 +0,0 @@ -= STAT 39000: Project 8 -- Fall 2021 - -**Motivation:** Python is an incredible language that enables users to write code quickly and efficiently. When using Python to write code, you are _typically_ making a tradeoff between developer speed (the time in which it takes to write a functioning program or scripts) and program speed (how fast your code runs). This is often the best choice depending on your staff and how much your software developers or data scientists earn. However, Python code does _not_ have the advantage of being able to be compiled to machine code for a certain architecture (x86_64, ARM, Darwin, etc.), and easily shared. In Python you need to learn how to use virtual environments (and git) to share your code. - -**Context:** This is the last project in a series of 3 projects that explores how to setup and use virtual environments, as well as some `git` basics. In addition, we will use this project as a transition to learning about APIs. - -**Scope:** Python, virtual environments, git, APIs - -.Learning Objectives -**** -- Explain what a virtual environment is and why it is important. -- Create, update, and use a virtual environment to run somebody else's Python code. -- Use git to create a repository and commit changes to it. -- Understand and utilize common `git` commands and workflows. -- Understand and use the HTTP methods with the `requests` library. -- Differentiate between graphql, REST APIs, and gRPC. -- Write REST APIs using the `fastapi` library to deliver data and functionality to a client. -- Identify the various components of a URL. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/whin/whin.db` - -== Questions - -[NOTE] -==== -We are _so_ lucky to have great partners in the Wabash Heartland Innovation Network (WHIN)! They generously provide us with access to their API (https://data.whin.org/[here]) for educational purposes. You've most likely either used their API in a previous project, or you've worked with a sample of their data to solve some sort of data-driven problem. - -You can learn more about WHIN at https://whin.org/[here]. - -In this project, we are providing you with our own version of the WHIN API, so you can take a look under the hood, modify things, and have a hands-on experience messing around with an API written in Python! Behind the scenes, our API is connecting to a sqlite database that contains a small sample of the rich data that WHIN provides. -==== - -=== Question 1 - -In a https://thedatamine.github.io/the-examples-book/projects.html#p09-290[previous project], we used the `requests` library to build a CLI application that made calls to the WHIN API. 
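As a quick refresher, a GET request made with the `requests` library has roughly the following shape. The URL, token, and query parameter below are placeholders, not a real WHIN endpoint.

[source,python]
----
import requests

# everything here is illustrative -- substitute a real endpoint and real credentials
response = requests.get(
    "https://api.example.com/weather/stations",
    headers={"Authorization": "Bearer YOUR_TOKEN_HERE"},
    params={"limit": 5},  # query parameters are appended to the URL as ?limit=5
)

print(response.status_code)
print(response.json())
----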
- -Our focus in _this_ project will be to study the WHIN API (and other APIs), with the goal of learning about the components of an API in a hands-on manner. - -Before we _really_ dig in, it is well worth our time to do some reading. There is a _lot_ of information online about APIs. There are a _lot_ of opinions on proper API design. - -[NOTE] -==== -At no point in time will we claim that the way we are going to design our API is the best way to do it. However, we will try and learn from some of the most successful commercial APIs, mainly, the https://stripe.com/docs/api[Stripe API]. -==== - -First thing is first, let's clone our _homage_ to the WHIN API to prevent confusion, we will refer to this as **our** API. Run the following in a bash cell. - -[source,ipython] ----- -%%bash - -cd $HOME -git clone git@github.com:TheDataMine/f2021-stat39000-project8.git ----- - -Then, install the Python dependencies for this project by running the following code in a new bash cell. - -[source,ipython] ----- -%%bash - -module unload python/f2021-s2022-py3.9.6 -cd $HOME/f2021-stat39000-project8 -poetry install ----- - -Finally, let's see if we can _run_ this API on Brown! To do this, we will _not_ be running the API via a bash cell in Jupyter Lab. Instead, we will pop open a terminal, and have it running in another tab. - -Create a file called `.env` (that's it, no extension -- just a file called `.env` with the following text, in a single line) inside your `f2021-stat39000-project8` directory, with the following content. - ----- -DATABASE_PATH=/depot/datamine/data/whin/whin.db ----- - -[NOTE] -==== -A file starting with a period is a _hidden_ file. In UNIX-like systems, you need to add `-a` to the `ls` command to see hidden files. - -Give it a try: - -[source,bash] ----- -ls -la $HOME/f2021-stat39000-project8 ----- -==== - -Then, open a new terminal tab. Click on the blue "+" button in the top left corner of the Jupyter Lab interface. - -image::figure16.webp[Create new Terminal tab, width=792, height=500, loading=lazy, title="Create new Terminal tab"] - -Then, on your kernel selection screen, scroll down until you see the "Terminal" box. Select it to launch a fresh terminal on Brown. - -image::figure17.webp[Select Terminal, width=792, height=500, loading=lazy, title="Select Terminal"] - -The command to run the API is as follows. - -[source,bash] ----- -module use /scratch/brown/kamstut/tdm/opt/modulefiles -module load poetry/1.1.10 -cd $HOME/f2021-stat39000-project8 -poetry run uvicorn app.main:app --reload ----- - -Now, with that being said, it is not _quite_ so simple. We are running this API on Brown, a community cluster with _lots_ of other users, running _lots_ of other applications. By default, fastapi will run on local port 8000. What this means is that if you were on your personal computer, you could pop open a browser and navigate to `http://localhost:8000/` to see the API. The problem _here_ is you _each_ need to be running your API on your _own_ port -- and it is very likely port 8000 is already in use. - -So what are we going to do? Well, one option is to just choose a number, and run your API with _this_ command. - -[source,bash] ----- -module use /scratch/brown/kamstut/tdm/opt/modulefiles -module load poetry/1.1.10 -cd $HOME/f2021-stat39000-project8 -poetry run uvicorn app.main:app --reload --port XXXXX ----- - -Where XXXXX is a number generated using the command below. In a bash cell, run the following code. 
- -[source,bash] ----- -port ----- - -.Output ----- -21650 # your number may be different! ----- - -[IMPORTANT] -==== -You _must_ run this in a bash cell. This bash script lives in the `/scratch/brown/kamstut/tdm/bin` directory, which is _automatically_ added to your `$PATH` in our Jupyter Lab environment. -==== - -Then, given your _available_ port number, run the following from your terminal tab. - -[source,bash] ----- -module use /scratch/brown/kamstut/tdm/opt/modulefiles -module load poetry/1.1.10 -cd $HOME/f2021-stat39000-project8 -poetry run uvicorn app.main:app --reload --port 21650 # if your port was 1111 you'd replace 21650 with 1111 ----- - -[IMPORTANT] -==== -Replace 21650 with the port number from your `port` command you ran earlier. Every time you see 21650 in this project, replace it with **your** port number. -==== - -Once successful, you should see text _similar_ to the following. - ----- -INFO: Will watch for changes in these directories: ['$HOME/f2021-stat39000-project8'] -INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit) -INFO: Started reloader process [94978] using watchgod -INFO: Started server process [94997] -INFO: Waiting for application startup. -INFO: Application startup complete. ----- - -Then, to _see_ the API, or the responses, _normally_ you could just navigate to http://localhost:21650, and enter the URLs there. By default, the browser will GET those responses. Since our compute environment is a little bit more complicated, we will limit GET'ing our responses using the `requests` package. - -Run the following in a cell. - -[source,python] ----- -import requests - -response = requests.get("http://localhost:21650") -print(response.json()) ----- - -You should be presented with an _extremely_ boring result -- a simple "hello world". Yay! You are running an API and even made a GET request to that API using the `requests` package. While this may or may not seem too cool to you, it is pretty awesome! I _hope_ these next few projects will be fun for you! - -[NOTE] -==== -Please send any feedback you may have to kamstut@purdue.edu/mdw@purdue.edu/datamine@purdue.edu. This is the _first_ time we are testing out these project ideas, so any feedback -- positive or negative -- is welcome! I've already made a lot of notes to make some of the earlier projects less time consuming. We ultimately want to make these projects fun, give you some exposure to cool techniques used in industry, and hopefully make you a better programmer/statistician/nurse/whathaveyou. With that being said, I have definitely missed the mark many times, and your feedback helps a lot. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Great! Now, you have **our** API running on Brown. Now its time to learn about what the heck an API is. There are a _lot_ of different types of APIs. The most common used today are RESTful APIs (what we will be focusing on, probably the most popular), graphQL APIs, and gRPC APIs. - -https://www.redhat.com/architect/apis-soap-rest-graphql-grpc[This] is a decent article highlighting the various types of APIs (feel free to skip the antiquated SOAP). Summarize the 3 mentioned APIs (RESTful, gRPC, and graphQL) in 1-2 sentences, and write at least 1 pro and 1 con of each. - -As I mentioned before, it makes the most sense to focus on RESTful APIs at this point in time, however, gRPC and graphQL have some serious advantages that make them very popular in industry. 
It is likely you will run into some of these in your future work. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Since it is not so straightforward to pull up the _automatically_ generated, interactive, API documentation, we've provided a screenshot below. - -image::figure18.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"] - -image::figure19.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"] - -image::figure20.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"] - -image::figure21.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"] - -image::figure22.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"] - -image::figure23.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"] - -image::figure24.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"] - -image::figure25.webp[API Documentation, width=792, height=500, loading=lazy, title="API Documentation"] - -Awesome! There are some pretty detailed docs that we incorporated. - -Let's make a _request_ to our API. Once we make a _request_ to our API, we will receive a _response_ back. The main components of a request are: - -- The _method_ (GET, POST, PUT, DELETE, etc.) -- The _path_ (the URL path) -- The _headers_ (the HTTP headers) -- The _body_ (the data that is sent in the request) - -Thats it! - -The only method we will talk about in this project is the GET method. If you want a list of methods, simply Google "HTTP methods" and you should find a list of all the methods. - -The GET method is the same method that browsers primarily utilize when they navigate to a website. They GET the website content. - -The _path_ starts after the URL. In our case, the path was `/docs/` to get the docs! The path highlights the resource we are trying to access. - -The _headers_ are sent with the request and can be used for a wide variety of things. For example, in the next question, we will use a header to authenticate with the _real_ WHIN API and make a request. - -Finally, the _body_ is the data that is sent with the request. In our case, we will not be sending any data with our request, instead, we will be receiving data in the body of our _response_. - -To make a response to our API, we can use the `requests` package. Run the following in a Python cell. - -[source,python] ----- -import requests - -response = requests.get('http://localhost:21650/stations/') ----- - -`response` will then contain your -- response! If you look over in your terminal tab, you will see that **our** API logged the request we made. - -The response will contain a status code. You can see a list of status codes, and what they mean https://developer.mozilla.org/en-US/docs/Web/HTTP/Status[here]. - -To get the status code from your `response` variable, try the following. - -[source,python] ----- -response.status_code ----- - -Run the following to get a list of the methods and attributes available to you with the response object. - -[source,python] ----- -dir(response) ----- - -You can see a lot -- this is a useful "trick" in python. Alternatively, like most dunder methods, you could also run the following. - -[source,python] ----- -response.__dir__() ----- - -This is the same as: - -[source,python] ----- -dir(response) ----- - -Okay, great! 
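One small habit worth forming, although not required for this project, is to confirm that a request actually succeeded before parsing the body. A short sketch (remember to swap in your own port number):

[source,python]
----
import requests

response = requests.get("http://localhost:21650/stations/")

# raise_for_status() raises requests.HTTPError for any 4xx or 5xx status code,
# and does nothing at all when the request succeeded
response.raise_for_status()

stations = response.json()
print(f"received {len(stations)} station records")
----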
- -You can get the headers like this: - -[source,python] ----- -response.headers ----- - -You can get the pure text of the response like this: - -[source,python] ----- -response.text ----- - -Finally, to the the JSON formatted body of the response, you can use the json method, which will return a list of dicts containing the data! - -[source,python] ----- -response.json() # the open and closed parenthesis are important. `json()` is a method not an attribute (like `.text`), so the parentheses are important. ----- - -As you _may_ have ascertained, the endpoint, `http://localhost:21650/stations/`, will return a list of station objects -- very cool! - -In another tab in your regular browser running on your local machine, navigate to the https://data.whin.org/data/current-conditions[official WHIN api docs] (you may need to login). Follow the directions at the beginning of https://thedatamine.github.io/the-examples-book/projects.html#p09-290[this project] to be able to authenticate with the WHIN API (questions 1 _and_ 2). - -Next, make sure you followed the instructions in question (2) from https://thedatamine.github.io/the-examples-book/projects.html#p09-290[this project] and that your `.env` file now contains something like: - -..env file ----- -DATABASE_PATH=/depot/datamine/data/whin/whin.db -MY_BEARER_TOKEN=eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VyIjp7ImlkIjo5LCJkaXsdgw3ret234gBbXN0dXR6IiwiYWNjb3VudF90eXBlIjoiZWR1Y2F0aW9uIn0sImlhdCI6MTYzNDgyMzUyOSwibmJmIjoxNjM0ODIzNTI5LCJleHAiOjE2NjYzNTk1MjksImlzcyI6Imh0dHBzOi8vd2hpbi5vcmcifQ.LASER2vFONRhkdrPtEwca0eGxCtbjJ4Btaurgerg7l27z_Rwqhy1gghdFpscLFkFzfVw7VUdV_hlJ1rzmHi8i75hcLEUL18T76kdY82yb7Q8b_YTB32iQnJDP3uVQP5sQWs5mv8HcEj6W7jNX5HQe-iItzBXVAcMBUmR0SK9Pt2JRmCbuHpM242JJqwBvEMZw1mjNWGs70c595QqyxaUtgrSSmMBbZQeaN21U9EuSEjUKBRgtjl-9t-IhLkLVNo008Vq4v-sA ----- - -If you are having a hard time adding another line to your `.env` file, you can also run the following in a bash cell to _append_ the line to your `.env` file. **Make sure you replace the token with _your_ token.** - -[source,bash] ----- -echo "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VyIjp7ImlkIjo5LCJkaXsdgw3ret234gBbXN0dXR6IiwiYWNjb3VudF90eXBlIjoiZWR1Y2F0aW9uIn0sImlhdCI6MTYzNDgyMzUyOSwibmJmIjoxNjM0ODIzNTI5LCJleHAiOjE2NjYzNTk1MjksImlzcyI6Imh0dHBzOi8vd2hpbi5vcmcifQ.LASER2vFONRhkdrPtEwca0eGxCtbjJ4Btaurgerg7l27z_Rwqhy1gghdFpscLFkFzfVw7VUdV_hlJ1rzmHi8i75hcLEUL18T76kdY82yb7Q8b_YTB32iQnJDP3uVQP5sQWs5mv8HcEj6W7jNX5HQe-iItzBXVAcMBUmR0SK9Pt2JRmCbuHpM242JJqwBvEMZw1mjNWGs70c595QqyxaUtgrSSmMBbZQeaN21U9EuSEjUKBRgtjl-9t-IhLkLVNo008Vq4v-sA" >> $HOME/f2021-stat39000-project08/.env ----- - -[IMPORTANT] -==== -You must replace the "MY_BEARER_TOKEN" with **your** token from https://data.whin.org/account[this page]. -==== - -When configured, make the following request. - -[source,python] ----- -import requests -import os -from dotenv import load_dotenv - -load_dotenv(os.getenv("HOME")+"/f2021-stat39000-project8/.env") - -my_headers = {"Authorization": f"Bearer {os.getenv('MY_BEARER_TOKEN')}"} -response = requests.get("https://data.whin.org/api/weather/stations", headers = my_headers) -print(response.json()) ----- - -You'll find that the responses are very similar -- but of course, ours is just a sample of theirs. - -Notice that the response is pretty long, but it is a _list_ of dictionaries, so we can easily print the first 5 values only, like this. - -[source,python] ----- -print(response.json()[:5]) ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 4 - -You've successfully made a _request_ to both **our** API (which you are running in the terminal tab), and the WHIN API -- very cool! - -Read the documentation provided for **our** API in the screenshots in question (3), and make a request with a _query parameter_. A _query parameter_ is a parameter added to the end of the URL itself. Query parameters start with a "?", have key/value pairs separated by "=", and many can be strung together using "&" to separate them. For example: - ----- -http://localhost:21650/some_endpoint?queryparam1key=queryparam1value&queryparam2key=queryparam2value ----- - -Here, we have 2 query parameters, `queryparam1key` and `queryparam2key`, and their values are `queryparam1value` and `queryparam2value`, respectively. - -In **our** API, there are a few endpoints that give you optional query parameters (see the images in question (3)) -- use the `requests` library to test it out and make a request involving at least 1 query parameter with any of the endpoints we provide with **our** API. - -Now, try and replicate the request using the original WHIN API -- were you able to fully replicate it? - -[NOTE] -==== -When we ask "were you able to fully replicate it", all we want to know is if the WHIN API happens to provide the same functionality. -==== - -The two APIs are pretty different, and provide different functionality. No two APIs are the same, and depending on the purpose of your API, you may build it differently! Very cool! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Make a new request to **our** API, and use at least 2 query parameters in your request -- do the results make sense based on what you've read in the docs? Why or why not? - -In websites, a common feature is _pagination_ -- the ability to page through lots of results, one page at a time. Often this will look like a "Next" and "Previous" button on a webpage. Which of the query parameters would be useful for pagination in our API, and why? - -Finally, make a new request to the original WHIN API. Specifically, try and test out the very cool `current-conditions` endpoint that allows you to zero in on stations near a certain latitude and longitude. Can you replicate this with our API, or do we not have that capability baked in? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project09.adoc deleted file mode 100644 index 7b8dd3d13..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project09.adoc +++ /dev/null @@ -1,522 +0,0 @@ -= STAT 39000: Project 9 -- Fall 2021 - -**Motivation:** One of the primary ways to get and interact with data today is via APIs. APIs provide a way to access data and functionality from other applications. There are 3 very popular types of APIs that you will likely encounter in your work: RESTful APIs, GraphQL APIs, and gRPC APIs. 
We will address some pros and cons of each, with a focus on the most ubiquitous, RESTful APIs. - -**Context:** This is the second in a series of 4 projects focused around APIs. We will learn some basics about interacting and using APIs, and even build our own API. - -**Scope:** Python, APIs, requests, fastapi - -.Learning Objectives -**** -- Understand and use the HTTP methods with the `requests` library. -- Differentiate between graphql, REST APIs, and gRPC. -- Write REST APIs using the `fastapi` library to deliver data and functionality to a client. -- Identify the various components of a URL. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/movies_and_tv/imdb.db` - -In addition, the following is an illustration of the database to help you understand the data. - -image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"] - -== Questions - -=== Question 1 - -Begin this project by cloning our repo and installing the required packages. To do so, run the following in a bash cell. - -[source,ipython] ----- -%%bash - -cd $HOME -git clone git@github.com:TheDataMine/f2021-stat39000-project9.git ----- - -Then, to install the required packages, run the following in a bash cell. - -[source,ipython] ----- -%%bash - -module unload python/f2021-s2022-py3.9.6 -cd $HOME/f2021-stat39000-project9 -poetry install ----- - -Next, let's identify a port that we can run our API on. In a bash cell, run the following. - -[source,ipython] ----- -%%bash - -port ----- - -You will get a port number, like the following, for example. - -.Output ----- -1728 ----- - -From this point on, when we mention the port 1728, please replace it with the port number you were assigned. Open a new terminal tab so that we can run our API, alongside our notebook. - -Next, you'll need to add a `.env` file to your `f2021-stat39000-project9` directory, with the following content. (Pretty much just like the previous project!) - ----- -DATABASE_PATH=/depot/datamine/data/movies_and_tv/imdb.db ----- - -In **your terminal (not a bash cell)**, run the following. - -[source,bash] ----- -module use /scratch/brown/kamstut/tdm/opt/modulefiles -module load poetry/1.1.10 -cd $HOME/f2021-stat39000-project9 -poetry run uvicorn app.main:app --reload --port 1728 ----- - -Upon success, you should see some output similar to: - -.Output ----- -INFO: Will watch for changes in these directories: ['$HOME/f2021-stat39000-project9'] -INFO: Uvicorn running on http://127.0.0.1:1728 (Press CTRL+C to quit) -INFO: Started reloader process [25005] using watchgod -INFO: Started server process [25008] -INFO: Waiting for application startup. -INFO: Application startup complete. ----- - -Fantastic! Leave that running in your terminal, and test it out with the following request in a regular Python cell in your notebook. - -[CAUTION] -==== -Make sure to replace 1728 with the port number you were assigned. -==== - -[source,python] ----- -import requests -resp = requests.get("http://localhost:1728") -print(resp.json()) ----- - -You should receive a Hello World message, great! - -[TIP] -==== -Throughout this project, be patient waiting for your requests to complete -- sometimes they take a while. 
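One optional safeguard is to pass a `timeout` (in seconds) to `requests.get`, so that a hung request raises an error instead of blocking your notebook indefinitely. For example (remember to swap in your assigned port):

[source,python]
----
import requests

# raises requests.exceptions.Timeout if no response arrives within 30 seconds
resp = requests.get("http://localhost:1728", timeout=30)
print(resp.json())
----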
If it is taking too long, you can always try killing the server. To do so, open the terminal tab and hold ctrl and press c. This will kill the server. Once killed, just restart it using the same command you used previously to start it. - -Finally, there are now 2 places to check for errors and print statements: the terminal and the notebook. When you get an error be sure to check both for useful clues! Keep in mind that you only need to modify 3 files: `main.py`, `queries.sql`, and `imdb.py` (plus making the requests in your notebook). Don't worry about any of the other files, but feel free to look around if you want! -==== - -[TIP] -==== -Please test the requests in your notebook with the code we provide you. We've tested them and know that they work. If you choose to test them with a different movie/tv show/etc., you could get unexpected errors related to our `schemas.py` file -- best just to stick to the requests we provide. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Okay, so the goal of the next 4 or so questions is to put together the following API endpoints, that return simple JSON responses, with the desired data. You can almost think of this as one big fancy interface to return data from our database in JSON format -- that _is_ pretty much what it is! BUT we have the capability to do nice data-processing on the data _before_ it is returned, which can be difficult using _just_ SQL. - -The following are a list of endpoints that we _already_ have implemented for you, to help get you started. - -- `http://localhost:1728/movies/{title_id}` - -[NOTE] -==== -Here the `{title_id}` portion represents a _path parameter_. https://stackoverflow.com/questions/30967822/when-do-i-use-path-params-vs-query-params-in-a-restful-api[Here] is a good discussion on when you should choose to design your API with a path parameter vs. a query parameter. The top answer is really good. - -To be very clear, the following would be an example making a request to the `/movies/{title_id}` endpoint. - -[source,python] ----- -import requests - -response = requests.get("http://localhost:1728/movies/tt0076759") -print(response.json()) ----- -==== - -The following are a list of endpoints we want _you_ to build! - -- `http://localhost:1728/cast/{title_id}` -- `http://localhost:1728/tv/{title_id}` -- `http://localhost:1728/tv/{title_id}/seasons/{season_number}/episodes/{episode_number}` -- `http://localhost:1728/tv/{title_id}/seasons/{season_number}/episodes` (optional) - -The following are a list of endpoints that we will provide you in project 10. - -- `http://localhost:1728/tv/{title_id}/seasons` -- `http://localhost:1728/tv/{title_id}/seasons/{season_number}` - -This will be a very guided project, so please be sure to read the instructions carefully, and as you are working, use your imagination to imagine what other cool potential and possibilities building APIs can have! We are only scratching the surface here! - -Okay, let's get started with the first endpoint. - -- `http://localhost:1728/cast/{title_id}` - -Implement this endpoint. What files do you need to modify? - -- Add the following function to `main.py` -+ -[source,python] ----- -@app.get( - "/cast/{title_id}", - response_model=list[CrewMember], - summary="Get the crew for a title_id.", - response_description="A crew." 
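    # response_model declares the shape of the response (a JSON list of
    # CrewMember objects); fastapi uses it for serialization and for the
    # automatically generated docs you saw in the earlier screenshots.
    # summary and response_description only affect those docs pages -- they
    # do not change how the endpoint behaves.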
-) -async def get_cast(title_id: str): - cast = get_cast_for_title(title_id) - return cast ----- -+ -- Add the following query to `queries.sql`, filling in the query -+ ----- --- name: get_cast_for_title --- Get the cast for a given title -SELECT statement here ----- -+ -[IMPORTANT] -==== -Make sure you don't add the carrot "^" to the end of this particular query. Otherwise, it will only return 1 result. -==== -+ -[TIP] -==== -In your `queries.sql` file, anything starting with a colon is a placeholder for a variable you will pass along. Check out the `imdb.py` file and the `queries.sql` file to better understand. -==== -+ -- In your `imdb.py` mondule, fill out the skeleton function called `get_cast_for_title`, that returns a list of `CrewMember` objects. -+ -[TIP] -==== -Here is the function you can finish writing: - -[source,python] ----- -def get_cast_for_title(title_id: str) -> list[CrewMember]: - # Get the cast for the movie, and close the database connection - conn = sqlite3.connect(database_path) - results = queries.get_cast_for_title(conn, title_id = title_id) - conn.close() - - # Create a list of dictionaries, where each dictionary is a cast member - # INITIALIZE EMPTY LIST - for member in results: - crewmemberobject = CrewMember(**{key: member[i] for i, key in enumerate(CrewMember.__fields__.keys())}) - # APPEND crewmemberobject TO LIST - - return cast ----- -==== -+ -[TIP] -==== -Check out the `get_movie_with_id` function for help! It should just be a few small modifications. -==== - -To test your endpoint, run the following in a Python cell in your notebook. - -[source,python] ----- -import requests -resp = requests.get("http://localhost:1728/cast/tt0076759") -print(resp.json()) ----- - -.Output ----- -[{'title_id': 'tt0076759', 'person_id': 'nm0000027', 'category': 'actor', 'job': None, 'characters': '["Ben Obi-Wan Kenobi"]'}, {'title_id': 'tt0076759', 'person_id': 'nm0000148', 'category': 'actor', 'job': None, 'characters': '["Han Solo"]'}, {'title_id': 'tt0076759', 'person_id': 'nm0000184', 'category': 'director', 'job': None, 'characters': '\\N'}, {'title_id': 'tt0076759', 'person_id': 'nm0000402', 'category': 'actress', 'job': None, 'characters': '["Princess Leia Organa"]'}, {'title_id': 'tt0076759', 'person_id': 'nm0000434', 'category': 'actor', 'job': None, 'characters': '["Luke Skywalker"]'}, {'title_id': 'tt0076759', 'person_id': 'nm0002354', 'category': 'composer', 'job': None, 'characters': '\\N'}, {'title_id': 'tt0076759', 'person_id': 'nm0156816', 'category': 'editor', 'job': 'film editor', 'characters': '\\N'}, {'title_id': 'tt0076759', 'person_id': 'nm0476030', 'category': 'producer', 'job': 'producer', 'characters': '\\N'}, {'title_id': 'tt0076759', 'person_id': 'nm0564768', 'category': 'producer', 'job': 'producer', 'characters': '\\N'}, {'title_id': 'tt0076759', 'person_id': 'nm0852405', 'category': 'cinematographer', 'job': 'director of photography', 'characters': '\\N'}] ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Implement the following endpoint. - -- `http://localhost:1728/tv/{title_id}` - -For this question, we will leave it up to you to figure out what files to modify in what ways. - -[TIP] -==== -Check out the functions that are already implemented for help -- it will be _very_ similar! If you get an error at any step of the way, _read_ the errors -- they tell you what is missing 90% of the time -- or at least hint at it! 
- -We've provided you with skeleton functions (with comments) in `imdb.py` that you can use to get started (just fill them in). -==== - -[NOTE] -==== -One of the cool things that make APIs so useful is how easy it is to share data in a structured way with others! While there is typically a bit more setup to expose the API to the public -- it is really easy to share with other people on the same system. If you and your friend were on the same node, for example, `brown-a013`, your friend could make calls to your API too! -==== - -To test your endpoint, run the following in a Python cell in your notebook. - -[source,python] ----- -import requests -resp = requests.get("http://localhost:1728/tv/tt5180504") -print(resp.json()) ----- - -Which, should return the following: - -.Output ----- -{'title_id': 'tt5180504', 'type': 'tvSeries', 'primary_title': 'The Witcher', 'original_title': 'The Witcher', 'is_adult': False, 'premiered': 2019, 'ended': None, 'runtime_minutes': 60, 'genres': [{'genre': 'Action'}, {'genre': 'Adventure'}, {'genre': 'Fantasy'}]} ----- - -And also test with the following: - -[source,python] ----- -import requests -resp = requests.get("http://localhost:1728/tv/tt2953050") -print(resp.json()) ----- - -Which, should return the following: - -.Output ----- -{'detail': "Title with title_id 'tt2953050' is not a tv series, it is a movie."} ----- - -Similarly: - -[source,python] ----- -import requests - -response = requests.get("http://localhost:1728/tv/tt8343770") -print(response.json()) ----- - -Which, should return the following: - -.Output ----- -{'detail': "Title with title_id 'tt8343770' is not a tv series, it is a tvEpisode."} ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Implement the following endpoint. - -- `http://localhost:1728/tv/{title_id}/seasons/{season_number}/episodes/{episode_number}` - -Okay, don't be overwhelmed! There are only 3 files to modify and add code to: `main.py`, `queries.sql`, and `imdb.py`. Aside from that, you are just making `requests` library calls to test out the API (from within your notebook). - -[TIP] -==== -We've provided you with the following queries (in `queries.sql`): - ----- --- name: get_title_type$ --- Get the type of title, movie, tvSeries, etc. - --- name: get_seasons_in_show$ --- Get the number of seasons in a show - --- name: get_episodes_in_season$ --- Get the number of episodes in a season for a given title with given title_id - --- name: get_episode_for_title_season_number_episode_number^ --- Get the episode title info for the title_id, season number and episode number ----- - -[TIP] -==== -- Use the `get_title_type` query to check if the type is not `tvSeries`. -- Use the `get_seasons_in_show` query to check if the provided `season_number` is valid. For example it must be a positive number and less than or equal to the number of seasons actually in the given show. -- Use the `get_episodes_in_season` query to check if the provided `episode_number` is valid. For example it must be a positive number and less than or equal to the number of episodes actually in the given season. -==== - -All of these queries should be called in your `get_show_for_title_season_and_episode` function in `imdb.py`. We've provided you with skeleton code with comments to help -- just fill it in! 
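In case you are wondering where the `{'detail': ...}` error messages in the expected outputs below come from, they are typically produced by raising `HTTPException`. Here is a tiny, self-contained illustration -- the route, status code, and numbers are made up, and your skeleton code may handle errors somewhat differently.

[source,python]
----
from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.get("/demo/{season_number}")
async def demo(season_number: int):
    # raising HTTPException is what produces a JSON body like {'detail': '...'}
    if season_number > 4:
        raise HTTPException(
            status_code=404,
            detail=f"There are only 4 seasons for this show, you requested information about season {season_number}.",
        )
    return {"season_number": season_number}
----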
- -Finally, you should make a `get_episode` function in `main.py`, with the following signature: - -[source,python] ----- -async def get_episode(title_id: str, season_number: int, episode_number: int): ----- -==== - -To test your endpoint, run the following in cells in your notebook. - -[source,python] ----- -import requests - -response = requests.get("http://localhost:1728/tv/tt1475582/seasons/1/episodes/2") -print(response.json()) ----- - -.Output ----- -{'title_id': 'tt1664529', 'type': 'tvEpisode', 'primary_title': 'The Blind Banker', 'original_title': 'The Blind Banker', 'is_adult': False, 'premiered': 2010, 'ended': None, 'runtime_minutes': 89, 'genres': [{'genre': 'Crime'}, {'genre': 'Drama'}, {'genre': 'Mystery'}]} ----- - -Also: - -[source,python] ----- -import requests - -response = requests.get("http://localhost:1728/tv/tt1664529/seasons/1/episodes/2") -print(response.json()) ----- - -.Output ----- -{'detail': "Title with title_id 'tt1664529' is not a tv series, it is a tvEpisode."} ----- - -Also: - -[source,python] ----- -import requests - -response = requests.get("http://localhost:1728/tv/tt1475582/seasons/1/episodes/7") -print(response.json()) ----- - -And because there is no episode 7: - -.Output ----- -{'detail': 'Season 1 only 4 episodes and you requested episode 7.'} ----- - -Also: - -[source,python] ----- -import requests - -response = requests.get("http://localhost:1728/tv/tt1475582/seasons/5/episodes/7") -print(response.json()) ----- - -And because there is no season 5: - -.Output ----- -{'detail': 'There are only 4 seasons for this show, you requested information about season 5.'} ----- - -[NOTE] -==== -Note that this error takes precedence over the fact that there are only 4 episodes and we requested info for episode 7. -==== - -[WARNING] -==== -For this project you should submit the following files: - -- `firstname-lastname-project09.ipynb` with output from making the requests to your API. -- `main.py` -- `queries.sql` -- `imdb.py` -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 (Optional, 0 pts) - -Implement the following endpoint. - -- `http://localhost:1728/tv/{title_id}/seasons/{season_number}/episodes` - -To test your endpoint, run the following in a Python cell in your notebook. 
- -[source,python] ----- -import requests - -response = requests.get("http://localhost:1728/tv/tt1475582/seasons/1/episodes") -print(response.json()) ----- - -.Output ----- -[{'title_id': 'tt1664529', 'type': 'tvEpisode', 'primary_title': 'The Blind Banker', 'original_title': 'The Blind Banker', 'is_adult': False, 'premiered': 2010, 'ended': None, 'runtime_minutes': 89, 'genres': [{'genre': 'Crime'}, {'genre': 'Drama'}, {'genre': 'Mystery'}]}, {'title_id': 'tt1664530', 'type': 'tvEpisode', 'primary_title': 'The Great Game', 'original_title': 'The Great Game', 'is_adult': False, 'premiered': 2010, 'ended': None, 'runtime_minutes': 89, 'genres': [{'genre': 'Crime'}, {'genre': 'Drama'}, {'genre': 'Mystery'}]}, {'title_id': 'tt1665071', 'type': 'tvEpisode', 'primary_title': 'A Study in Pink', 'original_title': 'A Study in Pink', 'is_adult': False, 'premiered': 2010, 'ended': None, 'runtime_minutes': 88, 'genres': [{'genre': 'Crime'}, {'genre': 'Drama'}, {'genre': 'Mystery'}]}, {'title_id': 'tt1815240', 'type': 'tvEpisode', 'primary_title': 'Unaired Pilot', 'original_title': 'Unaired Pilot', 'is_adult': False, 'premiered': 2010, 'ended': None, 'runtime_minutes': 55, 'genres': [{'genre': 'Crime'}, {'genre': 'Drama'}, {'genre': 'Mystery'}]}] ----- - -And of course, continue to have the regular errors we've had so far: - -[source,python] ----- -import requests - -response = requests.get("http://localhost:1728/tv/tt1475582/seasons/5/episodes") -print(response.json()) ----- - -.Output ----- -{'detail': 'There are only 4 seasons for this show, you requested information about season 5.'} ----- - -And - -[source,python] ----- -import requests - -response = requests.get("http://localhost:1728/tv/tt1664529/seasons/5/episodes") -print(response.json()) ----- - -.Output ----- -{'detail': "Title with title_id 'tt1664529' is not a tv series, it is a tvEpisode."} ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project10.adoc deleted file mode 100644 index 44a023360..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project10.adoc +++ /dev/null @@ -1,350 +0,0 @@ -= STAT 39000: Project 10 -- Fall 2021 - -**Motivation:** One of the primary ways to get and interact with data today is via APIs. APIs provide a way to access data and functionality from other applications. There are 3 very popular types of APIs that you will likely encounter in your work: RESTful APIs, GraphQL APIs, and gRPC APIs. We will address some pros and cons of each, with a focus on the most ubiquitous, RESTful APIs. - -**Context:** This is the third in a series of 4 projects focused around APIs. We will learn some basics about interacting and using APIs, and even build our own API. - -**Scope:** Python, APIs, requests, fastapi - -.Learning Objectives -**** -- Understand and use the HTTP methods with the `requests` library. -- Differentiate between graphql, REST APIs, and gRPC. 
-- Write REST APIs using the `fastapi` library to deliver data and functionality to a client. -- Identify the various components of a URL. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/movies_and_tv/imdb.db` - -In addition, the following is an illustration of the database to help you understand the data. - -image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"] - -== Questions - -=== Question 1 - -Begin this project by cloning our repo and installing the required packages. To do so, run the following in a bash cell. - -[IMPORTANT] -==== -This repository -- TheDataMine/f2021-stat39000-project10 -- is a refreshed version of project (9). We've added some more functionality, but that is about it. Since it contains the solutions to project (9), it will be released sometime on Saturday, November 6th, and at the latest, on Monday, November 8th. - -Until that time, you are more than welcome to use the solutions to your project (9) as a starting point for this project. -==== - -[source,ipython] ----- -%%bash - -cd $HOME -git clone git@github.com:TheDataMine/f2021-stat39000-project10.git ----- - -Then, to install the required packages, run the following in a bash cell. - -[source,ipython] ----- -%%bash - -module unload python/f2021-s2022-py3.9.6 -cd $HOME/f2021-stat39000-project10 -poetry install ----- - -Next, let's identify a port that we can run our API on. In a bash cell, run the following. - -[source,ipython] ----- -%%bash - -port ----- - -You will get a port number, like the following, for example. - -.Output ----- -1728 ----- - -From this point on, when we mention the port 1728, please replace it with the port number you were assigned. Open a new terminal tab so that we can run our API, alongside our notebook. - -Next, you'll need to add a `.env` file to your `f2021-stat39000-project10` directory, with the following content. (Pretty much just like the previous project!) - ----- -DATABASE_PATH=/depot/datamine/data/movies_and_tv/imdb.db ----- - -In your terminal, run the following. - -[source,bash] ----- -module use /scratch/brown/kamstut/tdm/opt/modulefiles -module load poetry/1.1.10 -cd $HOME/f2021-stat39000-project10 -poetry run uvicorn app.main:app --reload --port 1728 ----- - -Upon success, you should see some output similar to: - -.Output ----- -INFO: Will watch for changes in these directories: ['$HOME/f2021-stat39000-project9'] -INFO: Uvicorn running on http://127.0.0.1:1728 (Press CTRL+C to quit) -INFO: Started reloader process [25005] using watchgod -INFO: Started server process [25008] -INFO: Waiting for application startup. -INFO: Application startup complete. ----- - -Fantastic! Leave that running in your terminal, and test it out with the following request in a regular Python cell in your notebook. - -[source,python] ----- -import requests -my_headers = {'accept': 'application/json'} -resp = requests.get("http://localhost:1728", headers=my_headers) -print(resp.json()) ----- - -You should receive a Hello World message, great! - -[TIP] -==== -Throughout this project, be patient waiting for your requests to complete -- sometimes they take a while. If it is taking too long, you can always try killing the server. 
To do so, open the terminal tab and hold ctrl and press c. This will kill the server. Once killed, just restart it using the same command you used previously to start it. - -Finally, there are now 2 places to check for errors and print statements: the terminal and the notebook. When you get an error be sure to check both for useful clues! -==== - -[TIP] -==== -Please test the requests in your notebook with the code we provide you. We've tested them and know that they work. If you choose to test them with a different movie/tv show/etc., you could get unexpected errors related to our `schemas.py` file -- best just to stick to the requests we provide. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -So you've written an API, now what? Well, while an API can have a variety of uses, one of the most common uses is as a _backend_ for a web application. Modern websites typically have a _frontend_ and _backend_. The frontend makes _requests_ to the backend, and the backend responds with _data_ to the frontend. The frontend then displays the data. This architecture makes it easy for developers to work independently on frontend things and backend things without have to understand every detail of the other "side" of the application. - -While frequently some sort of javascript framework is used for a frontend (things like reactjs, vuejs, angularjs, etc.), we can use Python and fastapi to create a super simple frontend! - -To get started, let's define something (just for clarity, these aren't real terms). Let's call a _backend_ request a request made with the `requests` package. This would be any request where we want the JSON formatted data as our response. Let's call a _frontend_ request a request made by a browser, or something similar. This would be any request where we want to use the data, but maybe display it using HTML, instead of JSON. - -The following is an example of a _backend_ request. - -[source,python] ----- -import requests -my_headers = {'accept': 'application/json'} -resp = requests.get("http://localhost:1728", headers=my_headers) -print(resp.json()) ----- - -.Output ----- -{'hello_item': 'hello', 'world_item': 'world'} ----- - -The following is an example of a _frontend_ request. - -[source,python] ----- -from IPython.core.display import display, HTML -my_headers = {'accept': 'application/html'} -resp = requests.get("http://localhost:1728", headers=my_headers) -display(HTML(resp.text)) ----- - -Where the output will be formatted HTML -- just like you'd see in a browser. - -[NOTE] -==== -We _wanted_ you to be able to just type the URLs in a browser to see the results of our frontend requests, but unfortunately, this is the best we can do for now. We are emulating a frontend request by setting the accept head to `application/html`. This is a bit of a hack, but it works. -==== - -Okay, now, maybe you are asking yourself -- but the two requests have the same url, `http://localhost:1728`, why don't we get the same response for both? - -The answer is that we are using the `accept` header to try and determine if the request is being made from a browser, or from something like the `requests` package. Check out the `root` function in the `main.py` module. - -We first get the header from the `request` object: - -[source,python] ----- -accept = request.headers.get("accept") ----- - -If the header is `application/json`, then we know that the user wants to have JSON output, not HTML. 
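As a rough sketch of the idea (this is _not_ the exact code in `main.py`, which renders real templates -- it is only meant to show the branching on the `accept` header):

[source,python]
----
from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse, JSONResponse

app = FastAPI()

@app.get("/")
async def root(request: Request):
    accept = request.headers.get("accept")

    # a "backend" request explicitly asks for JSON
    if accept == "application/json":
        return JSONResponse({"hello_item": "hello", "world_item": "world"})

    # anything else (browsers send a long, comma-separated accept header)
    # is treated as a "frontend" request and receives HTML instead
    return HTMLResponse("<h1>hello world</h1>")
----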
If the header is `application/html`, or if the header has multiple values separated by commas, then we assume that the user is a browser or someone making a frontend request. - -Why is any of this important? Well, wouldn't it be cool if we could type: `http://localhost:1728/movies/tt0076759` into a browser and get our data formatted into a webpage? But then, at the same time, use the exact same endpoint to get the data formatted as JSON, in case we wanted to use the API with some program we are writing? Thats what this trick allows us to do! - -[IMPORTANT] -==== -For this question, make sure to just run the "frontend" and "backend" requests in your notebook (provided above). Other than that, just try and do your best to understand what is happening in the `root` function. That's it! -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -The goal of this question (and the following questions) use our templating engine/Python package called `jinja2` to render webpages for the requests we built in the previous project. To get you started, we've provided HTML templates in the `templates` directory. These templates currently just contain boilerplate HTML structure that you will add to so our data is rendered neatly(ish). - -[IMPORTANT] -==== -At this point in time you are probably feeling overwhelmed and not understanding what is going on -- that is okay, it will start to make more sense as you mess around with things. If it is any consolation -- you will **not** be writing _any_ Python code today! You'll just be using the `jinja2` package within our HTML templates. There is a small learning curve, but I will provide examples with the questions, so you can see the syntax. -==== - -Let's start with the following webpage: - -- `http://localhost:1728/movies/{title_id}` - -To make the "frontend" request, run the following in a cell. - -[source,python] ----- -from IPython.core.display import display, HTML -my_headers = {'accept': 'application/html'} -resp = requests.get("http://localhost:1728/movies/tt0076759", headers=my_headers) -display(HTML(resp.text)) ----- - -We've set the template up to provide you with an example of a loop (see the genres section in `movie.html`), and some examples of simple data access. There are some missing pieces of information we want you to add (information in the "Facts:" section)! Please add the missing fields to the HTML template, and make a new frontend request. The results should look like the following: - -image::figure26.webp[Expected output for question 3, width=792, height=500, loading=lazy, title="Expected output for question 3"] - -To remind yourself what the JSON response for this request looks like run the following in a cell. - -[source,python] ----- -import requests -my_headers = {'accept': 'application/json'} -resp = requests.get("http://localhost:1728/movies/tt0076759", headers=my_headers) -print(resp.json()) ----- - -We pass the entire `Movie` object to `jinja2`, so everything you see in the JSON response, we can access and embed in the HTML template. Notice in the `main.py` file how we are returning a single `Title` object. If you look in `schemas.py`, you can see all of the attributes of the `Title` object that you can access using dot notation. The variable itself, is named `movie` since the object we return in the `get_movies` function in `main.py` is named `movie`. So, in our template, we can access the primary title, for example using `movie.primary_title`. 
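If you'd like to experiment with this syntax outside of the app first, you can render a tiny template directly with `jinja2` in a notebook cell. The values below are made up purely for illustration:

[source,python]
----
from jinja2 import Template

# stand-in data, just to see how dot access works inside a template
fake_movie = {"primary_title": "Some Movie", "premiered": 1999}

template = Template("<h1>{{ movie.primary_title }} ({{ movie.premiered }})</h1>")
print(template.render(movie=fake_movie))  # <h1>Some Movie (1999)</h1>
----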
We can also access any other variable that exists in the `Title` class shown in `schemas.py` in the same way! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Let's say that we only like movies that premiered after 1990 (inclusive). Any other movie, we want to make the `h1` header bright red for "not going to watch _that_". Could we do that? Yes! - -[TIP] -==== -To change the text color of an `h1` element, see https://www.w3schools.com/html/html_styles.asp[this link]. -==== - -Update the `movie.html` template to do this. Check out the examples https://jinja.palletsprojects.com/en/2.10.x/templates/#if[here]. - -To test your work, run the following two chunks of code. The first should display in red, the second should not. - -[source,python] ----- -from IPython.core.display import display, HTML -my_headers = {'accept': 'application/html'} -resp = requests.get("http://localhost:1728/movies/tt0076759", headers=my_headers) -display(HTML(resp.text)) ----- - -[source,python] ----- -from IPython.core.display import display, HTML -my_headers = {'accept': 'application/html'} -resp = requests.get("http://localhost:1728/movies/tt7401588", headers=my_headers) -display(HTML(resp.text)) ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Okay, great! Now we have a cool page for any movie we want to look up. Read about HTML tables https://www.w3schools.com/html/html_tables.asp[here]. - -Modify the `episodes.html` template in the `templates` directory to display the following information in a neatly formatted table _with_ a header row: `title_id`, `primary_title`, `is_adult`, `premiered`, and `runtime_minutes`. - -Rather than displaying `True` or `False` for the `is_adult` field, instead display the text `Yes` or `No`. - -[TIP] -==== -Use conditionals in `jinja2` to display the text `Yes` or `No` for the `is_adult` field. -==== - -[TIP] -==== -Check out the `get_episodes` function in `main.py` to see how we are returning a list of `Title` objects (that represent episodes). Note that the _name_ of the variable sent to the template is `episodes`, which is a _list_ of episodes. Use the name `episodes` in your template to access the data. -==== - -[TIP] -==== -Remember, while working in your template, `episodes.html`, you can access the _list_ of `Title` objects using the name `episodes`. With that being said, **be careful** -- you don't want to try `episodes.primary_title` or `episodes.is_adult`, because that will try to access the `primary_title` and `is_adult` fields of the `Title` object, which you don't want to do, because `episodes` is a **list** of `Title` objects, not a single `Title` object. - -Therefore, you should use a loop to access each individual `Title` object in the `episodes` list. -==== - -To take a look at the list of `Title` objects returned by the `get_episodes` function, in JSON format, run the following in a cell. - -[source,python] ----- -import requests -my_headers = {'accept': 'application/json'} -resp = requests.get("http://localhost:1728/tv/tt1475582/seasons/1/episodes", headers=my_headers) -print(resp.json()) ----- - -To test your work, run the following in a cell. 
- -[source,python] ----- -from IPython.core.display import display, HTML -my_headers = {'accept': 'application/html'} -resp = requests.get("http://localhost:1728/tv/tt1475582/seasons/1/episodes", headers=my_headers) -display(HTML(resp.text)) ----- - -The output should look like the following: - -image::figure27.webp[Expected results question 5, width=792, height=500, loading=lazy, title="Expected results question 5"] - -[WARNING] -==== -For this project you should submit the following files: - -- `firstname-lastname-project10.ipynb` with output from making the requests to your API. -- `movie.html` -- `episodes.html` -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project11.adoc deleted file mode 100644 index 25a89bdc7..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project11.adoc +++ /dev/null @@ -1,112 +0,0 @@ -= STAT 39000: Project 11-- Fall 2021 - -**Motivation:** One of the primary ways to get and interact with data today is via APIs. APIs provide a way to access data and functionality from other applications. There are 3 very popular types of APIs that you will likely encounter in your work: RESTful APIs, GraphQL APIs, and gRPC APIs. We will address some pros and cons of each, with a focus on the most ubiquitous, RESTful APIs. - -**Context:** This is the fourth in a series of 4 projects focused around APIs. At this point in time there will be varying levels of understanding of APIs, how to use them, and how to write them. One of the "coolest" parts about APIs is how flexible they are. It is kind of like a website, the limitations are close to what you can imagine. Every once in a while we like to write projects that are open ended and allow you to do whatever you want within certain guidelines. This will be such a project. - -**Scope:** Python, APIs, requests, fastapi - -.Learning Objectives -**** -- Understand and use the HTTP methods with the `requests` library. -- Differentiate between graphql, REST APIs, and gRPC. -- Write REST APIs using the `fastapi` library to deliver data and functionality to a client. -- Identify the various components of a URL. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/**` - -You are free to use any dataset(s) you wish for this project. The only requirement is that there is _some_ data-oriented component to the API you build, and that there is a way for anyone (in the course) to access the data. So easily downloadable datasets, datasets in the data depot, web scraping, etc., are all acceptable. - -== Questions - -=== Overview - -At a high level, this project has 3 parts. - -. Write an API that does _something_ with data. -. 
Provide a series of images, or a video/screen recording, that demonstrates what your API does. -. If you chose to provide a series of images, provide text that explains what the images are showing, and how your API behaves. If you chose to provide a video/screen recording, a verbal explanation can be used in lieu of text. - - -If you choose to illustrate your API with **images**: - -.Items to submit -==== -- A Jupyter notebook with images and explanations of what you are showing in the images, and what your API is doing. Feel free to write about any struggles you ran into, how you fixed them (or why you couldn't figure out _how_ to fix them -- that is perfectly OK if that happens!). -==== - -If you choose to illustrate your API with **a video**: - -.Items to submit -==== -- A video showing off your API and explaining what it does. Feel free to write about any struggles you ran into, how you fixed them (or why you couldn't figure out _how_ to fix them -- that is perfectly OK if that happens!). -- If your video doesn't contain audio, include an `explain.txt` file with a written explanation of what your video is showing. -==== - -=== Part 1 - -_Write an API that does **something** with data._ - -**Requirements:** - -. The _something_ your API does must be non-trivial. In other words, don't _just_ regurgitate the data from the dataset. Wrangle the data, recombine it in a useful way, transform it into a graphic, summarize it, etc. -. Just put 1 project's worth of _effort_ into your API. This will vary from student to student, but just show us some effort. We aren't looking for APIs that are perfect, or (anywhere near) as complicated as the previous projects you've worked on -- they can be _much_ simpler -- especially since you are putting it together from (basically) scratch! - -The open-ended nature of this project may frustrate some of you, so we will provide some ideas below that would be accepted for full credit. - -- Build on and add new features to an API from a previous project. -- Use a feature of fastapi that you haven't seen before. For example, something like https://github.com/TheDataMine/fastapidemo[this] would be _more_ than enough. (Building on that demo is perfectly acceptable to do for this project too.) Other ideas could be using websockets (using fastapi), GraphQL (using fastapi), a form that does something when you submit it, etc. (these are all _way_ more than we expect from you). -- Incorporate other skills you've learned previously (like scraping data, for instance) into your API. -- You could write an API that scrapes the-examples-book.com and gives you the link to the newest 190/290/390 project (or something like that). -- You could write a https://fastapi.tiangolo.com/tutorial/middleware/[middleware] that does something with the request and response for one of our previous APIs. -- You could write an API that scrapes data from https://purdue.edu/directory and returns something. -- You could write an API that returns a random "The Office" quote using a dataset in the data depot (this is an example of about the minimum we would expect from your API -- see the sketch below). - -Have fun, be creative, and know that we understand it is a stressful time and we will be lenient and forgiving with grading. This is about trying something new and maybe having some fun and incorporating your own interests into the project. _Please_ feel 100% free to use any of the previous projects as a starting point for your code -- we will _not_ consider that "copying" at all. 
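To give a concrete sense of scale for that last idea, here is a rough sketch of about that much effort. The CSV path and column names below are placeholders -- point them at whatever dataset you actually choose.

[source,python]
----
import pandas as pd
from fastapi import FastAPI

app = FastAPI()

# hypothetical dataset -- swap in the real path and columns for your data
quotes = pd.read_csv("/depot/datamine/data/some_dataset/quotes.csv")

@app.get("/quote")
async def random_quote():
    # pick one random row and return it as a small JSON object
    row = quotes.sample(1).iloc[0]
    return {"character": row["character"], "quote": row["quote"]}
----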
- -=== Part 2 - -_Provide a series of images, or a video/screen recording, that demonstrates what your API does._ - -If you choose to use images. Submit a Jupyter notebook with images, followed by text, explaining what is in the images. As a reminder, you can insert an image using markdown, as follows. - -[source,ipython] ----- -%%markdown - -![](/absolute/path/to/image.png) ----- - -Again, this doesn't need to be perfect, just add enough details so we can get a good idea of what you created. - -If you choose to do a screen recording, please add voiceover so you can explain what you are doing while you are doing it. Alternatively, feel free to have a silent video, but please also submit a `explain.txt` file with a verbal explanation of what your API does. - -The final requirement for **both** the video and image choices are to include a portion where you dig into a critical piece of your code and explain what it does. This is just so we see some of your code and show us you understand it. - -[TIP] -==== -On a mac, an easy way to take a screen recording is to type kbd:[Ctrl + Cmd + 5], and then click on the record screen button option on the lower part of your screen. When you want to stop recording push the stop button in the menubar at the top of your screen where the date and time is shown. -==== - -[TIP] -==== -On a windows machine, https://www.laptopmag.com/articles/how-to-video-screen-capture-windows-10[here] are some directions. -==== - -=== Part 3 - -_If, you chose to provide a series of images, provide text that explains what the images are showing, and how your API behaves. If you chose to provide a video/screen recording, a verbal explanation can be used in lieu of text._ - -This was explained in part (2), however, we are reiterating it here. - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project12.adoc deleted file mode 100644 index 8f0fd73a3..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project12.adoc +++ /dev/null @@ -1,351 +0,0 @@ -= STAT 39000: Project 12 -- Fall 2021 - -**Motivation:** Containers are a modern solution to packaging and shipping some sort of code in a reproducible and portable way. When dealing with R and Python code in industry, it is highly likely that you will eventually have a need to work with Docker, or some other container-based solution. It is best to learn the basics so the basic concepts aren't completely foreign to you. - -**Context:** This is the first project in a 2 project series where we learn about containers, and one of the most popular container-based solutions, Docker. - -**Scope:** Docker, unix, Python - -.Learning Objectives -**** -- Understand the various components involved with containers: Dockerfile/build file, container image, container registry, etc. -- Understand how to push and pull images to and from a container registry. -- Understand the basic Dockerfile instructions. -- Understand how to build a container image. -- Understand how to run a container image. 
-**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -First thing first. Please read https://www.padok.fr/en/blog/container-docker-oci?utm_source=pocket_mylist[this fantastic article] for a great introduction to containers. Afterwards, please review the content we have available xref:containers:index.adoc[here]. - -In this project, we have a special challenge. Brown does _not_ have Docker installed. This is due to a variety of reasons. Brown _does_ have a tool called Singularity installed, however, it is different enough from more common containerization tools, that it does not make sense to learn for your first "container" experience. - -To solve this issue, we've created a virtual machine that runs Ubuntu, and has Docker pre-installed and configured for you to use. To be clear, the majority of this project will revolve around the command line from within Jupyter Lab. We will specifically state the "deliverables" which will mainly be text or images that are copied and pasted in Markdown cells. - -Please login and launch a Jupyter Lab session. Create a new notebook to put your solutions, and open up a terminal window beside your notebook. - -In your terminal, navigate to `/depot/datamine/apps/qemu/scripts/`. You should find 4 scripts. They perform the following operations, respectively. - -. Copies our VM image from `/depot/datamine/apps/qemu/images/` to `/scratch/brown/$USER/`, so you each get to work on your _own_ (virtual) machine. -. Creates a SLURM job and provides you a shell to that job. The job will last 4 hours, provide you with 4 cores, and will have ~6GB of RAM. -. Runs the virtual machine in the background, in your SLURM job. -. SSH's into the virtual machine. - -Run the scripts in your Terminal, in order, from 1-4. - -[source,bash] ----- -cd /depot/datamine/apps/qemu/scripts/ -./1_copy_vm.sh ----- - -[source,bash] ----- -./2_grab_a_node.sh ----- - -[source,bash] ----- -./3_run_a_vm.sh ----- - -[IMPORTANT] -==== -You may need to press enter to free up the command line. -==== - -[source,bash] ----- -./4_connect_to_vm.sh ----- - -[IMPORTANT] -==== -You will eventually be asked for a password. Enter `thedatamine`. -==== - -[NOTE] -==== -Remember, to add an image or screenshot to a markdown cell, you can use the following syntax: - ----- -![](/home/kamstut/my_image.png) ----- -==== - -.Items to submit -==== -- A screenshot of your terminal window after running the 4 scripts. -==== - -=== Question 2 - -Awesome! Your terminal is now connected to an instance of Ubuntu with Docker already installed and configured for you! Now, let's get to work. - -First thing is first. Let's test out _pulling_ an image from the Docker Hub. `wernight/funbox` is a fun image to do some wacky things on a command line. Pull the image (https://hub.docker.com/r/wernight/funbox), and verify that the image is available on your system using `docker images`. - -Run the following to get an ascii aquarium. - -[source,bash] ----- -docker run -it wernight/funbox asciiquarium ----- - -Wow! That is wild! You can run this program on _any_ system where an OCI compliant runtime exists -- very cool! - -To quit the program, press kdb:[Ctrl + c]. - -For this question, submit a screenshot of the running asciiquarium program. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 3 - -Okay, that was fun, but let's do something a little bit more practical. Check out the `~/projects/whin` directory in your VM. You should pretty quickly realize that this is our version of the WHIN API that we used earlier on in project (8). - -If you recall, we had a lot of "extra" steps we had to take in order to run the API. We had to: - -- Install the Python dependencies. -- Activate the appropriate Python environment. -- Set the `DATABASE_PATH` environment variable. -- Remember some long and complicated command. - -This is a fantastic example of when _containerizing_ your app could be a great idea! - -Let's begin by writing our own Dockerfile. - -First thing is first. We want our image to contain the correct version of Python for our app. Our app requires at least Python 3.9. Let's see if we can find a _base_ image that has Python 3.9 or later. Google "Python docker image" and you will find the following link: https://hub.docker.com/_/python - -Here, we will find a wide variety of different "official" Python docker images. A great place to start. If you click on the "Tags" tab, you will be able to scroll through a wide variety of different versions of Python + operating systems. A great Linux distribution is Debian. - -[NOTE] -==== -Fun fact: Debian/the Debian project (one of the, if not _the_ most popular linux distribution) was founded by a Purdue alum, https://en.wikipedia.org/wiki/Ian_Murdock[Ian Murdock]. -==== - -Okay, let's go for the Python 3.9.9 + Bullseye (Debian) image. The tag for the image is `python:3.9.9-bullseye`. But wait a second. If you look at the space required for the base image -- it is _already_ up to 370 or so MB -- that is quite a bit! Maybe there is a lighter weight option? If you search for "slim" you will find an image with the tag `python:3.9.9-slim-bullseye` that takes up only 45 MB by default -- much better. - -Create a file called `Dockerfile` in the `~/projects/whin` directory. Use vim/emacs/nano to edit the file to look like this: - -.Dockerfile ----- -FROM python:3.9.9-slim-bullseye ----- - -Now, let's build our image. - -[source,bash] ----- -docker build -t whin:0.0.1 . ----- - -Once created, you should be able to view your image by running the following. - -[source,bash] ----- -docker images ----- - -Now, let's run our image. After running `docker images`, if you look under the `IMAGE` column, you should see an id for you image -- something like `3dk35bdl`. To run your image, do the following. - -[source,bash] ----- -docker run -dit 3dk35bdl ----- - -Be sure to replace `3dk35bdl` with the id of your image. Great! Your image should now be running. Find out by running the following. - -[source,bash] ----- -docker ps ----- - -Under the `NAMES` column, you will see the name of your running container -- very cool! How does this test out anything? Don't we want to see if we have Python 3.9 running like we want it to? Yes! Let's get a bash shell _inside_ our container. To do so run the following. - -[source,bash] ----- -docker exec -it suspicious_lumiere /bin/bash ----- - -Replace `suspicious_lumiere` with the name of your container. You should now be in a bash shell. Awesome! Run the following to see what version of Python we have installed. - -[source,bash] ----- -python --version ----- - -.Output ----- -Python 3.9.9 ----- - -Awesome! So far so good! To exit the container, type and run `exit`. Take a screenshot of your terminal after following these steps and add it to your notebook in a markdown cell. 
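If you prefer to double-check the version programmatically rather than eyeballing the output, an optional one-liner inside the container's `python` interpreter works too:

[source,python]
----
import sys

# True only if the interpreter is at least Python 3.9, which our app requires
print(sys.version_info >= (3, 9))
----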
- -To clean up and stop the container, run the following. - -[source,bash] ----- -docker stop suspicious_lumiere ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Okay, great! We have version 0.0.1 of our `whin` image. Great. - -Now let's make this thing useful. Use vim/emacs/nano to edit the `~/projects/whin/Dockerfile` to look like this: - -.Dockerfile ----- -FROM python:3.9.9-slim-bullseye - -WORKDIR /app - -RUN python -m pip install fastapi[all] pandas aiosql fastapi-responses cyksuid httpie - -COPY . . - -EXPOSE 21650 - -CMD ["uvicorn", "app.main:app", "--reload", "--port", "21650", "--host", "0.0.0.0"] ----- - -Here, do your best to explain what each line of code does. Build version 0.0.2 of your image, and run it. - -Okay, in theory, that last line _should_ run our API -- awesome! Let's check the logs to see if it is working. - -[source,bash] ----- -docker logs my_container_name ----- - -[TIP] -==== -Remember, to get your container name, run `docker ps` and look under the `NAME` column. -==== - -What you _should_ get is a Python error! Something about NoneType. Whoops! We forgot to include the `DATABASE_PATH` environment variable so our API knows where our WHIN database is. That is critical to our API. - -[TIP] -==== -https://docs.docker.com/engine/reference/builder/#env[This command] will be very useful to achieve this! -==== - -Modify our Dockerfile to include the `DATABASE_PATH` environment variable with a value `/home/tdm-user/projects/whin/whin.db`. Rebuild your image (as version 0.0.2), and run it. Check the logs again, does it appear to be working? - -.Items to submit -==== -- The fixed Dockerfile contents in a markdown cell as code (surrounded by 3 backticks). -- A screenshot (or more) of the terminal output from running the various commands. -==== - -=== Question 5 - -Okay, there is one step left. Let's see if the API is _really_ fully working by making a request to it. First, get a shell to the running container. - -[source,bash] ----- -docker exec -it container_name /bin/bash ----- - -[TIP] -==== -Remember, to get your `container_name` list the running containers using `docker ps`. -==== - -One inside the container, let's make a request to the API that is running. Run the following: - -[source,bash] ----- -python -m httpie localhost:21650 ----- - -If all is well you _should_ get: - -.Output ----- -HTTP/1.1 200 OK -content-length: 25 -content-type: application/json -date: Thu, 18 Nov 2021 20:28:47 GMT -server: uvicorn - -{ - "message": "Hello World" -} ----- - -Awesome! You can see our API is definitely working, cool! - -Okay, one final test. Let's exit the container and make a request to the API again. After all, it wouldn't be that useful if we had to essentially login to a container when we want to access an API running _in_ that container, would it? - -[source,bash] ----- -http localhost:21650 ----- - -Uh oh! Although our API is running smoothly _inside_ of the container, we have no way of accessing it _outside_ of the container. Remember, `EXPOSE` only _signals_ that we _want_ to expose that port, it doesn't actually do that for us. No worries, this can be easily fixed. - -[source,bash] ----- -docker run -dit -p 21650:21650 --name my_container_name 3kdgj024jn ----- - -[TIP] -==== -Here, we named the resulting container `my_container_name`. This is a cool trick if you get tired of running `docker ps` to get the name of a newly running container. 
-====
-
-Where `3kdgj024jn` is the id of your image. Now, let's try to access the API again.
-
-[source,bash]
-----
-http localhost:21650
-----
-
-Voila! It works!
-
-The following run statement is _similar_, but not quite equivalent. When you give `-p` only a single (container) port, Docker publishes container port 21650 to a _random_ available port on the host, so you would need to run `docker ps` to find out which host port to use.
-
-[source,bash]
-----
-docker run -dit -p 21650 --name my_container_name 3kdgj024jn
-----
-
-However, if the API _inside_ the container is using port 21650, but you want to expose it _outside_ the container on a different port, say, port 5555, you could run the following.
-
-[source,bash]
-----
-docker run -dit -p 5555:21650 --name my_container_name 3kdgj024jn
-----
-
-Then, you could access the API by running the following:
-
-[source,bash]
-----
-http localhost:5555
-----
-
-While our request goes to port 5555, once the request hits the container, it is routed to port 21650 inside the container, which is where our API is running. This can be confusing and may take some experimentation until you are comfortable with it.
-
-.Items to submit
-====
-- Screenshot(s) showing the input and output from the terminal.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project13.adoc
deleted file mode 100644
index 3d6c69c42..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-project13.adoc
+++ /dev/null
@@ -1,393 +0,0 @@
-= STAT 39000: Project 13 -- Fall 2021
-
-**Motivation:** Containers are a modern solution to packaging and shipping some sort of code in a reproducible and portable way. When dealing with R and Python code in industry, it is highly likely that you will eventually have a need to work with Docker, or some other container-based solution. It is best to learn the basics so the basic concepts aren't completely foreign to you.
-
-**Context:** This is the second project in a 2 project series where we learn about containers.
-
-**Scope:** unix, Docker, Python, R, Singularity
-
-.Learning Objectives
-****
-- Understand the various components involved with containers: Dockerfile/build file, container image, container registry, etc.
-- Understand how to push and pull images to and from a container registry.
-- Understand the basic Dockerfile instructions.
-- Understand how to build a container image.
-- Understand how to run a container image.
-- Use singularity to run a container image.
-- State the primary differences between Docker and Singularity.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Questions
-
-=== Question 1
-
-Containers solve a real problem. In this project, we are going to demonstrate a real-world example of code that isn't portable, and we will _fix_ it using containers.
-
-Check out the code (questions and solutions) in the https://thedatamine.github.io/the-examples-book/projects.html#p03-290[Fall 2020 STAT 29000 Project 3], and try to run the solution for question (4) in your Jupyter Notebook. You'll quickly notice that the code no longer works, _as-is_.
In this case it is (partly) due to incorrect paths for the Firefox executable as well as the Geckodriver executable. These changes occurred because we switched systems from Scholar to Brown. - -_What if_ we could create a container to run this function on any system with a OCI compliant engine and/or runtime? Let's try! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Okay, below is a modified version of the code from the previous question. All we have done is turned it into a script that would be run as follows: - -[source,bash] ----- -python get_price.py zip 47906 ----- - -Okay, here it is: - -.get_price.py -[source,python] ----- -import sys -import re -import os -import time -import argparse - -from selenium import webdriver -from selenium.webdriver.common.keys import Keys -from selenium.webdriver.firefox.options import Options -from selenium.common.exceptions import NoSuchElementException -from selenium.webdriver.common.by import By -from selenium.webdriver.firefox.service import Service - -def avg_house_cost(zip: str) -> float: - firefox_options = Options() - firefox_options.add_argument("window-size=1920,1080") - firefox_options.add_argument("--headless") # Headless mode means no GUI - firefox_options.add_argument("start-maximized") - firefox_options.add_argument("disable-infobars") - firefox_options.add_argument("--disable-extensions") - firefox_options.add_argument("--no-sandbox") - firefox_options.add_argument("--disable-dev-shm-usage") - firefox_options.binary_location = '/class/datamine/apps/firefox/firefox' - - service = Service('/class/datamine/apps/geckodriver', log_path=os.path.devnull) - - driver = webdriver.Firefox(options=firefox_options, service=service) - url = 'https://www.trulia.com/' - driver.get(url) - - search_input = driver.find_element(By.ID, "banner-search") - search_input.send_keys(zip) - search_input.send_keys(Keys.RETURN) - time.sleep(10) - - allbed_button = driver.find_element(By.XPATH, "//button[@data-testid='srp-xxl-bedrooms-filter-button']/ancestor::li") - allbed_button.click() - time.sleep(2) - - bed_button = driver.find_element(By.XPATH, "//button[contains(text(), '3+')]") - bed_button.click() - time.sleep(3) - - price_elements = driver.find_elements(By.XPATH, "(//ul[@data-testid='search-result-list-container'])[1]//div[@data-testid='property-price']") - prices = [int(re.sub("[^0-9]", "", e.text)) for e in price_elements] - - driver.quit() - - return sum(prices)/len(prices) - - -def main(): - parser = argparse.ArgumentParser() - - subparsers = parser.add_subparsers(help="possible commands", dest="command") - - zip_parser = subparsers.add_parser("zip", help="search by zipcode") - zip_parser.add_argument("zip_code", help="the zip code to search for") - - if len(sys.argv) == 1: - parser.print_help() - parser.exit() - - args = parser.parse_args() - - if args.command == "zip": - print(avg_house_cost(f'{args.zip_code}')) - - -if __name__ == '__main__': - main() ----- - -First thing is first, we need to launch and connect to our VM so we can create our Dockerfile and build our container image. - -If you have not already done so, please login and launch a Jupyter Lab session. Create a new notebook to put your solutions, and open up a terminal window beside your notebook. - -In your terminal, navigate to `/depot/datamine/apps/qemu/scripts/`. You should find 4 scripts. They perform the following operations, respectively. - -. 
Copies our VM image from `/depot/datamine/apps/qemu/images/` to `/scratch/brown/$USER/`, so you each get to work on your _own_ (virtual) machine. -. Creates a SLURM job and provides you a shell to that job. The job will last 4 hours, provide you with 4 cores, and will have ~6GB of RAM. -. Runs the virtual machine in the background, in your SLURM job. -. SSH's into the virtual machine. - -Run the scripts in your Terminal, in order, from 1-4. - -[source,bash] ----- -cd /depot/datamine/apps/qemu/scripts/ -./1_copy_vm.sh ----- - -[source,bash] ----- -./2_grab_a_node.sh ----- - -[source,bash] ----- -./3_run_a_vm.sh ----- - -[IMPORTANT] -==== -You may need to press enter to free up the command line. -==== - -[source,bash] ----- -./4_connect_to_vm.sh ----- - -[IMPORTANT] -==== -You will eventually be asked for a password. Enter `thedatamine`. -==== - -[NOTE] -==== -Remember, to add an image or screenshot to a markdown cell, you can use the following syntax: - ----- -![](/home/kamstut/my_image.png) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Create a new folder in your $HOME directory (_inside_ your VM) called `project13`. Inside the folder, place the `get_price.py` code into a file called `get_price.py`. Give the file execute permissions: - -[source,bash] ----- -chmod +x get_price.py ----- - -Great! Next, create a Dockerfile in the `project13` folder. The following is some _starter_ content for your Dockerfile. - -.Dockerfile ----- -FROM python:3.9.9-slim-bullseye <1> - -RUN apt update && apt install -y wget bzip2 firefox-esr <2> - -<3> - -RUN wget --output-document=geckodriver.tar.gz https://github.com/mozilla/geckodriver/releases/download/v0.30.0/geckodriver-v0.30.0-linux64.tar.gz && \ - tar -xvf geckodriver.tar.gz && \ - rm geckodriver.tar.gz && \ - chmod +x geckodriver <4> - -RUN python -m pip install selenium <5> - -<6> - -<7> - -<8> -<9> ----- - -<1> The first line should look familiar. This is just our base image that has Python3 fully locked and loaded and ready for us to use. - -<2> The second line installed 3 critical packages in our container. The first is `wget`, which we use to download compatible versions of Geckodriver. The second is `bzip2`, which we use to unzip the Geckodriver archives. The third is firefox, which is installed to `/usr/bin/firefox`. - -<3> Here, I want you to change the work directory to `/vendor`, so our Geckodriver binary lives directly in `/vendor/geckodriver`. - -<4> The next line downloads the Geckodriver program, and extracts it. - -<5> This line installed the `selenium` Python package which is needed for our `get_price.py` script. - -<6> Here, I want you to change the work directory to `/workspace` -- this way our `get_price.py` script will be copied in the `/workspace` directory. - -<7> Copy the `get_price.py` code into the `/workspace` directory. -+ -[CAUTION] -==== -You _may_ want to modify the script! There are two locations in the script: `/class/datamine/apps/firefox/firefox` as well as `/class/datamine/apps/geckodriver`. These _should_ be the location of the firefox executable and the geckodriver executable. Inside our container, however, these locations will be different! You will need to change the `/class/datamine/apps/firefox/firefox` to the location of the firefox executable, `/usr/bin/firefox`. You will need to change the `/class/datamine/apps/geckodriver` to the location of the geckodriver executable, `/vendor/geckodriver`. 
-==== -+ -<8> Here, I want you to use the `ENTRYPOINT` command to place the commands that you _always_ want to run. -+ -[TIP] -==== -It will be 3 of the 4 of the following (in quotes in the right format): - ----- -python get_price.py zip 47906 ----- -==== -+ -<9> Here, I want you to use the `CMD` command to place a default zip code to search for. The `CMD` command will get overwritten by commands you enter in the terminal. -+ -[TIP] -==== -For example: - ----- -CMD ["47906"] ----- -==== - -The combination of (8) and (9) allow for the following functionality. - -[source,bash] ----- -docker run ABC123XYZ ----- - -.Output ----- -319876.0 # default price for 47906 (our default zip passed in (9)) ----- - -Or, if you want to search for a zip code that is _not_ the default zip code (47906 in my example). - -[source,bash] ----- -docker run ABC123XYZ 63026 ----- - -.Output ----- -498393.15 # price for 63026 ----- - -Very cool! - -Okay, lets build your image. - -[source,bash] ----- -docker build -t pricer:latest . ----- - -Upon success, you should be able to run the following to get the image id. - -[source,bash] ----- -docker inspect pricer:latest --format '{{ .ID }}' ----- - -.Output ----- -sha256:skjdbgf02u4ntb2j4tn ----- - -Then to test your image, run the following: - -[source,bash] ----- -docker run skjdbgf02u4ntb2j4tn ----- - -[IMPORTANT] -==== -Here, replace skjdbgf02u4ntb2j4tn with _your_ image id. -==== - -Then, to test a different, non-default zip code, run the following: - -[source,bash] ----- -docker run skjdbgf02u4ntb2j4tn 63026 ----- - -[IMPORTANT] -==== -Make sure 63026 is a zip code that is different from your default zip code. -==== - -Awesome job! Okay, now, take some screenshots of all your hard work, and add them to your Jupyter Notebook in a markdown cell. Please also include your Dockerfile contents. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -[IMPORTANT] -==== -You do _not_ need to complete the previous questions to complete this one. -==== - -So all the talk about portability, yet we've been working on the same VM. Well, let's use Singularity on Brown to run our code! - -[NOTE] -==== -Singularity is a tool _similar_ to Docker, but different in many ways. The important thing to realize here is that since we have a OCI compliant image publicly available, we can use Singularity to run our code. Otherwise, it is safe to just think of this as a different "docker" that works on Brown (for now). -==== - -First step is to exit your VM if you have not already. Just run `exit`. - -Then, while in Brown, _pull_ our image. We've uploaded a correct version of the image for anyone to use. To pull the image using Singularity, run the following command. - -[source,bash] ----- -cd $HOME -singularity pull docker://kevinamstutz/pricer:latest ----- - -This may take a couple minutes to run. Once complete, you will see a SIF file in your $HOME directory called `pricer_latest.sif`. Think of this file as your container, but rather than accessing it using an engine (for example with `docker images`), you have a file. - -Then, to run the image, run the following command. - -[source,bash] ----- -cd $HOME -singularity run --cleanenv --pwd '/workspace/' pricer_latest.sif ----- - -[NOTE] -==== -You may notice the extra argument `--cleanenv`. This is to prevent environment variables on Brown from leaking into our container. In a lot of ways it doesn't make much sense why this wouldn't be a default. 
- -In addition, the `WORKDIR` command is not respected by Singularity. This feature makes sense due to some core differences in design, however, it _does_ make it marginally more difficult to use images built using Docker, and as a result makes it less reliable to simply pull and image and run it. This is what the `--pwd '/workspace/'` argument is for. With that being said, if you don't already _know_ the location from which the container expects to run, this can lead to more work. -==== - -Then, to give it a non-default zip code, run the following command. - -[source,bash] ----- -singularity run --cleanenv --pwd '/workspace/' pricer_latest.sif 33004 ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-projects.adoc deleted file mode 100644 index 60d4d5861..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/39000/39000-f2021-projects.adoc +++ /dev/null @@ -1,59 +0,0 @@ -= STAT 39000 - -== Project links - -[NOTE] -==== -Only the best 10 of 13 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -* xref:fall2021/39000/39000-f2021-officehours.adoc[STAT 39000 Office Hours for Fall 2021] -* xref:fall2021/39000/39000-f2021-project01.adoc[Project 1: Review: Gettings started with Jupyter Lab] -* xref:fall2021/39000/39000-f2021-project02.adoc[Project 2: Python documentation: part I] -* xref:fall2021/39000/39000-f2021-project03.adoc[Project 3: Python documentation: part II] -* xref:fall2021/39000/39000-f2021-project04.adoc[Project 4: Testing in Python: part I] -* xref:fall2021/39000/39000-f2021-project05.adoc[Project 5: Testing in Python: part II] -* xref:fall2021/39000/39000-f2021-project06.adoc[Project 6: Virtual environments, git, & sharing Python code: part I] -* xref:fall2021/39000/39000-f2021-project07.adoc[Project 7: Virtual environments, git, & sharing Python code: part II] -* xref:fall2021/39000/39000-f2021-project08.adoc[Project 8: Virtual environments, git, & sharing Python code: part III & APIs: part I] -* xref:fall2021/39000/39000-f2021-project09.adoc[Project 9: APIs: part II] -* xref:fall2021/39000/39000-f2021-project10.adoc[Project 10: APIs: part III] -* xref:fall2021/39000/39000-f2021-project11.adoc[Project 11: APIs: part IV] -* xref:fall2021/39000/39000-f2021-project12.adoc[Project 12: Containerization: part I] -* xref:fall2021/39000/39000-f2021-project13.adoc[Project 13: Containerization: part II] - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:55pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. 
- -**Always** double check that the work that you submitted was uploaded properly. After submitting your project in Gradescope, you will be able to download the project to verify that the content you submitted is what the graders will see. You will **not** get credit for or be able to re-submit your work if you accidentally uploaded the wrong project, or anything else. It is your responsibility to ensure that you are uploading the correct content. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -=== Sign up - -https://piazza.com/purdue/fall2021/stat39000 - -=== Link - -https://piazza.com/purdue/fall2021/stat39000/home - -== Syllabus - -++++ -include::book:ROOT:partial$syllabus.adoc[] -++++ - -== Office hour schedule - -++++ -include::book:ROOT:partial$office-hour-schedule.adoc[] -++++ \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2021/logistics/19000-f2021-officehours.adoc b/projects-appendix/modules/ROOT/pages/fall2021/logistics/19000-f2021-officehours.adoc deleted file mode 100644 index 76ccbcd40..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/logistics/19000-f2021-officehours.adoc +++ /dev/null @@ -1,463 +0,0 @@ -= STAT 19000 Office Hours for Fall 2021 - -It might be helpful to also have the office hours for STAT 29000 and STAT 39000: - -xref:logistics/29000-f2021-officehours.adoc[STAT 29000 Office Hours for Fall 2021] - -xref:logistics/39000-f2021-officehours.adoc[STAT 39000 Office Hours for Fall 2021] - -and it might be helpful to look at the -xref:logistics/officehours.adoc[general office hours policies]. 
- -The STAT 19000 office hours and WebEx addresses are the following: - -Webex addresses for TAs, Dr Ward, and Kevin Amstutz - -[cols="2,1,4"] -|=== -|TA Name |Class |Webex chat room URL - -|Dr Ward (seminars) -|all -|https://purdue.webex.com/meet/mdw - -|Kevin Amstutz -|all -|https://purdue.webex.com/meet/kamstut - -|Melissa Cai Shi -|19000 -|https://purdue.webex.com/meet/mcaishi - -|Shreyas Chickerur -|19000 -|https://purdue-student.webex.com/meet/schicker - -|Nihar Chintamaneni -|19000 -|https://purdue-student.webex.com/meet/chintamn - -|Sumeeth Guda -|19000 -|https://purdue-student.webex.com/meet/sguda - -|Jonah Hu -|19000 -|https://purdue-student.webex.com/meet/hu625 - -|Darren Iyer -|19000 -|https://purdue-student.webex.com/meet/iyerd - -|Pramey Kabra -|19000 -|https://purdue-student.webex.com/meet/kabrap - -|Ishika Kamchetty -|19000 -|https://purdue-student.webex.com/meet/ikamchet - -|Jackson Karshen -|19000 -|https://purdue-student.webex.com/meet/jkarshe - -|Bhargavi Katuru -|19000 -|https://purdue-student.webex.com/meet/bkaturu - -|Michael Kruse -|19000 -|https://purdue-student.webex.com/meet/kruseml - -|Ankush Maheshwari -|19000 -|https://purdue-student.webex.com/meet/mahesh20 - -|Hyeong Park -|19000 -|https://purdue-student.webex.com/meet/park1119 - -|Vandana Prabhu -|19000 -|https://purdue-student.webex.com/meet/prabhu11 - -|Meenu Ramakrishnan -|19000 -|https://purdue-student.webex.com/meet/ramakr20 - -|Rthvik Raviprakash -|19000 -|https://purdue-student.webex.com/meet/rravipra - -|Chintan Sawla -|19000 -|https://purdue-student.webex.com/meet/csawla - -|Mridhula Srinivasa -|19000 -|https://purdue-student.webex.com/meet/sriniv99 - -|Tanya Uppal -|19000 -|https://purdue-student.webex.com/meet/tuppal - -|Keerthana Vegesna -|19000 -|https://purdue-student.webex.com/meet/vvegesna - -|Maddie Woodrow -|19000 -|https://purdue-student.webex.com/meet/mwoodrow - -|Adrienne Zhang -|19000 -|https://purdue-student.webex.com/meet/zhan4000 -|=== - -[cols="1,1,1,1,1,1,1"] -|=== -|Time (ET) |Sunday |Monday |Tuesday |Wednesday |Thursday |Friday - -|8:30 AM - 9:00 AM -| -.2+|Seminar: **Dr Ward**, Maddie Woodrow, Vandana Prabhu, Melissa Cai Shi, Jonah Hu, Mridhula Srinivasan, Michael Kruse -|Chintan Sawla -|Chintan Sawla -|Ishika Kamchetty, Jackson Karshen -|Chintan Sawla, Michael Kruse - - -|9:00 AM - 9:30 AM -| -|Chintan Sawla -|Chintan Sawla, Maddie Woodrow -|Ishika Kamchetty, Jackson Karshen -|Chintan Sawla, Michael Kruse - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|9:30 AM - 10:00 AM -| -.2+|Seminar: **Dr Ward**, Rthvik Raviprakash, Jonah Hu, Bhargavi Katuru, Sumeeth Guda (last half) -|Chintan Sawla -|Chintan Sawla, Maddie Woodrow -|Ishika Kamchetty, Jackson Karshen -|Chintan Sawla, Nihar Chintamaneni - -|10:00 AM - 10:30 AM -| -|Shreyas Chickerur -|Maddie Woodrow -|Mridhula Srinivasan, Ishika Kamchetty -|Maddie Woodrow, Nihar Chintamaneni - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|10:30 AM - 11:00 AM -| -.2+|Seminar: **Dr Ward**, Michael Kruse, Sumeeth Guda, Ishika Kamchetty, Rthvik Raviprakash, Pramey Kabra, Bhargavi Katuru -|Shreyas Chickerur -|Michael Kruse, Maddie Woodrow -|Mridhula Srinivasan, Ishika Kamchetty -|Maddie Woodrow, Nihar Chintamaneni - -|11:00 AM - 11:30 AM -| -|Shreyas Chickerur -|Shreyas Chickerur, Michael Kruse -|Mridhula Srinivasan, Ishika Kamchetty -|Ankush Maheshwari, Nihar Chintamaneni - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** 
-|**Wednesday** -|**Thursday** -|**Friday** - -|11:30 AM - 12:00 PM -| -|Shreyas Chickerur -| -|Shreyas Chickerur, Michael Kruse -|Mridhula Srinivasan, Ishika Kamchetty -|Ankush Maheshwari, Nihar Chintamaneni - -|12:00 PM - 12:30 PM -| -|Shreyas Chickerur -|Ishika Kamchetty -|Shreyas Chickerur, Michael Kruse -|Mridhula Srinivasan, Ishika Kamchetty -|Ankush Maheshwari, Nihar Chintamaneni - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|12:30 PM - 1:00 PM -| -|Shreyas Chickerur -|Ishika Kamchetty, Melissa Cai Shi -|Shreyas Chickerur, Tanya Uppal -|Rthvik Raviprakash, Sumeeth Guda -|Vandana Prabhu, Ankush Maheshwari - -|1:00 PM - 1:30 PM -| -|Shreyas Chickerur -|Melissa Cai Shi -|Shreyas Chickerur, Tanya Uppal -|Rthvik Raviprakash, Sumeeth Guda -|Vandana Prabhu, Maddie Woodrow - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|1:30 PM - 2:00 PM -| -|Nihar Chintamaneni -|Melissa Cai Shi -|Tanya Uppal -|Rthvik Raviprakash, Sumeeth Guda -|Vandana Prabhu, Maddie Woodrow - -|2:00 PM - 2:30 PM -| -|Nihar Chintamaneni -|Mridhula Srinivasan -| -|Rthvik Raviprakash, Pramey Kabra -|Jonah Hu, Maddie Woodrow - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|2:30 PM - 3:00 PM -| -|Nihar Chintamaneni -|Mridhula Srinivasan -|Jonah Hu -|Rthvik Raviprakash, Pramey Kabra -|Jonah Hu, Maddie Woodrow - -|3:00 PM - 3:30 PM -| -|Nihar Chintamaneni -| -|Jonah Hu -|Hyeong Park, Pramey Kabra, Keerthana Vegesna -|Jonah Hu, Sumeeth Guda - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|3:30 PM - 4:00 PM -| -|Melissa Cai Shi -|Adrienne Zhang -| -|Hyeong Park, Keerthana Vegesna -|Jonah Hu, Sumeeth Guda - -|4:00 PM - 4:30 PM -| -|Melissa Cai Shi -|Adrienne Zhang -|Mridhula Srinivasan, Bhargavi Katuru (online) -|Hyeong Park -|Jonah Hu, Sumeeth Guda - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|4:30 PM - 5:00 PM -| -.2+|Seminar: **Dr Ward**, Tanya Uppal, Jackson Karshen, Keerthana Vegesna, Bhargavi Katuru -|Adrienne Zhang -|Mridhula Srinivasan, Bhargavi Katuru (online) -|Hyeong Park, Pramey Kabra -|Jonah Hu, Sumeeth Guda - -|5:00 PM - 5:30 PM -| -|Adrienne Zhang -|Mridhula Srinivasan, Bhargavi Katuru (online) -|Hyeong Park, Pramey Kabra -|Tanya Uppal - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|5:30 PM - 6:00 PM -| -| -|Adrienne Zhang -|Jackson Karshen -|Hyeong Park, Pramey Kabra -|Tanya Uppal, Bhargavi Katuru (online) - -|6:00 PM - 6:30 PM -| -| -|Tanya Uppal -|Michael Kruse -|Jackson Karshen, Rthvik Raviprakash -|Bhargavi Katuru, Meenu Ramakrishnan - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|6:30 PM - 7:00 PM -| -|Keerthana Vegesna -|Tanya Uppal -|Michael Kruse -|Jackson Karshen, Rthvik Raviprakash -|Bhargavi Katuru, Meenu Ramakrishnan - -|7:00 PM - 7:30 PM -| -|Keerthana Vegesna -|Tanya Uppal -|Vandana Prabhu -|Jackson Karshen, Rthvik Raviprakash, Ankush Maheshwari -|Vandana Prabhu, Meenu Ramakrishnan - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|7:30 PM - 8:00 PM -| -|Keerthana Vegesna -|Adrienne Zhang -|Vandana Prabhu, Keerthana Vegesna -|Jackson Karshen, Meenu Ramakrishnan, Ankush Maheshwari -|Vandana Prabhu, Meenu Ramakrishnan - -|8:00 PM - 8:30 PM -| -|Hyeong 
Park -|Adrienne Zhang -|Chintan Sawla, Keerthana Vegesna -|Jackson Karshen, Meenu Ramakrishnan, Ankush Maheshwari -|Vandana Prabhu, Meenu Ramakrishnan - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|8:30 PM - 9:00 PM -| -|Hyeong Park -|Adrienne Zhang -|Chintan Sawla, Keerthana Vegesna -|Jackson Karshen, Meenu Ramakrishnan, Ankush Maheshwari -|Meenu Ramakrishnan, Nihar Chintamaneni - -|9:00 PM - 9:30 PM -| -|Hyeong Park -|Adrienne Zhang -|Pramey Kabra, Chintan Sawla, Keerthana Vegesna -|Ankush Maheshwari, Meenu Ramakrishnan -|Meenu Ramakrishnan, Nihar Chintamaneni - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|9:30 PM - 10:00 PM -| -|Hyeong Park -|Adrienne Zhang -|Pramey Kabra, Keerthana Vegesna -|Ankush Maheshwari, Sumeeth Guda -|Melissa Cai Shi, Meenu Ramakrishnan - -|10:00 PM - 10:30 PM -| -|Hyeong Park -|Adrienne Zhang -|Pramey Kabra -|Ankush Maheshwari, Sumeeth Guda -|Melissa Cai Shi - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|10:30 PM - 11:00 PM -| -|Hyeong Park -|Adrienne Zhang -|Pramey Kabra -|Ankush Maheshwari -|Melissa Cai Shi -|=== - - diff --git a/projects-appendix/modules/ROOT/pages/fall2021/logistics/29000-f2021-officehours.adoc b/projects-appendix/modules/ROOT/pages/fall2021/logistics/29000-f2021-officehours.adoc deleted file mode 100644 index c50389dfc..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/logistics/29000-f2021-officehours.adoc +++ /dev/null @@ -1,321 +0,0 @@ -= STAT 29000 and 39000 Office Hours for Fall 2021 - -It might be helpful to also have the office hours for STAT 19000: - -xref:logistics/19000-f2021-officehours.adoc[STAT 19000 Office Hours for Fall 2021] - -and it might be helpful to look at the -xref:logistics/officehours.adoc[general office hours policies]. 
- -The STAT 29000 and 39000 office hours and WebEx addresses are the following: - -Webex addresses for TAs, Dr Ward, and Kevin Amstutz - -[cols="2,1,4"] -|=== -|TA Name |Class |Webex chat room URL - -|Dr Ward (seminars) -|all -|https://purdue.webex.com/meet/mdw - -|Kevin Amstutz -|all -|https://purdue.webex.com/meet/kamstut - -|Jacob Bagadiong -|29000 -|https://purdue-student.webex.com/meet/jbagadio - -|Darren Iyer -|29000 -|https://purdue-student.webex.com/meet/iyerd - -|Rishabh Rajesh -|29000 -|https://purdue-student.webex.com/meet/rajeshr - -|Haozhe Zhou -|29000 -|https://purdue-student.webex.com/meet/zhou929 - -|Nikhil D'Souza -|39000 -|https://purdue-student.webex.com/meet/dsouza13 -|=== - -[cols="1,1,1,1,1,1"] -|=== -|Time (ET) |Monday |Tuesday |Wednesday |Thursday |Friday - -|8:00 AM - 9:00 AM -| -| -| -|Jacob Bagadiong -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|8:30 AM - 9:00 AM -.2+|Seminar: **Dr Ward**, Haozhe Zhou -| -| -|Jacob Bagadiong -|Rishabh Rajesh - -|9:00 AM - 9:30 AM -| -| -|Jacob Bagadiong -|Rishabh Rajesh - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|9:30 AM - 10:00 AM -.2+|Seminar: **Dr Ward**, Haozhe Zhou, Nikhil D'Souza -| -| -|Jacob Bagadiong -|Haozhe Zhou - -|10:00 AM - 10:30 AM -| -|Darren Iyer -| -|Haozhe Zhou - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|10:30 AM - 11:00 AM -.2+|Seminar: **Dr Ward**, Haozhe Zhou -| -|Darren Iyer -| -|Haozhe Zhou - -|11:00 AM - 11:30 AM -| -|Darren Iyer -| -|Haozhe Zhou - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|11:30 AM - 12:00 PM -| -| -| -| -|Nikhil D'Souza (WebEx) - -|12:00 PM - 12:30 PM -| -| -| -| -|Nikhil D'Souza (WebEx) - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|12:30 PM - 1:00 PM -| -| -| -| -|Nikhil D'Souza (WebEx) - -|1:00 PM - 1:30 PM -| -| -| -| -|Nikhil D'Souza (WebEx) - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|1:30 PM - 2:00 PM -| -| -| -| -| - -|2:00 PM - 2:30 PM -| -|Jacob Bagadiong -|Rishabh Rajesh -|Darren Iyer -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|2:30 PM - 3:00 PM -| -|Jacob Bagadiong -|Rishabh Rajesh -|Darren Iyer -| - -|3:00 PM - 3:30 PM -| -|Jacob Bagadiong -|Rishabh Rajesh -| -|Darren Iyer - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|3:30 PM - 4:00 PM -| -|Jacob Bagadiong -|Rishabh Rajesh -| -|Darren Iyer - -|4:00 PM - 4:30 PM -| -| -|Rishabh Rajesh -| -|Darren Iyer - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|4:30 PM - 5:00 PM -.2+|Seminar: **Dr Ward**, Jacob Bagadiong -| -|Rishabh Rajesh -| -|Darren Iyer - -|5:00 PM - 5:30 PM -| -| -| -|Darren Iyer - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|5:30 PM - 6:00 PM -|| -| -| -|Darren Iyer - - -|6:00 PM - 6:30 PM -|Nikhil D'Souza -|Nikhil D'Souza -|Jacob Bagadiong -| -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|6:30 PM - 7:00 PM -|Nikhil D'Souza -|Nikhil D'Souza -|Jacob Bagadiong -| -| - -|7:00 PM - 7:30 PM -|Nikhil D'Souza -|Nikhil D'Souza -| -|Rishabh Rajesh -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|7:30 PM - 8:00 PM -| -| -| -|Rishabh Rajesh -| - -|8:00 PM - 
8:30 PM -| -| -| -|Rishabh Rajesh -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|8:30 PM - 9:00 PM -| -| -| -|Rishabh Rajesh -| -|=== - - diff --git a/projects-appendix/modules/ROOT/pages/fall2021/logistics/39000-f2021-officehours.adoc b/projects-appendix/modules/ROOT/pages/fall2021/logistics/39000-f2021-officehours.adoc deleted file mode 100644 index c50389dfc..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2021/logistics/39000-f2021-officehours.adoc +++ /dev/null @@ -1,321 +0,0 @@ -= STAT 29000 and 39000 Office Hours for Fall 2021 - -It might be helpful to also have the office hours for STAT 19000: - -xref:logistics/19000-f2021-officehours.adoc[STAT 19000 Office Hours for Fall 2021] - -and it might be helpful to look at the -xref:logistics/officehours.adoc[general office hours policies]. - -The STAT 29000 and 39000 office hours and WebEx addresses are the following: - -Webex addresses for TAs, Dr Ward, and Kevin Amstutz - -[cols="2,1,4"] -|=== -|TA Name |Class |Webex chat room URL - -|Dr Ward (seminars) -|all -|https://purdue.webex.com/meet/mdw - -|Kevin Amstutz -|all -|https://purdue.webex.com/meet/kamstut - -|Jacob Bagadiong -|29000 -|https://purdue-student.webex.com/meet/jbagadio - -|Darren Iyer -|29000 -|https://purdue-student.webex.com/meet/iyerd - -|Rishabh Rajesh -|29000 -|https://purdue-student.webex.com/meet/rajeshr - -|Haozhe Zhou -|29000 -|https://purdue-student.webex.com/meet/zhou929 - -|Nikhil D'Souza -|39000 -|https://purdue-student.webex.com/meet/dsouza13 -|=== - -[cols="1,1,1,1,1,1"] -|=== -|Time (ET) |Monday |Tuesday |Wednesday |Thursday |Friday - -|8:00 AM - 9:00 AM -| -| -| -|Jacob Bagadiong -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|8:30 AM - 9:00 AM -.2+|Seminar: **Dr Ward**, Haozhe Zhou -| -| -|Jacob Bagadiong -|Rishabh Rajesh - -|9:00 AM - 9:30 AM -| -| -|Jacob Bagadiong -|Rishabh Rajesh - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|9:30 AM - 10:00 AM -.2+|Seminar: **Dr Ward**, Haozhe Zhou, Nikhil D'Souza -| -| -|Jacob Bagadiong -|Haozhe Zhou - -|10:00 AM - 10:30 AM -| -|Darren Iyer -| -|Haozhe Zhou - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|10:30 AM - 11:00 AM -.2+|Seminar: **Dr Ward**, Haozhe Zhou -| -|Darren Iyer -| -|Haozhe Zhou - -|11:00 AM - 11:30 AM -| -|Darren Iyer -| -|Haozhe Zhou - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|11:30 AM - 12:00 PM -| -| -| -| -|Nikhil D'Souza (WebEx) - -|12:00 PM - 12:30 PM -| -| -| -| -|Nikhil D'Souza (WebEx) - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|12:30 PM - 1:00 PM -| -| -| -| -|Nikhil D'Souza (WebEx) - -|1:00 PM - 1:30 PM -| -| -| -| -|Nikhil D'Souza (WebEx) - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|1:30 PM - 2:00 PM -| -| -| -| -| - -|2:00 PM - 2:30 PM -| -|Jacob Bagadiong -|Rishabh Rajesh -|Darren Iyer -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|2:30 PM - 3:00 PM -| -|Jacob Bagadiong -|Rishabh Rajesh -|Darren Iyer -| - -|3:00 PM - 3:30 PM -| -|Jacob Bagadiong -|Rishabh Rajesh -| -|Darren Iyer - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|3:30 PM - 4:00 PM -| -|Jacob Bagadiong -|Rishabh Rajesh -| -|Darren Iyer - -|4:00 PM - 4:30 PM -| -| -|Rishabh Rajesh -| -|Darren 
Iyer - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|4:30 PM - 5:00 PM -.2+|Seminar: **Dr Ward**, Jacob Bagadiong -| -|Rishabh Rajesh -| -|Darren Iyer - -|5:00 PM - 5:30 PM -| -| -| -|Darren Iyer - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|5:30 PM - 6:00 PM -|| -| -| -|Darren Iyer - - -|6:00 PM - 6:30 PM -|Nikhil D'Souza -|Nikhil D'Souza -|Jacob Bagadiong -| -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|6:30 PM - 7:00 PM -|Nikhil D'Souza -|Nikhil D'Souza -|Jacob Bagadiong -| -| - -|7:00 PM - 7:30 PM -|Nikhil D'Souza -|Nikhil D'Souza -| -|Rishabh Rajesh -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|7:30 PM - 8:00 PM -| -| -| -|Rishabh Rajesh -| - -|8:00 PM - 8:30 PM -| -| -| -|Rishabh Rajesh -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|8:30 PM - 9:00 PM -| -| -| -|Rishabh Rajesh -| -|=== - - diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project01.adoc deleted file mode 100644 index a043f1434..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project01.adoc +++ /dev/null @@ -1,314 +0,0 @@ -= TDM 10100: Project 1 -- 2022 - -**Motivation:** In this project we are going to jump head first into The Data Mine. We will load datasets into the R environment, and introduce some core programming concepts like variables, vectors, types, etc. As we will be "living" primarily in an IDE called Jupyter Lab, we will take some time to learn how to connect to it, configure it, and run code. - -**Context:** This is our first project as a part of The Data Mine. We will get situated, configure the environment we will be using throughout our time with The Data Mine, and jump straight into working with data! - -**Scope:** r, Jupyter Lab, Anvil - -.Learning Objectives -**** -- Read about and understand computational resources available to you. -- Learn how to run R code in Jupyter Lab on Anvil. -- Read and write basic (csv) data using R. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/1991.csv` -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` -- `/anvil/projects/tdm/data/disney/flight_of_passage.csv` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_5vtofjko?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_1gf9pnt2?wid=_983291"></iframe> -++++ - -For this course, projects will be solved using the https://www.rcac.purdue.edu/compute/anvil[Anvil computing cluster]. - -Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (RAM). Use the information on the provided webpages to calculate how many cores and how much memory is available _in total_ for the Anvil "sub-clusters". - -Take a minute and figure out how many cores and how much memory is available on your own computer. 
If you do not have a computer of your own, work with a friend to see how many cores there are, and how much memory is available, on their computer. - -.Items to submit -==== -- A sentence explaining how many cores and how much memory is available, in total, across all nodes in the sub-clusters on Anvil. -- A sentence explaining how many cores and how much memory is available, in total, for your own computer. -==== - -=== Question 2 - -We will be using Jupyter Lab on the Anvil cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster. - -Navigate and login to https://ondemand.anvil.rcac.purdue.edu using your ACCESS credentials (and Duo). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward. - -[TIP] -==== -If you did not (yet) setup your 2-factor authentication credentials with Duo, you can go back to Step 9 and setup the credentials here: https://the-examples-book.com/starter-guides/anvil/access-setup -==== - -Towards the middle of the top menu, there will be an item labeled btn:[My Interactive Sessions], click on btn:[My Interactive Sessions]. On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Bioinformatics, Interactive Apps, and The Data Mine. Under "The Data Mine" section, you should see a button that says btn:[Jupyter Notebook], click on btn:[Jupyter Notebook]. - -If everything was successful, you should see a screen similar to the following. - -image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"] - -Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 2 CPU cores and 3800 Mb. - -[NOTE] -==== -If you select 4000 Mb of memory instead of 3800 Mb, you will end up getting 3 CPU cores instead of 2. OnDemand tries to balance the memory to CPU ratio to be _about_ 1900 Mb per CPU core. -==== - -We use the Anvil cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine. - -After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on btn:[Connect to Jupyter] to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following. - -image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"] - -There are 2 primary options that you will need to know about. - -f2022-s2023:: -The course kernel where Python code is run without any extra work, and you have the ability to run R code or SQL queries in the same environment. - -[TIP] -==== -To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page]. -==== - -f2022-s2023-r:: -An alternative, native R kernel that you can use for projects with _just_ R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell. - -For now, let's focus on the f2022-s2023 kernel. Click on btn:[f2022-s2023], and a fresh notebook will be created for you. - -[NOTE] -==== -Soon, we'll have the f2022-s2023-r kernel available and ready to use! -==== - -Test it out! 
Run the following code in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on. What is the name of the node on Anvil that you are running on? - -[source,r] ----- -%%R - -system("hostname", intern=TRUE) ----- - -[TIP] -==== -To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu. -==== - -.Items to submit -==== -- Code used to solve this problem in a "code" cell. -- Output from running the code (the name of the node on Anvil that you are running on). -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_6s6gsi1e?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_708jtb6h?wid=_983291"></iframe> -++++ - -In the upper right-hand corner of your notebook, you will see the current kernel for the notebook, `f2022-s2023`. If you click on this name you will have the option to swap kernels out -- no need to do this yet, but it is good to know! - -Practice running the following examples. - -python:: -[source,python] ----- -my_list = [1, 2, 3] -print(f'My list is: {my_list}') ----- - -SQL:: -[source, sql] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -[source, ipython] ----- -%%sql - -SELECT * FROM titles LIMIT 5; ----- - -[NOTE] -==== -In a previous semester, you'd need to load the sql extension first -- this is no longer needed as we've made a few improvements! - -[source,ipython] ----- -%load_ext sql ----- -==== - -bash:: -[source,bash] ----- -%%bash - -awk -F, '{miles=miles+$19}END{print "Miles: " miles, "\nKilometers:" miles*1.609344}' /anvil/projects/tdm/data/flights/subset/1991.csv ----- - -[TIP] -==== -To learn more about how to run various types of code using this kernel, see https://the-examples-book.com/projects/templates[our template page]. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -This year, the first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`). - -Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`. - -There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain markdown text which you can render into nicely formatted text). How many cells of each type are there in this template by default? - -Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A. - -.Items to submit -==== -- How many of each types of cells are there in the default template? 
-==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_mcz06hz6?wid=_983291"></iframe> -++++ - -In question (1) we answered questions about cores and memory for the Anvil clusters. To do so, we needed to perform some arithmetic. Instead of using a calculator (or paper, or mental math for you good-at-mental-math folks), write these calculations using R _and_ Python, in separate code cells. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 6 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_xjiimzfw?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_34dqck6l?wid=_983291"></iframe> -++++ - -In the previous question, we ran our first R and Python code (aside from _provided_ code). In the fall semester, we will focus on learning R. In the spring semester, we will learn some Python. Throughout the year, we will always be focused on working with data, so we must learn how to load data into memory. Load your first dataset into R by running the following code. - -[source,ipython] ----- -%%R - -dat <- read.csv("/anvil/projects/tdm/data/disney/flight_of_passage.csv") ----- - -Confirm that the dataset has been read in by passing the dataset, `dat`, to the `head()` function. The `head` function will return the first 5 rows of the dataset. - -[source,r] ----- -%%R - -head(dat) ----- - -[IMPORTANT] -==== -Remember -- if you are in a _new_ code cell, you'll need to add `%%R` to the top of the code cell, otherwise, Jupyter will try to run your R code using the _Python_ interpreter -- that would be no good! -==== - -`dat` is a variable that contains our data! We can name this variable anything we want. We do _not_ have to name it `dat`; we can name it `my_data` or `my_data_set`. - -Run our code to read in our dataset, this time, instead of naming our resulting dataset `dat`, name it `flight_of_passage`. Place all of your code into a new cell. Be sure to include a level 2 header titled "Question 6", above your code cell. - -[TIP] -==== -In markdown, a level 2 header is any line starting with 2 hashtags. For example, `Question X` with two hashtags beforehand is a level 2 header. When rendered, this text will appear much larger. You can read more about markdown https://guides.github.com/features/mastering-markdown/[here]. -==== - -[NOTE] -==== -We didn't need to re-read in our data in this question to make our dataset be named `flight_of_passage`. We could have re-named `dat` to be `flight_of_passage` like this. - -[source,r] ----- -flight_of_passage <- dat ----- - -Some of you may think that this isn't exactly what we want, because we are copying over our dataset. You are right, this is certainly _not_ what we want! What if it was a 5Gb dataset, that would be a lot of wasted space! Well, R does copy on modify. What this means is that until you modify either `dat` or `flight_of_passage` the dataset isn't copied over. You can therefore run the following code to remove the other reference to our dataset. - -[source,r] ----- -rm(dat) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 7 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_dsk4jniu?wid=_983291"></iframe> -++++ - -Let's pretend we are now done with the project. We've written some code, maybe added some markdown cells to explain what we did, and we are ready to submit our assignment. For this course, we will turn in a variety of files, depending on the project. - -We will always require a Jupyter Notebook file. Jupyter Notebook files end in `.ipynb`. This is our "source of truth" and what the graders will turn to first when grading. - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this. -==== - -A `.ipynb` file is generated by first running every cell in the notebook, and then clicking the "Download" button from menu:File[Download]. - -In addition to the `.ipynb`, if a project uses R code, you will need to also submit R code in an R script. An R script is just a text file with the extension `.R`. When submitting Python code, you will need to also submit a Python script. A Python script is just a text file with the extension `.py`. - -Let's practice. Take the R code from this project and copy and paste it into a text file with the `.R` extension. Call it `firstname-lastname-project01.R`. Next, take the Python code from this project and copy and paste it into a text file with the `.py` extension. Call it `firstname-lastname-project01.py`. Download your `.ipynb` file -- making sure that the output from all of your code is present and in the notebook (the `.ipynb` file will also be referred to as "your notebook" or "Jupyter notebook"). - -Once complete, submit your notebook, R script, and Python script. - -.Items to submit -==== -- `firstname-lastname-project01.R`. -- `firstname-lastname-project01.py`. -- `firstname-lastname-project01.ipynb`. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project02.adoc deleted file mode 100644 index 356f71284..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project02.adoc +++ /dev/null @@ -1,265 +0,0 @@ -= TDM 10100: Project 2 -- 2022 -Introduction to R part I - -The Rfootnote:[R is case sensitive] environment is a powerful tool to perform data analysis. R's language is often compared to Python. Both languages have their advantages and disadvantages, and both are worth learning. 
- -In this project we will dive in head first and learn some of the basics while solving data-driven problems. - - -.5 basic types of data -[%collapsible] -==== - * Values like 1.5 are called numeric values, real numbers, decimal numbers, etc. - * Values like 7 are called integers or whole numbers. - * Values TRUE or FALSE are called logical values or Boolean values. - * Texts consist of sequences of words (also called strings), and words consist of sequences of characters. - * Values such as 3 + 2ifootnote:[https://stat.ethz.ch/R-manual/R-devel/library/base/html/complex.html] are called complex numbers. We usually do not encounter these in The Data Mine. -==== - - - -[NOTE] -==== -R and Python both have their advantages and disadvantages. A key part of learning data science methods is to understand the situations in which R is a more helpful tool to use, or Python is a more helpful tool to use. Both of them are good for their own purposes. In a similar way, hammers and screwdrivers and drills and many other tools are useful for construction, but they all have their own individual purposes. - -In addition, there are many other languages and tools, e.g., https://julialang.org/[Julia] and https://www.rust-lang.org/[Rust] and https://go.dev/[Go] and many other languages are emerging as relatively newer languages that each have their own advantages. -==== - -**Context:** In the last project we set the stage for the rest of the semester. We got some familiarity with our project templates, and modified and ran some examples. - -In this project, we will continue to use R within Jupyter Lab to solve problems. Soon, you will see how powerful R is and why it is often more effective than using spreadsheets as a tool for data analysis. - -**Scope:** xref:programming-languages:R:index.adoc[r], xref:programming-languages:R:lists-and-vectors.adoc[vectors, lists], indexing - -.Learning Objectives -**** -- Be aware of the different concepts and when to apply them; such as lists, vectors, factors, and data.frames - -- Be able to explain and demonstrate: positional, named, and logical indexing. -- Read and write basic (csv) data using R. -- Identify good and bad aspects of simple plots. - -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -[source,r] ----- -myDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1995.csv", stringsAsFactors = TRUE) ----- - -== ONE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_gs98maih?wid=_983291"></iframe> -++++ - -The data that we may be working with does not always come to us neat and cleanfootnote:["Raw data" vs "Clean data". Some datasets require "cleaning" such as removing duplicates, removing null values and disgarding irrelevent data]. It is important to get a good understanding of the dataset(s) with which you are working. This is the best first step to help solve any data-driven problems. - -.Insider Knowledge -[%collapsible] -==== -Datasets can be thought or as one or more observations of one or more variables. For most datasets, each row is an observation and each column is a variable. (There may be some datasets do not follow that convention.) -==== - -We are going to use the `read.csv` function to load our datasets into a dataframe named ... 
+ -We want to use functions such as `head`, `tail`, `dim`, `summary`, `str`, `class`, to get a better understanding of our dataframe(DF). - -.Helpful Hints -[%collapsible] -==== -[source,r] ----- -#looks at the head of the dataframe -head(myDF) -#looks at the tail of the dataframe -tail(myDF) -#returns the type of data in a column of the dataframe, for instance, the type of data in the column that stores the destination airports of the flights -class(myDF$Dest) ----- -==== -[loweralpha] -.. How many columns does this dataframe have? -.. How many rows does this dataframe have? -.. What type/s of data are in this dataframe (example: numerical values, and/or text strings, etc.) - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The answers to the three questions above -==== - -== TWO -We can create a new vectorfootnote:[https://sudo-labs.github.io/r-data-science/vectors/] containing all of the origin airports (i.e., the airports where the flights departed) from the column `myDF$Origin` of the data frame `myDF`. -[source,r] ----- -#takes the selected information from the dataframe and puts it into a new vector called `myairports` -myairports <- myDF$Origin ----- - -.Insider Knowledge -[%collapsible] -==== -A vector is a simple way to store a sequence of data. The data can be numeric data, logical data, textual data, etc. -==== -To assist with this question, please also see the end of the video from Question 1 (above). -[loweralpha] -.. What type of data is in the vector `myairports`? -.. The vector `myairports` contains all of the airports where flights departed in 1995. Print the first 250 of those airports. [Do not print all of the airports, because there are 5327435 such values!] How many of the first 250 flights departed from O'Hare? -.. How many flights departed by O'Hare altogether in 1995? - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The answers to the 3 questions above. -==== - -== THREE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_2rz4ge9s?wid=_983291"></iframe> -++++ - -Indexing - -.Insider Knowledge -[%collapsible] -==== -Accessing data can be done in many ways, one of those ways is called **_indexing_**. Typically we use brackets **[ ]** when indexing. By doing this we can select or even exclude specific elements. For example we can select a specific column and a certian range within the column. Some examples of symbols to help us select elements include: + - * < less than + - * > greater than + - * <= less than or equal to + - * >= greater than or equal to + - * == is equal + - * != is not equal + -It is also important to note that indexing in R begins at 1. (This means that the first row of the dataframe will be numbered starting at 1.) -==== -.Helpful Hints -[%collapsible] -==== -[source,r] ----- -#finding data by their indices -myDF$Distance[row_index_start:row_index_end,] -#creates a new vector with the specific info -mynewvector <- myDF$putcolumnnamehere -#all of the data from row 3 -myDF[3,] -#all of the data in all of the rows, with columns between myfirstcolumn and mylastcolumn -myDF[,myfirstcolumn:mylastcolumn] -#and/or -#the first 250 values from column 17 -head(myDF[,17], n=250) -#puts all variables that are less than 6 from the dataframe -longdistances = myDF$Distance[myDF$Distance > 2000] ----- -==== -[loweralpha] -.. 
How many flights departed from Indianapolis (`IND`) in 1995? How many flights landed there? -.. Consider the flight data from row 894 the data frame. What airport did it depart from? Where did it arrive? -.. How many flights have a distance of less than 200 miles? - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The answers to the 3 questions above. -==== - -== FOUR - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_j3i2kwfp?wid=_983291"></iframe> -++++ - -Summarizing vectors using tables + - -The `table` command is helpful to know, for summarizing large quantities of data. - - -.Insider Knowledge -[%collapsible] -==== -It is useful to use functions in R and see how they behave, and then to take a function of the result, and take a function of that result, etc. For instance, it is common to summarize a vector in a table, and then sort the results, and then take the first few largest or smallest values. -Remember also that R is a case-sensitive language. -[source,r] ----- -table(myDF$Origin) # summarizes how many flights departed from each airport -sort(table(myDF$Origin)) # sorts those results in numeric order -tail(sort(table(myDF$Origin)),n=10) # finds the 10 most popular airports, according to the number of flights that departed from each airport. ----- - -==== -[loweralpha] -.. Rank the airline companies (in the column `myDF$UniqueCarrier`) according to their popularity, i.e., according to the number of flights on each airline). -.. Which are the three most popular airlines from 1995? -.. Now find the ten airplanes that had the most flights in 1995. List them in order, from most popular to least popular. Do you notice anything unusual about the results? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The answers to the 3 questions above. -==== - -== FIVE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_o558x1ek?wid=_983291"></iframe> -++++ - -Basic graph types are helpful for visualizing data. They can be an important tool in discovering insights into the data you are working with. + -R has a number of tools built in for basic graphs, such as scatter plots, bar charts, histograms, etc. - -.Insider Knowledge -[%collapsible] -==== -A dot plot, also known as a dot chart, is similar to a bar chart or a scatter plot. In R, the categories are displayed along the vertical axis and the corresponding values are displayed according to the horizontal axis. + - -We can assign groups a color to help differentiate while plotting a dot chart + - -We can also plot a column that we find interesting as well to take a look at what the data might show us. -For example if we wanted to see if there was a difference in days of the week and number of flights, we would use `hist`. -[source,r] ----- -mydays<- myDF$DayOfWeek -hist(mydays) ----- - -==== - -.Helpful Hints -[%collapsible] -==== -[source,r] ----- -mycities <- tail(sort(table(myDF$Origin)),n=10) -dotchart(mycities, pch = 21, bg = "green", pt.cex = 1.5) ----- -==== -[loweralpha] -.. Pick a column of data that you are interested in studying, or a question that you want answered. Create either a `plot`, or a `dotchart`. Before making the plot, think about how many dots will be displayed on your `plot` or `dotchart`. 
If you try to display millions of dots, you might cause your Jupyter Lab session to freeze or crash. It is useful to think ahead and to consider how your plot might look, before you accidentally try to display millions of dots. -.. Descibe any patterns you may see in your plot or your dotchart. If there are none, that is okay, and you can just write "there seem to be no patterns." - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The plot or dotchart and your commentary about what you created and what you observed. -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project03.adoc deleted file mode 100644 index 1601bc66a..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project03.adoc +++ /dev/null @@ -1,241 +0,0 @@ -= TDM 10100: Project 3 -- Fall 2022 -Inroduction to R part II - -**Motivation:** `data.frames` are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a `data.frame`. - -**Context:** In the previous project we, ran our first R code, and learned about accessing data inside vectors. In this project we will continue to reinforce what we've already learned and introduce a new, flexible data structure called `data.frame`s. - -**Scope:** r, data.frames, recycling, factors - -.Learning Objectives -**** -- - Explain what "recycling" is in R and predict behavior of provided statements. -- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- List the differences between lists, vectors, factors, and data.frames, and when to use each. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_bulc5ddx?wid=_983291"></iframe> -++++ - -[TIP] -==== -As described in the video above, Dr Ward is using: - -`options(jupyter.rich_display = F)` - -so that the work using the kernel `f2022-s2023-r` looks similar to the work using the kernel `f2022-s2023`. We will probably make this option permanent in the future, but I just wanted to point this out. You do not have to do this, but I like the way it the output looks with this option. -==== - -Using the *f2022-s2023-r* kernel, -lets first see all of the files that are in the Disney folder -[source,r] ----- -list.files("/anvil/projects/tdm/data/disney") ----- - -After looking at several of the files we will go ahead and read in the data frame on the 7 Dwarfs Train. 
-[source,r] ----- -myDF <- read.csv("/anvil/projects/tdm/data/disney/7_dwarfs_train.csv", stringsAsFactors = TRUE) ----- - -If we want to see the file size (aka how large) of the CSV. -[source,r] ----- -file.info("/anvil/projects/tdm/data/disney/7_dwarfs_train.csv")$size ----- -You can also use `file.info` to see other information about the file. - -.Insider Knowledge -[%collapsible] -==== -*size*- double: File size in bytes. + -isdir- logical: Is the file a directory? + -*mode*- integer of class "octmode". The file permissions, printed in octal, for example 644. + -*mtime, ctime, atime*- integer of class "POSIXct": file modification, ‘last status change’ and last access times. + -*uid*- integer: the user ID of the file's owner. + -*gid*- integer: the group ID of the file's group. + -*uname*- character: uid interpreted as a user name. -grname + -character: gid interpreted as a group name. Unknown user and group names will be NA. -==== - -=== ONE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_39itb6gk?wid=_983291"></iframe> -++++ - -Familiarizing yourself with the data. - -.Helpful Hint -[%collapsible] -==== -You can look at the first 6 rows (`head`) and the last 6 rows (`tail`). The structure (`str`) and/or the dimentions (`dim`) of the dataset. + - -*"SACTMIN"* is the actual minutes that a person waited in line + -*"SPOSTMIN"* is the time about the ride, estimating the wait time. (Any value that is -999 means that the ride was not in service) + -*"datetime"* is the date and time the information was recorded + -*"date"* is the date of the event -==== - -In the last project we learned about how to look at the data.frame. Based on that, write 1-2 sentences describing the dataset (how many rows, how many columns, the type of data, etc.) and what it holds. Use the head command to look at the first 21 rows. - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences explaining the dataset. -==== - -=== TWO - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_d2aor19k?wid=_983291"></iframe> -++++ - -Now that we have a better understanding of content and structure of our data. We are diving a bit deeper and making connections within the data. - -[loweralpha] -.. If we are looking at the column *"SPOSTMIN"* what do you notice about the increments of time? I.e., is there anything special about the types of values that appear? How many different wait time options do you see in *"SPOSTMIN"*? -.. How many `NA` values do you see in *"SPOSTMIN"*? -.. Create a new data frame with the name `newDF` in which the *"SPOSTMIN"* column has all `NA` values removed. In other words, select the rows of `myDF` for which *"SPOSTMIN"* is not `NA` and call the resulting `data.frame` by the name `newDF`. - -.Insider Knowledge -[%collapsible] -==== -`na.omit` and `na.exclude` returns objects with the observations removed if they contain any missing values. As well as performs calculations by considering the NA values but does not include them in the calculation. + -`na.rm` first [.underline]#removes the NA values and then# does the calculation. + -`na.pass` returns the object unchanged + -It is also possible to use the `subset` function and the `is.na` function. 
-==== - -.Helpful Hint -[%collapsible] -==== -Use the code below -[source,r] ----- -table(myDF$SPOSTMIN) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The answer to the 3 questions above. -==== -=== THREE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_p8qawzbk?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ez33iof3?wid=_983291"></iframe> -++++ - -Use the `myDF` data.frame for this question. -[loweralpha] -.. On Christmas day, what was the average wait time? On July 26th, what was the average wait time? -.. Is there a difference between the wait times in the summer and the holidays? -.. On which date do the most entries occur in the data set? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The answer to the 3 questions above. -==== - -==== FOUR - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_hzxe468h?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ourx5zju?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_o5i6k7w1?wid=_983291"></iframe> -++++ - -Recycling in R + - -.Insider Knowledge -[%collapsible] -==== -Recycling happens in R automatically. When you are attempting to preform operations like addition, subtraction on two vectors of unequal length. + -The shorter vector will be repeated as long as the operation is completing on the longer vector. -==== - -[loweralpha] -.. Find the lengths of the column *"SPOSTMIN"* in the `myDF` and `newDF`. -.. Create a new vector called `myhours` by adding together *"SPOSTMIN"* columns from `myDF` and `newDF` with each divided by 60. What is the length of that new vector `myhours`? -.. What happened in row 313997? Why? - - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The answers to the 3 questions above. -==== - - -==== FIVE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_3yxfvg2e?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_z8vimoe9?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_0zdn9p1p?wid=_983291"></iframe> -++++ - -Indexing and Expanding dataframes in R - -[source,r] ----- -library(lubridate) -myDF$weekday <- wday(myDF$datetime, label=TRUE) ----- - -[loweralpha] -.. Consider the average wait times. What day of the week in `myDF` has the longest average wait times? -.. Make a plot and a dotchart that illustrate the data for the average wait times. Which one conveys the information better and why? -.. We created a new column in `myDF` that shows the weekdays. Do the same thing for part (a) and (b) again, but this time using the months instead of the days of the week. 
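For part (c) above, a minimal sketch of the idea is shown here. It assumes you want to reuse the `lubridate` package (its `month` helper works just like `wday`) and that missing wait times should simply be ignored via `na.rm`; adjust as you see fit:

[source,r]
----
library(lubridate)

# a month column, analogous to the weekday column created above
myDF$month <- month(myDF$datetime, label = TRUE)

# one possible way to average the posted wait times (SPOSTMIN) within each month,
# ignoring missing values
tapply(myDF$SPOSTMIN, myDF$month, mean, na.rm = TRUE)
----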
- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The answers to the 3 questions above. -==== - - - - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project04.adoc deleted file mode 100644 index fbdd77d1b..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project04.adoc +++ /dev/null @@ -1,262 +0,0 @@ -= TDM 10100: Project 4 -- Fall 2022 -Introduction to R part III - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_1xixgdte?wid=_983291"></iframe> -++++ - -Many data science tools including xref:programming-languges:R:introduction[R] have powerful ways to index data. - -.Insider Knowledge -[%collapsible] -==== -R typically has operations that are vectorized and there is little to no need to write loops. + -R typically also uses indexing instead of using an if statement. - -* Sequential statements (one after another) i.e. + -1. print line 45 + -2. print line 15 + - -**if/else statements** - create an order of direction based on a logical condition. + - -if statement example: -[source,r] ----- -x <- 7 -if (x > 0){ -print ("Positive number") -} ----- -else statement example: -[source,r] ----- -x <- -10 -if(x >= 0){ -print("Non-negative number") -} else { -print("Negative number") -} ----- -In `R`, we can classify many numbers all at once: -[source,r] ----- -x <- c(-10,3,1,-6,19,-3,12,-1) -mysigns <- rep("Non-negative number", times=8) -mysigns[x < 0] <- "Negative number" -mysigns ----- - -==== -**Context:** As we continue to become more familiar with `R` this project will help reinforce the many ways of indexing data in `R`. - -**Scope:** r, data.frames, indexing. - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - - -Using the *f2022-s2023-r* kernel -Lets first see all of the files that are in the `craigslist` folder -[source,r] ----- -list.files("/anvil/projects/tdm/data/craigslist") ----- - -After looking at several of the files we will go ahead and read in the data frame on the Vehicles -[source,r] ----- -myDF <- read.csv("/anvil/projects/tdm/data/craigslist/vehicles.csv", stringsAsFactors = TRUE) ----- - -.Helpful Hints -[%collapsible] -==== -Remember: + - -* If we want to see the file size (aka how large) of the CSV. -[source,r] ----- -file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size ----- - -* You can also use 'file.info' to see other information about the file. 
-==== - -=== ONE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_gbvaezhp?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_kmfxfx9i?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_b18vvxti?wid=_983291"></iframe> -++++ - -It is so important that, each time we look at data, we start by becoming familiar with the data. + -In past projects we have looked at the head/tail along with the structure and the dimensions of the data. We want to continue this practice. - -This dataset has 25 columns, and we are unable to see it all without adjusting the width. We can do this by -[source,r] ----- -options(repr.matrix.max.cols=25, repr.matrix.max.rows=200) ----- -and we also remember (from the previous project) that we can set the output in `R` to look more natural this way: -[source,r] ----- -options(jupyter.rich_display = F) ----- - - -.Helpful Hint -[%collapsible] -==== -You can look at the first 6 rows (`head`), the last 6 rows (`tail`), the structure (`str`), and/or the dimensions (`dim`) of the dataset. -==== - -[loweralpha] -.. How many unique regions are there in total? Name 5 of the different regions that are included in this dataset. -.. How many cars are manufactured in 2011 or afterwards, i.e., they are made in 2011 or newer? -.. In what year was the oldest model manufactured? In what year was the most recent model manufactured? In which year were the most cars manufactured? - -.Helpful Hint -[%collapsible] -==== -To sort and order a single vector you can use this code: -[source,r] ----- -head(myDF$year[order(myDF$year)]) ----- -You can also use the `sort` function, as demonstrated in earlier projects. -==== -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answers to the 3 questions above. -==== - -=== TWO - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_jirr54ck?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ad2lowil?wid=_983291"></iframe> -++++ - -[loweralpha] -.. Create a new column in your data.frame that is labeled `newflag` which indicates if the vehicle for sale has been labeled as `like new`. In other words, the column `newflag` should be `TRUE` if the vehicle on that row is `like new`, and `FALSE` otherwise. -.. Create a new column called `pricecategory` that is -... `cheap` for vehicles less than or equal to $1,500 -... `average` for vehicles strictly more than $1,500 but less than or equal to $10,000 -... `expensive` for vehicles strictly more than $10,000 -.. How many cars are there in each of these three `pricecategories` ? - - -.Helpful Hint -[%collapsible] -==== -Remember to consider any 0 values and or `NA` values - -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The answer to the questions above. 
-==== - -=== THREE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_hwgeymvn?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_bl46t9fu?wid=_983291"></iframe> -++++ - -_**vectoriztion**_ - -Most of R's functions are vectorized, which means that the function will be applied to all elements of a vector, without needing to loop through the elements one at a time. The most common way to access individual elements is by using the `[]` symbol for indexing. - -[loweralpha] -.. Using the `table()` function, and the column `myDF$newflag`, identify how many vehicles are `like new` and how many vehicles are not `like new`. -.. Now using the `cut` function and appropriate `breaks`, create a new column called `newpricecategory`. Verify that this column is identical to the previously created `pricecategory` column, created in question TWO. -.. Make another column called `odometerage`, which has values `new` or `middle age` or `old`, according to whether the odometer is (respectively): less than or equal to 50000; strictly greater than 50000 and less than or equal to 100000; or strictly greater than 100000. How many cars are in each of these categories? - -.Helpful Hint -[%collapsible] -==== -[source,r] ----- -cut(myvector, breaks = c(10,50,200) , labels = c(a,b,c)) ----- -==== - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The answer to the questions above. -==== - -==== FOUR - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_d63ydjm8?wid=_983291"></iframe> -++++ - -**Preparing for Mapping** - -[loweralpha] -.. Extract all of the data for `indianapolis` into a `data.frame` called `myIndy` -.. Identify the most popular region from `myDF`, and extract all of the data from that region into a `data.frame` called `popularRegion`. -.. Create a third `data.frame` with the data from a region of your choice - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The answer to the questions above. -==== - - -==== FIVE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_t9gpji8v?wid=_983291"></iframe> -++++ - -**Mapping** - -Using the R package `leaflet`, make 3 maps of the USA, namely, one map for the data in each of the `data.frames` from question FOUR. - - - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The answers to the 3 questions above. -==== - - - - - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project05.adoc deleted file mode 100644 index 811d5e02b..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project05.adoc +++ /dev/null @@ -1,200 +0,0 @@ -= TDM 10100: Project 5 -- Fall 2022 -Tapply and DataFrames - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ttsk2n6t?wid=_983291"></iframe> -++++ - -**Motivation:** `R` differs from other programing languages that _typically_ work best using vectorized functions and the _apply_ suite instead of using loops. - -.Insider Knowledge -[%collapsible] -==== -Apply Functions: are an alternative to loops. You can use *`apply()`* and its varients (i.e. mapply(), sapply(), lapply(), vapply(), rapply(), and tapply()...) to manuiplate peices of data from data.frames, lists, arrays, matrices in a repetative way. The *`apply()`* functions allow for flexiabilty in crossing data in multiple ways that a loop does not. -==== - -**Context:** We will focus in this project on efficient ways of processing data in `R`. - -**Scope:** r, data.frames, recycling, factors, if/else, for loops, apply suite - -.Learning Objectives -**** -- Demonstrate the ability to use the `tapply` function. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s) in anvil: - -/anvil/projects/tdm/data/election/escaped2020sample.txt - -.Helpful Hint -[%collapsible] -==== -A txt and csv file both sore information in plain text. *csv* files are _always_ separated by commas. In *txt* files the fields can be separated with commas, semicolons, or tab. - - -To read in a txt file as a csv we simply add sep="|" (see code below) -[source,r] ----- - myDF <- read.csv("/anvil/projects/tdm/data/election/escaped2020sample.txt", sep="|") ----- -==== - -== Questions - -=== ONE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_t2adfk4u?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_undlfl0o?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_db31nrf8?wid=_983291"></iframe> -++++ - -Read the dataset `escaped2020sample.txt` into a data.frame called `myDF`. The dataset contains contribution information for the 2020 election year. - -The dataset has a column named `TRANSACTION_DT` which is set up in the `[month].[day].[year]` format. -We want to organize the dates in chronological order. - -When working with dates, it is important to use tools specifically for this purpose (rather than using string manipulation, for example). We've provided you with the code below. The provided code uses the `lubridate` package, an excellent package which hides away many common issues that occur when working with dates. Feel free to check out https://raw.githubusercontent.com/rstudio/cheatsheets/master/lubridate.pdf[the official cheatsheet] in case you'd like to learn more about the package. 
- -[source,r] ----- -library(lubridate, warn.conflicts = FALSE) ----- - -[loweralpha] -.. Use the `mdy` function (from the `lubridate` library) on the column `TRANSACTION_DT`, to create a new column named `newdates`. -.. Using `tapply`, add the values in the `TRANSACTION_AMT` column, according to the values in the `newdate` column. -.. Plot the dates on the x-axis and the information we found in part b on the y-axis. - -.Helpful Hint -[%collapsible] -==== -*tapply()* helps us to compute statistical measures such as mean, median, minimum, maximum, sum, etc... for data that is split into groups. *tapply()* is most helpful when we need to break up a vector into groups, and compute a function on each of the groups. -==== - -[WARNING] -==== -If your `tapply` in Question 1b hates you (e.g., it will absolutely not finish the `tapply`, even after a few minutes), then the fix described below will likely help. Please note that, after you run this fix, you need to reset your memory back to 5000 MB at time 4:16 in the video. - -You do not need to run this "fix" unless you have a cell like this, which should be running, but you are "stuck" on it: -==== - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_m37n59j2?wid=_983291"></iframe> -++++ - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== TWO - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_cg7vwni6?wid=_983291"></iframe> -++++ - -The plot that we just created in question one shows us that the majority of the data collected is found in the years 2018-2020. So we will focus on the year 2019. - -[loweralpha] -.. Create a new dataframe that only contains data for the dates in the range 01/01/2019-05/15/2019 -.. Plot the new dataframe -.. What do you notice about the data? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answer to the questions above -==== - -=== THREE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_uwajsx7z?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_wu96qqja?wid=_983291"></iframe> -++++ - -Lets look at the donations by city and state - -[loweralpha] -.. Find the sum of the total donations contributed in each state. -.. Create a new column that pastes together the city and state. -.. Find the total donation amount for each city/state location. In the output do you notice anything suspicious in the result? How do you think that occured? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answers to the questions above. -==== - -=== FOUR - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_5335wwv1?wid=_983291"></iframe> -++++ - -Lets take a look who is donating - -[loweralpha] -.. Find the type of data that is in the `NAME` columm -.. Split up the names in the `NAME` column, to extract the first names of the donors. (This will not be perfect, but it is our first attempt.) -.. How much money is donated (altogether) by people named `Mary`? 
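One possible approach to part (b) above is sketched here. It assumes (an assumption on my part, not something stated in the project) that entries in `NAME` look roughly like `"SMITH, MARY A"`, so splitting on the comma and then on spaces yields a rough first name; as the question itself notes, this will not be perfect:

[source,r]
----
# split each name at ", " and keep the part after the comma (NA if there is no comma),
# then keep the first word of that part as a rough first name
mysplit    <- strsplit(as.character(myDF$NAME), ", ")
aftercomma <- sapply(mysplit, `[`, 2)
firstnames <- sapply(strsplit(aftercomma, " "), `[`, 1)

head(firstnames)
----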
- -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answer to the questions above -==== - -=== FIVE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_dpsjs2t3?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_9bq3bc73?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_psgbfiqe?wid=_983291"></iframe> -++++ - -Employment status - -[loweralpha] -.. Using a `barplot` or `dotchart`, show the total amount of donations made by `EMPLOYED` vs `NOT EMPLOYED` individuals -.. What is the category of occupation that donates the most money? -.. Plot something that you find interesting about the employment and/or occupation columns - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences explaining what is was you chose to plot and why -- Answering to the questions above -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project06.adoc deleted file mode 100644 index cde2eac5f..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project06.adoc +++ /dev/null @@ -1,118 +0,0 @@ -= TDM 10100: Project 6 -- Fall 2022 -Tapply, Tapply, Tapply - -**Motivation:** We want to have fun and get used to the function `tapply` - - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/olympics/athlete_events.csv` -- `/anvil/projects/tdm/data/death_records/DeathRecords.csv` - -== Questions - -=== ONE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_uzpijpz5?wid=_983291"></iframe> -++++ - -Read the dataset `/anvil/projects/tdm/data/olympics/athlete_events.csv`, into a data.frame called `eventsDF`. (We do not need the `tapply` function for Question 1.) - -[loweralpha] -.. What are the years included in this data.frame? -.. What are the different countries participating in the Olympics? -.. How many times is each country represented? - - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answers to the code above. -==== - -=== TWO - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_c0jyoxd7?wid=_983291"></iframe> -++++ - -[loweralpha] -.. What is the average height of participants from each country? -.. 
What are the oldest ages of the athletes from each country? -.. What is the sum of the weights of all participants from each country? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answers to the code above -==== - -=== THREE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_xedq6db3?wid=_983291"></iframe> -++++ - -Read the dataset `/anvil/projects/tdm/data/death_records/DeathRecords.csv` into a data.frame called `deathrecordsDF`. (We do not need the `tapply` function for Question 3.) - -[loweralpha] -.. What are the column names in this dataframe? -.. Change the column "DayOfWeekOfDeath" from numbers to weekdays -.. How many people died in total on each day of the week? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answers to the questions above -==== - -=== FOUR - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_04yocw5y?wid=_983291"></iframe> -++++ - -[loweralpha] -.. What is the average age of Females versus Males at death? -.. What is the number of Females who are married? Divorced? Widowed? Single? Now find the analogous numbers for Males. -.. Now solve both questions from 4b at one time, i.e., use one command to find the number of Females who are married, divorced, widowed, or single, and the number of Males in each of these four categories. You can compute all eight numbers with just one `tapply` command. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answers to the question above -==== - -=== FIVE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_61wqo9eb?wid=_983291"></iframe> -++++ - -[loweralpha] -.. Using the two data sets create two separate graphs or plots on the data that you find interesting (one graph or plot for each of the two data sets in this project). Write 1-2 sentences on each one and why you found it interesting/what you noticed in the dataset. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project07.adoc deleted file mode 100644 index 619e7e645..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project07.adoc +++ /dev/null @@ -1,181 +0,0 @@ -= TDM 10100: Project 7 -- 2022 - -**Motivation:** A couple of bread-and-butter functions that are a part of the base R are: `subset`, and `merge`. `subset` provides a more natural way to filter and select data from a data.frame. `merge` brings the principals of combining data that SQL uses, to R. - -**Context:** We've been getting comfortable working with data in within the R environment. 
Now we are going to expand our toolset with these useful functions, all the while gaining experience and practice wrangling data! - -**Scope:** r, subset, merge, tapply - -.Learning Objectives -**** -- Gain proficiency using split, merge, and subset. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Demonstrate how to use tapply to solve data-driven problems. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/movies_and_tv/titles.csv` -- `/anvil/projects/tdm/data/movies_and_tv/episodes.csv` -- `/anvil/projects/tdm/data/movies_and_tv/people.csv` -- `/anvil/projects/tdm/data/movies_and_tv/ratings.csv` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_xx9aqgc7?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_k13gnhii?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_6y6cpgb8?wid=_983291"></iframe> -++++ - - -[IMPORTANT] -==== -Please select 6000 memory when launching Jupyter for this project. -==== - -Data can come in a lot of different formats and from a lot of different locations. It is not uncommon to have one or more files that need to be combined together before analysis is performed. `merge` is a popular function in most data wrangling libraries. It is extremely similar and essentially equivalent to a `JOIN` in SQL. - -Read in each of the datasets into data.frames called: `titles`, `episodes`, `people`, and `ratings`. - -[NOTE] -==== -Read the data in using the following code. `fread` is a _very_ fast and efficient way to read in data. - -[source,r] ----- -library(data.table) - -titles <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/titles.csv")) -episodes <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/episodes.csv")) -people <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/people.csv")) -ratings <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/ratings.csv")) ----- -==== - -- What are all the different listed genres (in the `titles` table)? -- Look at the `years` column and the `genres` column. In which year did the most comedies debut? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_nbc3bscq?wid=_983291"></iframe> -++++ - -Use the `episode_title_id` column and the `title_id` column from the `episodes` and `titles` data.frame's (respectively) to merge the two data.frames. - -Ultimately, we want to end up with a new data.frame that contains the `primary_title` for every episodes in the `episodes` table. Use the `merge` function to accomplish this. 
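A minimal sketch of the general pattern is shown below (the data.frame and column names are the ones defined earlier in this project; which columns you keep, and what you name the result, is up to you):

[source,r]
----
# merge two data.frames on key columns that have different names:
# by.x names the key in the first (left) data.frame, by.y the key in the second
merged <- merge(episodes, titles, by.x = "episode_title_id", by.y = "title_id")

# the merged result now carries the primary_title for each episode
head(merged$primary_title)
----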
- -[TIP] -==== -The `merge` function in `R` allows two data frames to be combined by common columns. This function allows the user to combine data similar to the way `SQL` would using `JOIN`s. https://www.codeproject.com/articles/33052/visual-representation-of-sql-joins[Visual representation of SQL Joins] -==== - -[TIP] -==== -This is also a really great https://www.datasciencemadesimple.com/join-in-r-merge-in-r/[explanation of merge in `R`]. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ninb89fe?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ixuuv053?wid=_983291"></iframe> -++++ - -Use `merge` (a few times) to create a new data.frame that contains at least the following information for **only** the show called "Friends". "Friends" (the show itself) has a `title_id` of tt0108778. Each episode of Friends, has its own `title_id` which contains the information for the specific episode as well. - -- The `primary_title` of the **episode** -- call it `episode_title`. -- The `primary_title` of the **show itself** -- call it `show_title`. -- The `rating` of the show itself -- call it `show_rating`. -- The `rating` of the episode -- call it `episode_rating`. - -[TIP] -==== -Start by getting a subset of the `episodes` table that contains only information for the show Friends. That way, we aren't working with as much data. -==== - -Show the top 5 rows of your final data.frame that contain the top 5 rated episodes. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_jqasawyv?wid=_983291"></iframe> -++++ - -Use regular old indexing to find all episodes of friends with an `episode_rating` greater than 9 and `season_number` of exactly 5. - -Repeat the process, but this time use the `subset` function instead. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ka66ocum?wid=_983291"></iframe> -++++ - -`subset` is a sometimes useful function that allows you to index data.frame's in a less verbose manner. Read https://the-examples-book.com/programming-languages/R/subset[this]. - -While it maybe appears to be a clean way to subset data, I'd suggest avoiding it over explicit long-form indexing. Read http://adv-r.had.co.nz/Computing-on-the-language.html[this fantastic article by Dr. Hadley Wickham on non-standard evaluation]. Take for example, the following (a bit contrived) example using the dataframe we got in question (3). - -[source,r] ----- -season_number = 6 -results[results$episode_rating > 9 & results$season_number == season_number,] -subset(results, episode_rating > 9 & season_number == season_number) ----- - -Read that provided article and do your best to explain _why_ `subset` gets a different result than our example that uses regular indexing. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project08.adoc deleted file mode 100644 index 5d98f0086..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project08.adoc +++ /dev/null @@ -1,203 +0,0 @@ -= TDM 10100: Project 8 -- 2022 - -**Motivation:** Functions are an important part of writing efficient code. + -Functions allow us to repeat and reuse code. If you find yourself using a set of coding steps over and over, a function may be a good way to reduce your lines of code! - -**Context:** We've been learning about and using functions these last few weeks. + -To learn how to write your own functions, we need to learn some of the terminology and components. - -**Scope:** r, functions - -.Learning Objectives -**** -- Gain proficiency using split, merge, and subset. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Comprehend what a function is, and the components of a function in R. -**** - -== Dataset(s) - -We will use the same dataset(s) as last week: - -- `/anvil/projects/tdm/data/movies_and_tv/titles.csv` -- `/anvil/projects/tdm/data/movies_and_tv/episodes.csv` -- `/anvil/projects/tdm/data/movies_and_tv/people.csv` -- `/anvil/projects/tdm/data/movies_and_tv/ratings.csv` - - -[IMPORTANT] -==== -Please select 6000 memory when launching Jupyter for this project. -==== - -.Helpful Hints -[%collapsible] -==== -`fread` is a fast and efficient way to read in data. - -[source,r] ----- -library(data.table) - -titles <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/titles.csv")) -episodes <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/episodes.csv")) -people <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/people.csv")) -ratings <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/ratings.csv")) ----- -==== - -== Questions - -Writing our own function makes a repetitive operation easier by turning it into a single command. + - -Take care to name the function something concise but meaningful, so that other users can understand what the function does. + - -Function parameters can also be called formal arguments. - -.Insider Knowledge -[%collapsible] -==== -A function is an object that contains multiple interrelated statements that are executed in a predefined order when the function is called (run). + - -Functions can be built-in or created by the user (user-defined).
+ - -.Some examples of built in functions are: - -* min(), max(), mean(), median() -* print() -* head() - -==== - -.Helpful Hints -[%collapsible] -==== -Syntax of a function -[source, R] ----- -what_you_name_the_function <- function (parameters) { - statement(s) that are executed when the function runs - the last line of the function is the returned value -} ----- -==== - -=== ONE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_5tcpkrdc?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_hfsql6l6?wid=_983291"></iframe> -++++ - -To gain a better insight into our data, let's make two simple plots: - -[loweralpha] -.. A grouped bar chart https://www.statmethods.net/graphs/bar.html[see an example here] -.. A line plot http://www.sthda.com/english/wiki/line-plots-r-base-graphs[see an example here] -.. What information are you gaining from either of these graphs? - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== TWO - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_hsmthbjo?wid=_983291"></iframe> -++++ - -For practice, now that you have a basic understanding of how to make a function, we will use that knowledge, applied to our dataset. - -Here are pieces of a function we will use on this dataset; put them in the correct order + - -* results <- merge(ratings_df, titles_df, by.x = "title_id", by.y = "title_id") -* } -* function(titles_df, ratings_df, ratings_of_at_least) -* return(popular_movie_results) -* { -* popular_movie_results <- results[results$type == "movie" & results$rating >= ratings_of_at_least, ] -* find_movie_with_at_least_rating <- - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== THREE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_stzcmpa8?wid=_983291"></iframe> -++++ - -Take the above function and add comments explaining what the function does at each step. - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== FOUR - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_v5ind2mk?wid=_983291"></iframe> -++++ - -[source,r] ----- -my_selection <- find_movie_with_at_least_rating(titles, ratings, 7.6) ----- - -Using the code above answer these questions. - -[loweralpha] -.. How many movies in total are there, which are above that limit? -.. Change the limits in the function from "at least 5.0" to "lower than 5.0". -.. How many movies have ratings lower than 5.0? - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== FIVE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_qkcn1hut?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_bcc86jm1?wid=_983291"></iframe> -++++ - -Now create a function that takes a genre as the input and finds either -[loweralpha] -.. the movie from that genre that has the largest number of votes, OR -.. the movie from that genre that has the highest rating. - -(You don't need to do both. In the video, I discuss how to find the movie from that genre that has the highest rating.) - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project09.adoc deleted file mode 100644 index ddce52640..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project09.adoc +++ /dev/null @@ -1,191 +0,0 @@ -= TDM 10100: Project 9 -- 2022 -:page-mathjax: true - -Benford's Law - -**Motivation:** -https://en.wikipedia.org/wiki/Benford%27s_law[Benford's law] has many applications, including its infamous use in fraud detection. It also helps detect anomolies in naturally occurring datasets. - -**Scope:** 'R' and functions - - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -* /anvil/projects/tdm/data/election/escaped2020sample.txt - -.Helpful Hint -[%collapsible] -==== -A txt and csv file both store information in plain text. csv files are always separated by commas. In txt files the fields can be separated with commas, semicolons, or tab. - -To read in a txt file as a csv we simply add sep="|" (see code below) - -[source,r] ----- -myDF <- read.csv("/anvil/projects/tdm/data/election/escaped/escaped2020sample.txt", sep="|") ----- -==== - -== Questions - -https://www.statisticshowto.com/benfords-law/[Benford's law] (also known as the first digit law) states that the leading digits in a collection of datasets will most likely be small. + -It is basically a https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/probability-distribution/[probability distribution] that gives the likelihood of the first digit occurring, in a set of numbers. - -Another way to understand Benford's law is to know that it helps us assess the relative frequency distribution for the leading digits of numbers in a dataset. It states that leading digits with smaller values occur more frequently. - -.Insider Knowledge -[%collapsible] -==== -A probability distrubution helps definte what the probability of an event happening is. 
It can be simple events like a coin toss, or it can be applied to complex events such as the outcome of drug treatments etc. + - -* Basic probability distributions which can be shown on a probability distribution table. -* Binomial distributions, which have “Successes” and “Failures.” -* Normal distributions, sometimes called a Bell Curve. - -Remember that the sum of all the probablities in a distrubution is always 100% or 1 as a decimal. -==== - -.Helpful Hint -[%collapsible] -==== -This law only works for numbers that are *significand S(x)* which means any number that is set into a standard format. + - -To do this you must - -* Find the first non-zero digit -* Move the decimal point to the right of that digit -* Ignore the sign - -An example would be 9087 and -.9087 both have the *S(x)* as 9.087 - -It can also work to find the second, third and succeeding numbers. It can also find the probability of certian combinations of numbers. + - -Typically does not apply to data sets that have a minimum and maximum (restricted). And to datasets if the numbers are assigned (i.e. social security numbers, phone numbers etc.) and not naturally occurring numbers. + - -Larger datasets and data that ranges over multiple orders of magnitudes from low to high work well using Bedford's law. -==== - -Benford's law is given by the equation below. - - -$P(d) = \dfrac{\ln((d+1)/d)}{\ln(10)}$ - -$d$ is the leading digit of a number (and $d \in \{1, \cdots, 9\}$) - -An example the probability of the first digit being a 1 is - -$P(1) = \dfrac{\ln((1+1)/1)}{\ln(10)} = 0.301$ - -=== ONE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_2xdargcf?wid=_983291"></iframe> -++++ - -[loweralpha] - -.. Create a function called `benfords_law` that takes the argument `digit`, and calculates the probability of `digit` being the starting figure of a random number based on Benford's law. - -.. Create a vector named `digits` with numbers 1-9 - -.. Now use the `benfords_law` function to create a plot (could be a bar plot, line plot, dot plot, etc., anything is OK) that shows the likelihood of `digits` occurring - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== TWO - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_c5tlgurx?wid=_983291"></iframe> -++++ - -[loweralpha] -. Read in the elections data (we have used this previously) into a dataset named `myDF`. - -. Create a vector called `firstdigit` with the first digit from the `TRANSACTION_AMT` and then plot it (again, could be a bar plot, line plot, dot plot, etc., anything is OK). - -. Does it look like it follows Bedford's law? Why or why not? - -.Helpful Hint -[%collapsible] -==== -use this to help plot -[source,r] ----- -firstdigit <- as.numeric(firstdigit) -hist(firstdigit) ----- -==== -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== THREE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_4c4osnsj?wid=_983291"></iframe> -++++ - -Create a function that will look at both the `EMPLOYER` and the `OCCUPATION` columns and return a new data frame with an added column named `Employed` that is FALSE if `EMPLOYER` is "NOT EMPLOYED", -and is FALSE if `OCCUPATION` is "NOT EMPLOYED", -and is TRUE otherwise. - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== FOUR - -How many arguments does the above function have? -What does each line do? Use #comment to explain your function. - -Using a graph, can you show the percentage of individuals employed vs not employed? - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== FIVE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_angoxw58?wid=_983291"></iframe> -++++ - -Write your own custom function! Make sure your function has at least two arguments and get creative. Your function could output a plot, or search and find information within the data.frame. Use what you have learned in Project 8 and 9 to help guide you. - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - - -.Resources -[%collapsible] -==== -* https://towardsdatascience.com/what-is-benfords-law-and-why-is-it-important-for-data-science-312cb8b61048["What is Benford's Law and Why is it Important for Data Science"] - -* - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project10.adoc deleted file mode 100644 index 3c08f007f..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project10.adoc +++ /dev/null @@ -1,204 +0,0 @@ -= TDM 10100: Project 10 -- 2022 -Creating functions and using tapply and sapply - -**Motivation:** As we have learned functions are foundational to more complex programs and behaviors. + -There is an entire programming paradigm based on functions called https://en.wikipedia.org/wiki/Functional_programming[functional programming]. - -**Context:** -We will apply functions to entire vectors of data using `sapply`. We learned how to create functions, and now the next step we will take is to use it on a series of data. `sapply` is one of the best ways to do this in `R`. - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
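Before the questions, here is a minimal sketch of the pattern this project practices: write a small function, then let `sapply` apply it to every element of a vector. The vector and function name below are made up for illustration and are not part of the project datasets.

[source,r]
----
# a small helper function that cleans up one label:
# trim surrounding whitespace and convert to lower case
clean_label <- function(x) {
    tolower(trimws(x))
}

# a made-up vector of messy labels
labels <- c("  Apple", "BANANA ", " Cherry")

# sapply applies clean_label to each element and returns a vector
sapply(labels, clean_label)
----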
- -== Dataset(s) - -The following questions will use the following dataset(s): - -* /anvil/projects/tdm/data/okcupid/filtered/users.csv -* /anvil/projects/tdm/data/okcupid/filtered/questions.csv - -.Helpful Hint -[%collapsible] -==== -The read.csv() function automatically uses a comma `,` as the delimiter + -You can use other delimiters by adding the `sep` argument + -e.g. `read.csv(..., sep=';')` + - -Use the `readLines(..., n=x)` function to see the first x rows and identify which character you should use in the `sep` argument. -==== - - -== Questions - -=== ONE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ctbvxg90?wid=_983291"></iframe> -++++ - -We want to go ahead and load the datasets into data.frames named `users` and `questions`. Take a look at both data.frames and identify what is a part of each of them. What information is in each dataset, and how are they related? - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1 or 2 sentences on the datasets. -==== - -=== TWO - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_e5ht4hp0?wid=_983291"></iframe> -++++ - -Simply put, `grep` helps us to find a word within a string. In `R`, `grep` is vectorized and can be applied to an entire vector of strings. We will use it to find any questions that mention `google` in the data.frame `questions`. -[loweralpha] -.. What do you notice if you just use the function `grep()` and create a new variable `google` and then print that variable? - -.. Now that you know the row number, how can you take a look at the information there? - -(Bonus question: can you find a shortcut to steps a & b?) - -.Helpful Hint -[%collapsible] -==== -https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/grep[*grep*] - `grep()` is a function in `R` that is used to search for matches of a pattern within each element of a character vector. -[source,r] ---- -grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, - fixed = FALSE, useBytes = FALSE, invert = FALSE) - -grepl(pattern, x, ignore.case = FALSE, perl = FALSE, - fixed = FALSE, useBytes = FALSE) ---- -==== - -.Insider Information -[%collapsible] -==== -Just an FYI refresher: + - -* `<-` is the assignment operator; it assigns values to a variable - -* Functions *must* be called using the round brackets, aka parentheses *`()`* - -* Square brackets *`[]`* are also called `extraction operators`, as they are used to help extract specific elements from a vector or matrix. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== THREE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_zkieaj0p?wid=_983291"></iframe> -++++ - -[loweralpha] -.. Using the row from our previous question, which variable does this correspond to in the data.frame `users`? - -.. Knowing that the two possible answers are "No. Why spoil the mystery?" and "Yes, Knowledge is power!", what percentage of users do *NOT* google someone before the first date?
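Before applying this to the real `users` data, here is a tiny self-contained sketch of how `table()` and `prop.table()` turn a vector of answers into percentages. The response strings below are invented purely for illustration.

[source,r]
----
# made-up vector of responses
answers <- c("Yes", "No", "No", "Yes", "No", "No")

# counts of each response
counts <- table(answers)
counts

# proportion of each response; multiply by 100 to get percentages
prop.table(counts) * 100
----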
- - -.Helpful Hint -[%collapsible] -==== -* Row 2172 in `questions` corresponds to the column named `q170849` in `users` - -* The `table()` function can be used to quickly create frequency tables - -* The `prop.table()` function can calculate the value of each cell in a table as a proportion of all values. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== FOUR - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_t4p2awp1?wid=_983291"></iframe> -++++ - -Using the ability to create a function *AND* `tapply`, find the percentages of Female vs Male (Man vs Woman, as categorized in the `users` data.frame) who *DO* google someone before their date. - - - -.Helpful Hint -[%collapsible] -==== -* The https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/tapply[`tapply()`] function can be used to apply some function to a vector that has been grouped by another vector. -`tapply(x, INDEX, FUNCTION)` -==== - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== FIVE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_uxqrt3om?wid=_983291"></iframe> -++++ - -Using the ability to create a function *AND* using `sapply()`, write a function that takes a string and removes everything after (and including) the `_` from the `gender_orientation` column in the `users` data.frame. Or it is OK to solve this question as given in the video, without a function and without `sapply()`. - -Meaning that Hetero_male -> Hetero; we want to do this for the entire column `gender_orientation`. - - - -.Insider Information -[%collapsible] -==== -`sapply()` allows you to iterate over a list or vector _without_ the need for a for loop, which is typically a slow way to work in `R`. - -Remember the difference + -(a `very` brief summary of each) - -* A vector is the basic data structure in `R`. Vectors come in two flavors, atomic vectors and lists, and have three common properties - * Type- typeof() - * Length- length() - * Attributes- attributes() -They differ in the types of elements they hold. All elements of an atomic vector must be the same type (atomic vectors are also always "flat"), but elements of a list can be different types. -Lists are constructed with the function `list()`. Atomic vectors are constructed with the function `c()`. -You can check for a specific type by using functions like *is.character(), is.double(), is.integer(), is.logical()* - -* A matrix is two-dimensional (rows and columns), and all cells must be the same type. It can be created with the function `matrix()`. - -* An array can be one-dimensional or multi-dimensional. An array with one dimension is similar (but not identical) to a vector. An array with two dimensions is similar (but not identical) to a matrix. An array with three or more dimensions is an n-dimensional array. Arrays can be created with the function `array()`. - -* A data frame is like a table, or like a matrix, *BUT* the columns can hold different types of data. -==== - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code.
-==== - - -.*Resources* -[%collapsible] -==== -* https://www.geeksforgeeks.org/find-position-of-a-matched-pattern-in-a-string-in-r-programming-grep-function/ - -==== - - - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project11.adoc deleted file mode 100644 index e69de29bb..000000000 diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project12.adoc deleted file mode 100644 index 5475dfb4b..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project12.adoc +++ /dev/null @@ -1,171 +0,0 @@ -= TDM 10100: Project 12 -- 2022 -Tidyverse and Lubridate -**Motivation:** -In the previous project we manipulated dates; in this project we are going to take it a bit further and use the Tidyverse, more specifically the Lubridate package. -Working with dates in `R` can require more attention than working with other object classes. These packages will help simplify some of the common tasks related to date data. + - -Dates and times can be complicated: not every year has 365 days, not every day has 24 hours, and not every minute has 60 seconds. Dates are difficult because they have to accommodate the Earth's rotation and orbit around the sun, as well as time zones, daylight saving time, etc. -Suffice it to say that when focusing on dates and date-times in R, the simpler the better. Lubridate helps with that. - -.Learning Objectives -**** -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Utilize apply functions in order to solve a data-driven problem. -- Gain proficiency using split, merge, and subset. -- Demonstrate the ability to create basic graphs with default settings. -- Demonstrate the ability to modify axis labels and titles. -- Incorporate legends using legend(). -- Demonstrate the ability to customize a plot (color, shape/linetype). -- Convert strings to dates, and format dates using the lubridate package. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- /anvil/projects/tdm/data/zillow/State_time_series.csv - -== Questions -First let's import the libraries + - -* data.table -* lubridate -[source,r] ---- -library(data.table) # make sure to load data.table first -library(lubridate) # and then load lubridate second; it will give you a warning in pink color but it is totally OK -# You need to load `data.table` first and `lubridate` second for this project, because they both define `wday` and we want the version from `lubridate`, so we need to load it second! ---- -We are going to continue to dig into the Zillow time series data.
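If lubridate is new to you, the short sketch below uses a couple of hand-typed dates (not the Zillow data) to show the functions you will lean on most in this project: `ymd()` to parse "YYYY-MM-DD" strings into Date values, and `wday()`/`month()` to pull pieces back out.

[source,r]
----
library(lubridate)

# parse two hand-typed date strings into Date values
d <- ymd(c("2017-01-31", "2017-12-25"))
class(d)                 # "Date"

# day of the week as an ordered factor of (abbreviated) day names
wday(d, label = TRUE)

# day of the week as a number (with the default week_start, Sunday is 1)
wday(d)

# month as a number
month(d)
----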
- -=== ONE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ebro43gk?wid=_983291"></iframe> -++++ - -[loweralpha] -. Go ahead and read in the dataset as `states`. -. Find the class and the type of the column named `Date`. -. Are there multiple functions that will return the same or similar information? - - -.Insider Knowledge -[%collapsible] -==== -Reminder: + -- `class` shows the class of the object given as its argument. The most common ones include, but are not limited to: "numeric", "character", "logical", "Date". + -- `typeof` shows you the type or storage mode of objects. The most common ones include, but are not limited to: "logical", "integer", "double", "complex", "character", "raw" and "list" -==== - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== TWO - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_xhmfza9g?wid=_983291"></iframe> -++++ - -[loweralpha] -. In Project 11, we had to convert the `Date` column to a month, day, year format. Now convert the column `Date` into values from the class *Date*. (You can use lubridate to do so.) What do you think about the methods you have learned (so far) to convert dates? -. Create a new column in your data.frame `states` named `day_of_the_week` that shows the day of the week (Sunday-Saturday). -. Let's create another column in the data.frame `states` that shows the days of the week as numbers. - - -[source,r] ---- -states$Date <- as.Date(states$Date, format="%Y-%m-%d") ---- - - -.Helpful Hint -[%collapsible] -==== -Take a look at the functions `ymd`, `mdy`, `dym` -==== - -.Helpful Hint -[%collapsible] -==== -- Take a look at the functions `month`, `year`, `day`, `wday`. -- The *label* argument is logical, and is available for the wday() (and month()) function. TRUE will display the day of the week as an ordered factor of character strings, such as "Sunday." FALSE will display the day of the week as a number. -- The *week_start* argument controls which day the week starts on: 1 means Monday and 7 (the default) means Sunday. When label = TRUE, this will be the first level of the returned factor. You can set the lubridate.week.start option to control this parameter. -==== - -.Insider Knowledge -[%collapsible] -==== -By default, values of class *Date* in `R` are displayed as YYYY-MM-DD -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== THREE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ej2mh83u?wid=_983291"></iframe> -++++ - -We want to see if there are better months for putting our house on the market. -[loweralpha] -. Use `tapply` to compare the average `DaysOnZillow_AllHomes` for all months. -. Make a barplot showing our results. - - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== FOUR - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_fqtu8l4o?wid=_983291"></iframe> -++++ - -Find the information only for the year 2017 and call it `states2017`. Then create a lineplot that shows the average `DaysOnZillow_AllHomes` by `Date` using the `states2017` data. What do you notice?
When was the best month/months for posting a home for sale in 2017? - -=== FIVE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_5gw71nb1?wid=_983291"></iframe> -++++ - -Now we want to know if homes sell faster in different states? Lets look at Indiana, Maine, and Hawaii. Create a lineplot that uses `DaysOnZillow_AllHomes` by `Date` with one line per state. Use the `states2017` dataset for this question. Make sure to have each state line colored differently and have a legend to identify which is which. - -.Helpful Hint -[%collapsible] -==== -Use the `lines()` function to add lines to your plot + -Use the `ylim` argument to show all lines + -Use the `col` argument to identify and alter colors. -==== - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project13.adoc deleted file mode 100644 index 23f0cdd27..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-project13.adoc +++ /dev/null @@ -1,82 +0,0 @@ -= TDM 10100: Project 13 -- 2022 - -**Motivation:** This semester we took a deep dive into `R` and it's packages. Lets take a second to pat ourselves on the back for surviving a long semester and review what we have learned! - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- /anvil/projects/tdm/data/beer/beers.csv - -== Questions - -=== ONE - -++++ -<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_hjlpsvtu&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_aheik41m" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="TDM 10100 Project 13 Question 1"></iframe> -++++ - -Read in the dataset and into a data.frame called `beer` -[loweralpha] -. What is the file size, how many rows, columns and type of data? -. 
What is the average score for a `stout`? (consider a stout to be any beer from the column `name` with the word `stout` in it) -. How many `Pale Ales` are on this list? (consider a pale ale to be any beer from the column `name` with the words `pale` and `ale` in it) - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== TWO - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_stltxfhx?wid=_983291"></iframe> -++++ - -. Plot or graph all the beers that are available in the summer and their ratings. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== THREE - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_uwvx1gze?wid=_983291"></iframe> -++++ - -. Create a plot of the average rating of beer by country. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== FOUR - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_4fzvjp6k?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_5rnyo1ie?wid=_983291"></iframe> -++++ - -. Do `limited` runs of beer have a greater median rating than all others? -(consider limited to be any beer that has the word `Limited` in the `availablity` column) - -. Use the `unique` function to investigate the `availablity` column. Why are there different labels that are technically the same? - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-projects.adoc deleted file mode 100644 index 104f7f661..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/10100/10100-2022-projects.adoc +++ /dev/null @@ -1,41 +0,0 @@ -= TDM 10100 - -== Project links - -[NOTE] -==== -Only the best 10 of 13 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$10100-2022-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you completed, we will _not_ be able to give you credit for it. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information.
- -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -=== Sign up - -https://piazza.com/purdue/fall2022/tdm10100[https://piazza.com/purdue/fall2022/tdm10100] - -=== Link - -https://piazza.com/purdue/fall2022/tdm10100/home[https://piazza.com/purdue/fall2022/tdm10100/home] - -== Syllabus - -See xref:fall2022/logistics/syllabus.adoc[here]. diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project01.adoc deleted file mode 100644 index c677b0ee6..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project01.adoc +++ /dev/null @@ -1,282 +0,0 @@ -= TDM 20100: Project 1 -- 2022 - -**Motivation:** It’s been a long summer! Last year, you got some exposure to both R and Python. This semester, we will venture away from R and Python, and focus on UNIX utilities like `sort`, `awk`, `grep`, and `sed`. While Python and R are extremely powerful tools that can solve many problems — they aren’t always the best tool for the job. UNIX utilities can be an incredibly efficient way to solve problems that would be much less efficient using R or Python. In addition, there will be a variety of projects where we explore SQL using `sqlite3` and `MySQL/MariaDB`. - -We will start slowly, however, by learning about Jupyter Lab. This year, instead of using RStudio Server, we will be using Jupyter Lab. In this project we will become familiar with the new environment, review some, and prepare for the rest of the semester. - -**Context:** This is the first project of the semester! We will start with some review, and set the "scene" to learn about some powerful UNIX utilities, and SQL the rest of the semester. - -**Scope:** Jupyter Lab, R, Python, Anvil, markdown - -.Learning Objectives -**** -- Read about and understand computational resources available to you. -- Learn how to run R code in Jupyter Lab on Anvil. -- Review R and Python. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/1991.csv` -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_5vtofjko?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_1gf9pnt2?wid=_983291"></iframe> -++++ - -For this course, projects will be solved using the https://www.rcac.purdue.edu/compute/anvil[Anvil computing cluster]. - -Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (RAM). Use the information on the provided webpages to calculate how many cores and how much memory is available _in total_ for the Anvil "sub-clusters". - -Take a minute and figure out how many cores and how much memory is available on your own computer. 
If you do not have a computer of your own, work with a friend to see how many cores there are, and how much memory is available, on their computer. - -[NOTE] -==== -Last year, we used the https://www.rcac.purdue.edu/compute/brown[Brown computing cluster]. Compare the specs of https://www.rcac.purdue.edu/compute/anvil[Anvil] and https://www.rcac.purdue.edu/compute/brown[Brown] -- which one is more powerful? -==== - -.Items to submit -==== -- A sentence explaining how many cores and how much memory is available, in total, across all nodes in the sub-clusters on Anvil. -- A sentence explaining how many cores and how much memory is available, in total, for your own computer. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Like the previous year we will be using Jupyter Lab on the Anvil cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster. - -Navigate and login to https://ondemand.anvil.rcac.purdue.edu using your ACCESS credentials (and Duo). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward. - -[TIP] -==== -If you did not (yet) setup your 2-factor authentication credentials with Duo, you can go back to Step 9 and setup the credentials here: https://the-examples-book.com/starter-guides/anvil/access-setup -==== - -Towards the middle of the top menu, there will be an item labeled btn:[My Interactive Sessions], click on btn:[My Interactive Sessions]. On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Bioinformatics, Interactive Apps, and The Data Mine. Under "The Data Mine" section, you should see a button that says btn:[Jupyter Notebook], click on btn:[Jupyter Notebook]. - -If everything was successful, you should see a screen similar to the following. - -image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"] - -Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 2 CPU cores and 3800 Mb. - -[NOTE] -==== -It is OK to not understand what that means yet, we will learn more about this in TDM 30100. For the curious, however, if you were to open a terminal session in Anvil and run the following, you would see your job queued up. - -[source,bash] ----- -squeue -u username # replace 'username' with your username ----- -==== - -[NOTE] -==== -If you select 4000 Mb of memory instead of 3800 Mb, you will end up getting 3 CPU cores instead of 2. OnDemand tries to balance the memory to CPU ratio to be _about_ 1900 Mb per CPU core. -==== - -We use the Anvil cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine. - -After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on btn:[Connect to Jupyter] to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following. - -image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"] - -There are 2 primary options that you will need to know about. 
- -f2022-s2023:: -The course kernel where Python code is run without any extra work, and you have the ability to run R code or SQL queries in the same environment. - -[TIP] -==== -To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page]. -==== - -f2022-s2023-r:: -An alternative, native R kernel that you can use for projects with _just_ R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell. - -For now, let's focus on the f2022-s2023 kernel. Click on btn:[f2022-s2023], and a fresh notebook will be created for you. - -[NOTE] -==== -Soon, we'll have the f2022-s2023-r kernel available and ready to use! -==== - -Test it out! Run the following code in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on. What is the name of the node on Anvil that you are running on? - -[source,python] ----- -import socket -print(socket.gethostname()) ----- - -[TIP] -==== -To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu. -==== - -.Items to submit -==== -- Code used to solve this problem in a "code" cell. -- Output from running the code (the name of the node on Anvil that you are running on). -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_6s6gsi1e?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_708jtb6h?wid=_983291"></iframe> -++++ - -In the upper right-hand corner of your notebook, you will see the current kernel for the notebook, `f2022-s2023`. If you click on this name you will have the option to swap kernels out -- no need to do this yet, but it is good to know! - -Practice running the following examples. - -python:: -[source,python] ----- -my_list = [1, 2, 3] -print(f'My list is: {my_list}') ----- - -SQL:: -[source, sql] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -[source, ipython] ----- -%%sql - -SELECT * FROM titles LIMIT 5; ----- - -[NOTE] -==== -In a previous semester, you'd need to load the sql extension first -- this is no longer needed as we've made a few improvements! - -[source,ipython] ----- -%load_ext sql ----- -==== - -bash:: -[source,bash] ----- -%%bash - -awk -F, '{miles=miles+$19}END{print "Miles: " miles, "\nKilometers:" miles*1.609344}' /anvil/projects/tdm/data/flights/subset/1991.csv ----- - -[TIP] -==== -To learn more about how to run various types of code using this kernel, see https://the-examples-book.com/projects/templates[our template page]. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -This year, the first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`). - -Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`. 
- -There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain markdown text which you can render into nicely formatted text). How many cells of each type are there in this template by default? - -Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A. - -.Items to submit -==== -- How many of each types of cells are there in the default template? -==== - -=== Question 5 - -Markdown is well worth learning about. You may already be a Markdown expert, however, more practice never hurts. - -Create a Markdown cell in your notebook. - -Create both an _ordered_ and _unordered_ list. Create an unordered list with 3 of your favorite academic interests (some examples could include: machine learning, operating systems, forensic accounting, etc.). Create another _ordered_ list that ranks your academic interests in order of most-interested to least-interested. To practice markdown, **embolden** at least 1 item in you list, _italicize_ at least 1 item in your list, and make at least 1 item in your list formatted like `code`. - -[TIP] -==== -You can quickly get started with Markdown using this cheat sheet: https://www.markdownguide.org/cheat-sheet/ -==== - -[TIP] -==== -Don't forget to "run" your markdown cells by clicking the small "Play" button in the notebook menu. Running a markdown cell will render the text in the cell with all of the formatting you specified. Your unordered lists will be bulleted and your ordered lists will be numbered. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 6 - -Browse https://www.linkedin.com and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown in a new Markdown cell. Include the following (at a minimum): - -- A header for this section (your choice of size) that says "About". -+ -[TIP] -==== -A Markdown header is a line of text at the top of a Markdown cell that begins with one or more `#`. -==== -+ -- The text of your personal "About" section that you would feel comfortable uploading to LinkedIn. -- In the about section, _for the sake of learning markdown_, include at least 1 link using Markdown's link syntax. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 7 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_dsk4jniu?wid=_983291"></iframe> -++++ - -Review your Python and R skills. For each language, choose at least 1 dataset from `/anvil/projects/tdm/data`, and analyze it. Both solutions should include at least 1 custom function, and at least 1 graphic output. Make sure your code is complete, and well-commented. Include a markdown cell with your short analysis (1 sentence is fine), for each language. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project02.adoc deleted file mode 100644 index 570e44590..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project02.adoc +++ /dev/null @@ -1,312 +0,0 @@ -= TDM 20100: Project 2 -- 2022 - -**Motivation:** The ability to navigate a shell, like `bash`, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful UNIX tools, help you navigate a filesystem, and even run UNIX tools from within your Jupyter Lab notebook. - -**Context:** At this point in time, our Jupyter Lab system, using https://ondemand.anvil.rcac.purdue.edu, is new to some of you, and maybe familiar to others. The comfort with which you each navigate this UNIX-like operating system will vary. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within Jupyter Lab. - -**Scope:** bash, Jupyter Lab - -.Learning Objectives -**** -- Distinguish differences in `/home`, `/anvil/scratch`, and `/anvil/projects/tdm`. -- Navigating UNIX via a terminal: `ls`, `pwd`, `cd`, `.`, `..`, `~`, etc. -- Analyzing file in a UNIX filesystem: `wc`, `du`, `cat`, `head`, `tail`, etc. -- Creating and destroying files and folder in UNIX: `scp`, `rm`, `touch`, `cp`, `mv`, `mkdir`, `rmdir`, etc. -- Use `man` to read and learn about UNIX utilities. -- Run `bash` commands from within Jupyter Lab. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data` - -== Questions - -[IMPORTANT] -==== -If you are not a `bash` user and you use an alternative shell like `zsh` or `tcsh`, you will want to switch to `bash` for the remainder of the semester, for consistency. Of course, if you plan on just using Jupyter Lab cells, the `%%bash` magic will use `/bin/bash` rather than your default shell, so you will not need to do anything. -==== - -[NOTE] -==== -While it is not _super_ common for us to push a lot of external reading at you (other than the occasional blog post or article), https://learning.oreilly.com/library/view/learning-the-unix/0596002610[this] is an excellent, and _very_ short resource to get you started using a UNIX-like system. We strongly recommend readings chapters: 1, 3, 4, 5, & 7. It is safe to skip chapters 2, 6, and 8. -==== - -=== Question 1 - -Let's ease into this project by taking some time to adjust the environment you will be using the entire semester, to your liking. Begin by launching your Jupyter Lab session from https://ondemand.anvil.rcac.purdue.edu. - -Open your settings by navigating to menu:Settings[Advanced Settings Editor]. 
- -Explore the settings, and make at least 2 modifications to your environment, and list what you've changed. - -Here are some settings Kevin likes: - -- menu:Theme[Selected Theme > JupyterLab Dark] -- menu:Document Manager[Autosave Interval > 30] -- menu:File Browser[Show hidden files > true] -- menu:Notebook[Line Wrap > on] -- menu:Notebook[Show Line Numbers > true] -- menu:Notebook[Shut down kernel > true] - -Dr. Ward does not like to customize his own environment, but he _does_ use the Emacs key bindings. - -- menu:Settings[Text Editor Key Map > emacs] - -[IMPORTANT] -==== -Only modify your keybindings if you know what you are doing, and like to use Emacs/Vi/etc. -==== - -.Items to submit -==== -- List (using a markdown cell) of the modifications you made to your environment. -==== - -=== Question 2 - -In the previous project, we used a tool called `awk` to parse through a dataset. This was an example of running bash code using the `f2022-s2023` kernel. Aside from use the `%%bash` magic from the previous project, there are 2 more straightforward ways to run bash code from within Jupyter Lab. - -The first method allows you to run a bash command from within the same cell as a cell containing Python code. For example. - -[source,ipython] ----- -!ls - -import pandas as pd -myDF = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]}) -myDF.head() ----- - -[NOTE] -==== -This does _not_ require you to have other, Python code in the cell. The following is perfectly valid. - -[source,ipython] ----- -!ls -!ls -la /anvil/projects/tdm/ ----- - -With that being said, using this method, each line _must_ start with an exclamation point. -==== - -The second method is to open up a new terminal session. To do this, go to menu:File[New > Terminal]. This should open a new tab and a shell for you to use. You can make sure the shell is working by typing your first command, `man`. - -[source,bash] ----- -# man is short for manual, to quit, press "q" -# use "k" or the up arrow to scroll up, or "j" or the down arrow to scroll down. -man man ----- - -Great! Now that you've learned 2 new ways to run `bash` code from within Jupyter Lab, please answer the following question. What is the _absolute path_ of the default directory of your `bash` shell? When we say "default directory" we mean the folder that you are "in" when you first run `bash` code in a Jupyter cell or when you first open a Terminal. This is also referred to as the home directory. - -**Relevant topics:** https://the-examples-book.com/starter-guides/unix/pwd[pwd] - -.Items to submit -==== -- The full filepath of the default directory (home directory). Ex: Kevin's is: `/home/x-kamstut` and Dr Ward's is: `/home/x-mdw`. -- The `bash` code used to show your home directory or current directory (also known as the working directory) when the `bash` shell is first launched. -==== - -=== Question 3 - -It is critical to be able to navigate a UNIX-like operating system. It is likely that you will need to use UNIX or Linux (or a similar system) at some point in your career. Perform the following actions, in order, using the `bash` shell. - -[WARNING] -==== -For the sake of consistency, please run your `bash` code using the `%%bash` magic. This ensures that we are all using the correct shell (there are many shells), and that your work is displayed properly for your grader. -==== - -. Write a command to navigate to the directory containing the datasets used in this course: `/anvil/projects/tdm/data`. -. 
Print the current working directory, is the result what you expected? Output the `$PWD` variable, using the `echo` command. -. List the files within the current working directory (excluding subfiles). -. Without navigating out of `/anvil/projects/tdm/data`, list _all_ of the files within the the `movies_and_tv` directory, _including_ hidden files. -. Return to your home directory. -. Write a command to confirm that you are back in the appropriate directory. - -[NOTE] -==== -`/` is commonly referred to as the root directory in a UNIX-like system. Think of it as a folder that contains _every_ other folder in the computer. `/home` is a folder within the root directory. `/home/x-kamstut` is the _absolute path_ of Kevin's home directory. There is a folder called `home` inside the root `/` directory. Inside `home` is another folder named `x-kamstut`, which is Kevin's home directory. -==== - -**Relevant topics:** xref:starter-guides:data-science:unix:pwd.adoc[pwd], xref:starter-guides:data-science:unix:cd.adoc[cd], xref:starter-guides:data-science:unix:echo.adoc[echo], xref:starter-guides:data-science:unix:ls.adoc[ls] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -When running the `ls` command (specifically the `ls` command that showed hidden files and folders), you may have noticed two oddities that appeared in the output: "." and "..". `.` represents the directory you are currently in, or, if it is a part of a path, it means "this directory". For example, if you are in the `/anvil/projects/tdm/data` directory, the `.` refers to the `/anvil/projects/tdm/data` directory. If you are running the following bash command, the `.` is redundant and refers to the `/anvil/projects/tdm/data/yelp` directory. - -[source,bash] ----- -ls -la /anvil/projects/tdm/data/yelp/. ----- - -`..` represents the parent directory, relative to the rest of the path. For example, if you are in the `/anvil/projects/tdm/data` directory, the `..` refers to the parent directory, `/anvil/projects/tdm`. - -Any path that contains either `.` or `..` is called a _relative path_ (because it is _relative_ to the directory you are currently in). Any path that contains the entire path, starting from the root directory, `/`, is called an _absolute path_. - -. Write a single command to navigate to our modulefiles directory: `/anvil/projects/tdm/opt/lmod`. -. Confirm that you are in the correct directory using the `echo` command. -. Write a single command to navigate back to your home directory, however, rather than using `cd`, `cd ~`, or `cd $HOME` without the path argument, use `cd` and a _relative_ path. -. Confirm that you are in the corrrect directory using the `echo` command. - -[NOTE] -==== -If you don't fully understand the text above, _please_ take the time to understand it. It will be incredibly helpful to you, not only in this class, but in your career. -==== - -**Relevant topics:** xref:starter-guides:data-science:unix:pwd.adoc[pwd], xref:starter-guides:data-science:unix:cd.adoc[cd], xref:starter-guides:data-science:unix:special-symbols.adoc[. & .. & ~] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Your `$HOME` directory is your default directory. You can navigate to your `$HOME` directory using any of the following commands. 
- -[source,bash] ----- -cd -cd ~ -cd $HOME -cd /home/$USER ----- - -This is typically where you will work, and where you will store your work (for instance, your completed projects). - -[NOTE] -==== -`$HOME` and `$USER` are environment variables. You can see what they are by typing `echo $HOME` and `echo $USER`. Environment variables are variables that are set by the system, or by the user. To get a list of your terminal session's environment variables, type `env`. -==== - -The `/anvil/projects/tdm` space is a directory created for The Data Mine. It holds our datasets (in the `data` directory), as well as data for many of our corporate partners projects. - -There exists 1 more important location on each cluster, `scratch`. Your `scratch` directory is located at `/anvil/scratch/$USER`, or, even shorter, `$SCRATCH`. `scratch` is meant for use with _really_ large chunks of data. The quota on Anvil is currently 100TB and 1 million files. You can see your quota and usage on Anvil by running the following command. - -[source,bash] ----- -myquota ----- - -[TIP] -==== -`$SCRATCH` and `$USER` are environment variables. You can see what they are by typing `echo $SCRATCH` and `echo $USER`. `$SCRATCH` contains the absolute path to your scratch directory, and `$USER` contains the username of the current user. -==== - -In a `bash` cell, please perform the following operations. - -. Navigate to your `scratch` directory. -. Confirm that you are in the correct location using a command. -. Execute the `/anvil/projects/tdm/bin/tokei` command, with input `/home/x-kamstut/bin`. -+ -[NOTE] -==== -Doug Crabill is the compute wizard for the Statistics department here at Purdue. `~dgc/bin` is a directory (on a different cluster) he has made publicly available with a variety of useful scripts. I've copied over those files to `~x-kamstut/bin`. -==== -+ -. Output the first 5 lines and last 5 lines of `~x-kamstut/bin/union`. -. Count the number of lines in the bash script `~x-kamstut/bin/union` (using a UNIX command). -. How many bytes is the script? -+ -[CAUTION] -==== -Be careful. We want the size of the script, not the disk usage. -==== -+ -. Find the location of the `python3` command. - -[TIP] -==== -Commands often have _options_. _Options_ are features of the program that you can trigger specifically. You can see the options of a command in the DESCRIPTION section of the man pages. - -[source,bash] ----- -man wc ----- - -You can see -m, -l, and -w are all options for `wc`. Then, to test the options out, you can try the following examples. - -[source,bash] ----- -# using the default wc command. "/anvil/projects/tdm/data/flights/1987.csv" is the first "argument" given to the command. 
-wc /anvil/projects/tdm/data/flights/1987.csv - -# to count the lines, use the -l option -wc -l /anvil/projects/tdm/data/flights/1987.csv - -# to count the words, use the -w option -wc -w /anvil/projects/tdm/data/flights/1987.csv - -# you can combine options as well -wc -w -l /anvil/projects/tdm/data/flights/1987.csv - -# some people like to use a single tack `-` -wc -wl /anvil/projects/tdm/data/flights/1987.csv - -# order doesn't matter -wc -lw /anvil/projects/tdm/data/flights/1987.csv ----- -==== - -**Relevant topics:** xref:starter-guides:data-science:unix:pwd.adoc[pwd], xref:starter-guides:data-science:unix:cd.adoc[cd], xref:starter-guides:data-science:unix:head.adoc[head], xref:starter-guides:data-science:unix:tail.adoc[tail], xref:starter-guides:data-science:unix:wc.adoc[wc], xref:starter-guides:data-science:unix:du.adoc[du], xref:starter-guides:data-science:unix:which.adoc[which], xref:starter-guides:data-science:unix:type.adoc[type] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 6 - -Perform the following operations. - -. Navigate to your scratch directory. -. Copy the following file to your current working directory: `/anvil/projects/tdm/data/movies_and_tv/imdb.db`. -. Create a new directory called `movies_and_tv` in your current working directory. -. Move the file, `imdb.db`, from your scratch directory to the newly created `movies_and_tv` directory (inside of scratch). -. Use `touch` to create a new, empty file called `im_empty.txt` in your scratch directory. -. Remove the directory, `movies_and_tv`, from your scratch directory, including _all_ of the contents. -. Remove the file, `im_empty.txt`, from your scratch directory. - -**Relevant topics:** xref:starter-guides:data-science:unix:cp.adoc[cp], xref:starter-guides:data-science:unix:rm.adoc[rm], xref:starter-guides:data-science:unix:touch.adoc[touch], xref:starter-guides:data-science:unix:cd.adoc[cd] - -=== Question 7 - -[IMPORTANT] -==== -This question should be performed by opening a terminal window. menu:File[New > Terminal]. Enter the result/content in a markdown cell in your notebook. -==== - -Tab completion is a feature in shells that allows you to tab through options when providing an argument to a command. It is a _really_ useful feature, that you may not know is there unless you are told! - -Here is the way it works, in the most common case -- using `cd`. Have a destination in mind, for example `/anvil/projects/tdm/data/flights/`. Type `cd /anvil/`, and press tab. You should be presented with a small list of options -- the folders in the `anvil` directory. Type `p`, then press tab, and it will complete the word for you. Type `t`, then press tab. Finally, press tab, but this time, press tab repeatedly until you've selected `data`. You can then continue to type and press tab as needed. - -Below is an image of the absolute path of a file in Anvil. Use `cat` and tab completion to print the contents of that file. - -image::figure03.webp[Tab completion, width=792, height=250, loading=lazy, title="Tab completion"] - -.Items to submit -==== -- The content of the file, `hello_there.txt`, in a markdown cell in your notebook. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project03.adoc deleted file mode 100644 index c0ee7b8dc..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project03.adoc +++ /dev/null @@ -1,202 +0,0 @@ -= TDM 20100: Project 3 -- 2022 - -**Motivation:** The need to search files and datasets based on the text held within is common during various parts of the data wrangling process -- after all, projects in industry will not typically provide you with a path to your dataset and call it a day. `grep` is an extremely powerful UNIX tool that allows you to search text using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated, https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/[even professionals can make critical mistakes]. With that being said, learning some of the basics is an incredible tool that will come in handy regardless of the language you are working in. - -[NOTE] -==== -Regular expressions are not something you will be able to completely escape from. They exist in some way, shape, and form in all major programming languages. Even if you are less-interested in UNIX tools (which you shouldn't be, they can be awesome), you should definitely take the time to learn regular expressions. -==== - -**Context:** We've just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, `grep`, and experiment with regular expressions using `grep`, R, and later on, Python. - -**Scope:** `grep`, regular expression basics, utilizing regular expression tools in R and Python - -.Learning Objectives -**** -- Use `grep` to search for patterns within a dataset. -- Use `cut` to section off and slice up data from the command line. -- Use `wc` to count the number of lines of input. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/consumer_complaints/complaints.csv` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_8973y4i3?wid=_983291"></iframe> -++++ - -`grep` stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate `grep`, we will be using it with textual data. - -Let's assume for a second that we _didn't_ provide you with the location of this projects dataset, and you didn't know the name of the file either. With all of that being said, you _do_ know that it is the only dataset with the text "That's the sort of fraudy fraudulent fraud that Wells Fargo defrauds its fraud-victim customers with. Fraudulently." in it. 
- -[TIP] -==== -When you search for this sentence in the file, make sure that you type the single quote in "That's" so that you get a regular ASCII single quote. Otherwise, you will not find this sentence. Or, just use a unique _part_ of the sentence that will likely not exist in another file. -==== - -Write a `grep` command that finds the dataset. You can start in the `/anvil/projects/tdm/data` directory to reduce the amount of text being searched. In addition, use a wildcard to reduce the directories we search to only directories that start with a `con` inside the `/anvil/projects/tdm/data` directory. Just know that you'd _eventually_ find the file without using the wildcard, but we don't want to waste your time. - -[TIP] -==== -Use `man` to read about some of the options with `grep`. For example, you'll want to search _recursively_ through the entire contents of the directories starting with a `con`. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_sm3x53u2?wid=_983291"></iframe> -++++ - -In the previous project, you learned about a command that could quickly print out the first _n_ lines of a file. A csv file typically has a header row to explain what data each column holds. Use the command you learned to print out the first line of the file, and _only_ the first line of the file. - -Great, now that you know what each column holds, repeat question (1), but, format the output so that it shows the `complaint_id`, `consumer_complaint_narrative`, and the `state`. Print only the first 100 lines (using `head`) so our notebook is not too full of text. - -Now, use `cat`, `head`, `tail`, and `cut` to isolate those same 3 columns for the _single_ line where we heard about the "fraudy fraudulent fraud". - -[TIP] -==== -You can find the exact line from the file where the "fraudy fraudulent fraud" occurs, by using the `n` option from `grep`. That will tell you the line number, which can then be used with `head` and `tail` to isolate the single line. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_8c1b927f?wid=_983291"></iframe> -++++ - -Imagine a scenario where we are dealing with a _much_ bigger dataset. Imagine that we live in the southeast and are really only interested in analyzing the data for Florida, Georgia, Mississippi, Alabama, and South Carolina. In addition, we are only interested in in the `consumer_complaint_narrative`, `state`, `tags`, and `complaint_id`. - -Use UNIX tools to, in one line, create a _new_ dataset called `southeast.csv` that only contains the data for the five states mentioned above, and only the columns listed above. - -[TIP] -==== -Be careful you don't accidentally get lines with a word like "CAPITAL" in them (AL is the state code of Alabama and is present in the word "CAPITAL"). -==== - -How many rows of data remain? How many megabytes is the new file? Use `cut` to isolate _just_ the data we ask for. For example, _just_ print the number of rows, and _just_ print the value (in Mb) of the size of the file. 
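One way to trim the output down to exactly those two values is sketched below, assuming the new file was written to `$HOME/southeast.csv` as in the sample output that follows.

[source,bash]
----
# reading from stdin keeps wc from echoing the filename, so only the line count is printed
wc -l < $HOME/southeast.csv

# du -h prints "size<TAB>path", so keeping only the first field leaves just the size
du -h $HOME/southeast.csv | cut -f1
----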
- -.this ----- -20M ----- - -.not this ----- --rw-r--r-- 1 x-kamstut x-tdm-admin 20M Dec 13 10:59 /home/x-kamstut/southeast.csv ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ecu7yzmk?wid=_983291"></iframe> -++++ - -We want to isolate some of our southeast complaints. Return rows from our new dataset, `southeast.csv`, that have one of the following words: "wow", "irritating", or "rude" followed by at least 1 exclamation mark. Do this with just a single `grep` command. Ignore case (whether or not parts of the "wow", "rude", or "irritating" words are capitalized or not). Limit your output to only 5 rows (using `head`). - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_1d9dwn8b?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_xg6wpbfj?wid=_983291"></iframe> -++++ - -If you pay attention to the `consumer_complaint_narrative` column in our new dataset, `southeast.csv`, you'll notice that some of the narratives contain dollar amounts in curly braces `{` and `}`. Use `grep` to find the narratives that contain at least one dollar amount enclosed in curly braces. Use `head` to limit output to only the first 5 results. - -[TIP] -==== -Use the option `-E` to use extended regular expressions. This will make your regular expressions less messy (less escaping). -==== - -[NOTE] -==== -There are instances like `{>= $1000000}` and `{ XXXX }`. The first example qualifies, but the second doesn't. Make sure the following are matched: - -- {$0.00} -- { $1,000.00 } -- {>= $1000000} -- { >= $1000000 } - -And that the following are _not_ matched: - -- { XXX } -- {XXX} -==== - -[TIP] -==== -Regex is hard. Try the following logic. - -. Match a "{" -. Match 0 or more of any character that isn't a-z, A-Z, or 0-9 -. Match 1 or more "$" -. Match 1 or more of any character that isn't "}" -. Match "}" -==== - -[TIP] -==== -To verify your answer, the following code should have the following result. - -[source,bash] ----- -grep -E 'regexhere' $HOME/southeast.csv | head -n 5 | cut -d, -f4 ----- - -.result ----- -3185125 -3184467 -3183547 -3183544 -3182879 ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project04.adoc deleted file mode 100644 index 95d488dae..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project04.adoc +++ /dev/null @@ -1,140 +0,0 @@ -= TDM 20100: Project 4 -- 2022 - -**Motivation:** Becoming comfortable chaining commands and getting used to navigating files in a terminal is important for every data scientist to do. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc. While it is always fair to whip together a script using your favorite language, you may find that these UNIX tools are a better fit for your needs. - -**Context:** We've been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping. - -**Scope:** grep, regular expression basics, UNIX utilities, redirection, piping - -.Learning Objectives -**** -- Use `cut` to section off and slice up data from the command line. -- Use piping to string UNIX commands together. -- Use `sort` and it's options to sort data in different ways. -- Use `head` to isolate n lines of output. -- Use `wc` to summarize the number of lines in a file or in output. -- Use `uniq` to filter out non-unique lines. -- Use `grep` to search files effectively. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/stackoverflow/unprocessed/*` -- `/anvil/projects/tdm/data/stackoverflow/processed/*` -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` - -== Questions - -[WARNING] -==== -For this project, please submit a `.sh` text file with all of you `bash` code written inside of it. This should be submitted _in addition to_ your notebook (the `.ipynb` file). Failing to submit the accompanying `.sh` file may result and points being removed from your final submission. Thanks! -==== - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_7xn1j7cv?wid=_983291"></iframe> -++++ - -In a csv file, there are n+1 columns where n is the number of commas (in theory). Take the first line of `unprocessed/2011.csv`, replace all commas with the newline character, `\n`, and use `wc` to count the resulting number of lines. This should approximate how many columns are in the dataset. What is the value? - -This can't be right, can it? Print the first 100 lines after using `tr` to replace commas with newlines. What do you notice? - -[TIP] -==== -The newline character in UNIX is `\n`. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_8bkgq87y?wid=_983291"></iframe> -++++ - -As you can see, csv files are not always so straightforward to parse. 
For this particular set of questions, we want to focus on using other UNIX tools that are more useful on semi-clean datasets. Take a look at the first few lines of the data in `processed/2011.csv`. How many columns are there? - -Take a look at `iowa_liquor_sales_cleaner.txt` -- how many columns does that file have? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_h50hc11a?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_shmmqrtb?wid=_983291"></iframe> -++++ - -Continuing to look at the `iowa_liquor_sales_cleaner.txt` dataset, what are the 5 largest orders by number of bottles sold? How about by Gallons sold? - -[TIP] -==== -`cat`, `cut`, `sort`, and `head` will be useful. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_jujp467m?wid=_983291"></iframe> -++++ - -What are the different sizes (in ml) that a bottle of liquor comes in? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ie2xt65f?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_v7vm4kov?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_wkdg894n?wid=_983291"></iframe> -++++ - -https://en.wikipedia.org/wiki/Benford%27s_law[Benford's law] states that the leading digit in real-life sets of numerical data, the leading digit is likely to follow a distinct distribution (see the plot in the https://en.wikipedia.org/wiki/Benford%27s_law[provided link]). By this logic, the dollar amount in the orders should roughly match this, right? - -Use any available `bash` tools you'd like to get a good idea of the count or percentage of the sales (in dollars) by starting digit. Are the results expected? Could there be some "funny business" going on? Write 1-2 sentences explaining what you think. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project05.adoc deleted file mode 100644 index 36e15ef75..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project05.adoc +++ /dev/null @@ -1,169 +0,0 @@ -= TDM 20100: Project 5 -- 2022 - -**Motivation:** `awk` is a programming language designed for text processing. It can be a quick and efficient way to quickly parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation extremely quickly using something like `awk`. - -**Context:** This is the first of three projects where we introduce `awk`. `awk` is a powerful tool that can be used to perform a variety of the tasks that we've previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner. - -**Scope:** awk, UNIX utilities - -.Learning Objectives -**** -- Use awk to process and manipulate textual data. -- Use piping and redirection within the terminal to pass around data between utilities. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_y7xudoq5?wid=_983291"></iframe> -++++ - -While the UNIX tools we've used up to this point are very useful, `awk` enables many new capabilities, and can even replace major functionality of other tools. - -In a previous question, we asked you to write a command that printed the number of columns in the dataset. Perform the same operation using `awk`. - -Similarly, we've used `head` to print the header line. Use `awk` to do the same. - -Similarly, we've used `wc` to count the number of lines in the dataset. Use `awk` to do the same. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_8jcag67t?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_0r817w8p?wid=_983291"></iframe> -++++ - -In a previous question, we used `sort` in combination with `uniq` to find the stores with the most number of sales. - -Use `awk` to find the 10 stores with the most number of sales. In a previous solution, our output was minimal -- we had a count and a store number. This time, take some time to format the output nicely, _and_ use the store number to find the count (not store name). - -[TIP] -==== -Sorting an array by values in `awk` can be confusing. Check out https://stackoverflow.com/questions/5342782/sort-associative-array-with-awk[this excellent stackoverflow post] to see a couple of ways to do this. "Edit 2" is the easiest one to follow. 
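One common pattern (not necessarily the one the linked post settles on) is to build the counts inside `awk` and hand the ordering off to `sort`. Below is a rough sketch where the field number and file name are placeholders, not the actual layout of this dataset.

[source,bash]
----
# skip the header (NR > 1) and count how many times each key (here, field 1) appears,
# then print "count key" pairs, sort them numerically in descending order, and keep the top 10
awk -F';' 'NR > 1 { count[$1]++ } END { for (k in count) print count[k], k }' somefile.txt | sort -rn | head -n 10
----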
-==== - -[NOTE] -==== -You can even use the store number to count the number of sales and save the most recent store name for the store number as you go to _print_ the store names with the output. -==== - -[TIP] -==== -You can pipe output to the `column` unix command to get neatly formatted output! - -[source,bash] ----- -man column ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_l7dc748w?wid=_983291"></iframe> -++++ - -Calculate the total sales (in USD). Do this using _only_ `awk`. - -[TIP] -==== -`gsub` is a powerful awk utility that allows you to replace a string with another string. For example, you could replace all `$`'s in field 2 with nothing by: - ----- -gsub(/\$/, "", $2) ----- -==== - -[NOTE] -==== -The `gsub` operation happens in-place. In a nutshell, what this means is that the original field, `$2` is replaced with the result of the `gsub` operation (which removes the dollar signs). -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ilj8mxg5?wid=_983291"></iframe> -++++ - -Calculate the total sales (in USD) _by county_. Do this using _only_ `awk`. Format your output so it looks like the following. - -.output ----- -FRANKLIN: $386729.06 -HARRISON: $401811.83 -Franklin: $2102880.14 -Harrison: $2109578.24 ----- - -Notice anything odd about the result? Look carefully at the dataset and suggest an alternative method that would clean up the issue. - -[TIP] -==== -You can see the issue in our tiny sample of output. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_mgmmketg?wid=_983291"></iframe> -++++ - -`awk` is extremely powerful, and this liquor dataset is pretty interesting! We haven't covered everything `awk` (and we won't). - -Look at the dataset and ask yourself an interesting question about the data. Use `awk` to solve your problem (or, at least, get you closer to answering the question). Explore various stackoverflow questions about `awk` and `awk` guides online. Try to incorporate an `awk` function you haven't used, or a `awk` trick you haven't seen. While this last part is not required, it is highly encouraged and can be a fun way to learn something new. - -[NOTE] -==== -You do not need to limit yourself to _just_ use `awk`, but try to do as much using just `awk` as you are able. -==== - -.Items to submit -==== -- A markdown cell containing the question you are trying to answer about the dataset. -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project06.adoc deleted file mode 100644 index 3fe6b2d2b..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project06.adoc +++ /dev/null @@ -1,112 +0,0 @@ -= TDM 20100: Project 6 -- 2022 - -**Motivation:** `awk` is a programming language designed for text processing. It can be a quick and efficient way to quickly parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation extremely quickly using something like `awk`. - -**Context:** This is the second of three projects where we introduce `awk`. `awk` is a powerful tool that can be used to perform a variety of the tasks that we've previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner. - -**Scope:** awk, UNIX utilities - -.Learning Objectives -**** -- Use awk to process and manipulate textual data. -- Use piping and redirection within the terminal to pass around data between utilities. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/craigslist/vehicles_clean.txt` -- `/anvil/projects/tdm/data/donorschoose/Donations.csv` -- `/anvil/projects/tdm/data/whin/weather.csv` - -== Questions - -=== Question 1 - -Use `awk` to determine how many columns and rows are in the following dataset: `/anvil/projects/tdm/data/craigslist/vehicles_clean.txt`. - -Make sure the output is formatted as follows. - -.output ----- -rows: 12345 -columns: 12345 ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -What are the possible "conditions" of the vehicles being sold: `/anvil/projects/tdm/data/craigslist/vehicles_clean.txt`? Use `awk` to answer this question. How many cars of each condition are in the dataset? Make sure to format the output as follows. - -.output ----- -Condition Number of cars ---------- -------------- -AAA 12345 -bb 99999 ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Use `awk` to determine the years (for example, 2020, 2021, etc) of the donations in the dataset: `/anvil/projects/tdm/data/donorschoose/Donations.csv`? - -[TIP] -==== -The https://thomas-cokelaer.info/blog/2011/05/awk-the-substr-command-to-select-a-substring/[`substr`] function in `awk` will be useful. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Use `awk` to determine the total donations (in dollars) by year: `/anvil/projects/tdm/data/donorschoose/Donations.csv`? - -Use `printf` and the unix `column` utility to format the output as follows. - -.output ----- -Year Donations in dollars -2020 $1234.56 -2021 $9999.99 ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Use `awk` to determine the average `temperature_high` by month: `/anvil/projects/tdm/data/whin/weather.csv`. 
Make sure the output is sorted by month (you can use `sort` for that). If you are feeling adventurous, try and use `awk` to output a horizontal bar plot using just ascii and `awk`. This last part is _not_ required, but could be fun. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project07.adoc deleted file mode 100644 index e309669e0..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project07.adoc +++ /dev/null @@ -1,351 +0,0 @@ -= TDM 20100: Project 7 -- 2022 -:page-mathjax: true - -**Motivation:** `awk` is a programming language designed for text processing. It can be a quick and efficient way to quickly parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation extremely quickly using something like `awk`. - -**Context:** This is the third of three projects where we introduce `awk`. `awk` is a powerful tool that can be used to perform a variety of the tasks that we've previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner. - -**Scope:** awk, UNIX utilities - -.Learning Objectives -**** -- Use awk to process and manipulate textual data. -- Use piping and redirection within the terminal to pass around data between utilities. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_r1rcuxol?wid=_983291"></iframe> -++++ - -Take a look at the dataset. You may have noticed that the "Store Location" column (8th column) contains latitude and longitude coordinates. That is some rich data that could be fun and useful. - -The data will look something like the following: - ----- -Store Location -POINT (-91.716615 41.963516) -POINT (-91.6537 41.987286) -POINT (-91.52888 40.962331000000006) -POINT (-93.596755 41.5464) -POINT (-91.658105 42.010971) -POINT (-91.494611 41.807199) - -POINT (-91.796988 43.307662) -POINT (-91.358467 41.280183) ----- - -What this means is that you can't just parse out the latitude and longitude coordinates and call it a day -- you need to use `awk` functions like `gsub` and `split` to extract the latitude and longitude coordinates. - -Use `awk` to print out the latitude and longitude for each line in the original dataset. Output should resemble the following. 
- ----- -lat;lon -1.23;4.56 ----- - -[NOTE] -==== -Make sure to take care of rows that don't have latitude and longitude coordinates -- just skip them. So if your results look like this, you need to add logic to skip the "empty" rows: - ----- - --91.716615 41.963516 --91.6537 41.987286 --91.52888 40.962331000000006 --93.596755 41.5464 --91.658105 42.010971 --91.494611 41.807199 - --91.796988 43.307662 --91.358467 41.280183 ----- - -To do this, just go ahead and wrap your print in an if statement similar to: - -[source,awk] ----- -if (length(coords[1]) > ) { - print coords[1]";"coords[2] -} ----- -==== - -[TIP] -==== -`split` and `gsub` will be useful `awk` functions to use for this question. -==== - -[TIP] -==== -If we have a bunch of data formatted like the following: - ----- -POINT (-91.716615 41.963516) ----- - -If we first used `split` to split on "(", for example like: - -[source,awk] ----- -split($8, coords, "("); ----- - -`coords[2]` would be: - ----- --91.716615 41.963516) ----- - -Then, you could use `gsub` to remove any ")" characters from `coords[2]` like: - -[source,awk] ----- -gsub(/\)/, "", coords[2]); ----- - -`coords[2]` would be: - ----- --91.716615 41.963516 ----- - -At this point I'm sure you can see how to use `awk` to extract and print the rest! -==== - -[IMPORTANT] -==== -Don't forget any lingering space after the first comma! We don't want that. -==== - -[IMPORTANT] -==== -To verify your `awk` command is correct, pipe the first 10 rows to your `awk` command. The output should be the following. - -[source,ipython] ----- -%%bash - -head -n 10 /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | awk -F';' '{}' ----- - -.output ----- -41.963516;-91.716615 -41.987286;-91.6537 -40.962331000000006;-91.52888 -41.5464;-93.596755 -42.010971;-91.658105 -41.807199;-91.494611 -43.307662;-91.796988 -41.280183;-91.358467 ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_rja4tun7?wid=_983291"></iframe> -++++ - -Use `awk` to create a new dataset called `sales_by_store.csv`. Include the `lat` and `lon` you figured out how to parse in the previous question. The final columns should be the following. - -.columns ----- -store_name;date;sold_usd;volume_sold;lat;lon ----- - -Please exclude all rows that do not have latitude and longitude values. Save volume sold as liters, not gallons. - -[TIP] -==== -You can output the results of the `awk` command to a new file called `sales_by_store.csv` as follows. - -[source,ipython] ----- -%%bash - -awk -F';' {} /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt > $HOME/sales_by_store.csv ----- - -The `>` part is a _redirect_. You are redirecting the output from the `awk` command to a new file called `sales_by_store.csv`. If you were to replace `>` by `>>` it would _append_ instead of _replace_. In other words, if you use a single `>` it will first erase the `sales_by_store.csv` file before adding the results of the `awk` command to the file. If you use `>>`, it will append the results. If you use `>>` and append results -- if you were to run the command more than once, the `sales_by_store.csv` file would continue to grow. -==== - -[TIP] -==== -To verify your output, the results from piping the first 10 lines of our dataset to your `awk` command should be the following. 
- -[source,ipython] ----- -%%bash - -head -n 10 /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | awk -F';' '{}' ----- - -.output ----- -store_name;date;sold_usd;volume_sold;lat;lon -CVS PHARMACY #8443 / CEDAR RAPIDS;08/16/2012;5.25;41.963516;-91.716615 -SMOKIN' JOE'S #6 TOBACCO AND LIQUOR;09/10/2014;9;41.987286;-91.6537 -HY-VEE FOOD STORE / MOUNT PLEASANT;04/10/2013;1.5;40.962331000000006;-91.52888 -AFAL FOOD & LIQUOR / DES MOINES;08/30/2012;1.12;41.5464;-93.596755 -HY-VEE FOOD STORE #5 / CEDAR RAPIDS;01/26/2015;3;42.010971;-91.658105 -SAM'S MAINSTREET MARKET / SOLON;07/19/2012;12;41.807199;-91.494611 -DECORAH MART;10/23/2013;9;43.307662;-91.796988 -ECON-O-MART / COLUMBUS JUNCTION;05/02/2012;2.25;41.280183;-91.358467 ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_u38kx59v?wid=_983291"></iframe> -++++ - -Believe it or not, `awk` even supports geometric calculations like `sin` and `cos`. Write a bash script that, given a pair of latitude and pair of longitude, calculates the distance between the two points. - -Okay, so how to get started? To calculate this, we can use https://en.wikipedia.org/wiki/Haversine_formula[the Haversine formula]. The formula is: - -$2*r*arcsin(\sqrt{sin^2(\frac{\phi_2 - \phi_1}{2}) + cos(\phi_1)*cos(\phi_2)*sin^2(\frac{\lambda_2 - \lambda_1}{2})})$ - -Where: - -- $r$ is the radius of the Earth in kilometers, we can use: 6367.4447 kilometers -- $\phi_1$ and $\phi_2$ are the latitude coordinates of the two points -- $\lambda_1$ and $\lambda_2$ are the longitude coordinates of the two points - -In `awk`, `sin` is `sin`, `cos` is `cos`, and `sqrt` is `sqrt`. - -To get the `arcsin` use the following `awk` function: - -[source,awk] ----- -function arcsin(x) { return atan2(x, sqrt(1-x*x)) } ----- - -To convert from degrees to radians, use the following `awk` function: - -[source,awk] ----- -function dtor(x) { return x*atan2(0, -1)/180 } ----- - -The following is how the script should work (with a real example you can test): - -[source,ipython] ----- -%%bash - -./question3.sh 40.39978 -91.387531 40.739238 -95.02756 ----- - -.Results ----- -309.57 ----- - -[TIP] -==== -To include functions in your `awk` command, do as follows: - -[source,bash] ----- -awk -v lat1=$1 -v lat2=$3 -v lon1=$2 -v lon2=$4 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }BEGIN{ - lat1 = dtor(lat1); - print lat1; - # rest of your code here! -}' ----- -==== - -[TIP] -==== -We want you to create a bash script called `question3.sh` in your `$HOME` directory. After you have your bash script, we want you to run it in a bash cell to see the output. - -The following is some skeleton code that you can use to get started. - -[source,bash] ----- -#!/bin/bash - -lat1=$1 -lat2=$3 -lon1=$2 -lon2=$4 - -awk -v lat1=$1 -v lat2=$3 -v lon1=$2 -v lon2=$4 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }BEGIN{ - lat1 = dtor(lat1); - print lat1; - # rest of your code here! -}' ----- -==== - -[TIP] -==== -You may need to give your script execute permissions like this. 
- -[source,ipython] ----- -%%bash - -chmod +x $HOME/question3.sh ----- -==== - -[TIP] -==== -Read the https://the-examples-book.com/starter-guides/unix/scripts#shebang[shebang] and https://the-examples-book.com/starter-guides/unix/scripts#arguments[arguments] sections in the book. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_m6fgshy9?wid=_983291"></iframe> -++++ - -Find the latitude and longitude points for two interesting points on a map (it could be anywhere). Make a note of the locations and the latitude and longitude values for each point in a markdown cell. - -Use your `question.sh` script to determine the distance. How close is the distance to the distance you get from an online map app? Pretty close? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project08.adoc deleted file mode 100644 index d911ecc19..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project08.adoc +++ /dev/null @@ -1,213 +0,0 @@ -= TDM 20100: Project 8 -- 2022 - -**Motivation:** Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. In fact, https://cloudflare.com[Cloudflare], a billion dollar company, had much of its starting infrastructure built on top of a Postgresql database (per https://news.ycombinator.com/item?id=22878136[this thread on hackernews]). Learning SQL is well worth your time! - -**Context:** There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we've spent much of this semester in the terminal, we will start in the terminal using SQLite. - -**Scope:** SQL, sqlite - -.Learning Objectives -**** -- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet. -- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause. -- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` - -In addition, the following is an illustration of the database to help you understand the data. 
- -image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"] - -For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above. - -To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database. - -[source,ipython] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells. - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_xqnw535y?wid=_983291"></iframe> -++++ - -Get started by taking a look at the available tables in the database. What tables are available? - -[TIP] -==== -You'll want to prepend `%%sql` to the top of the cell -- it should be the very first line of the cell (no comments or _anything_ else before it). - -[source,ipython] ----- -%%sql - --- Query here ----- -==== - -[TIP] -==== -In sqlite, you can show the tables using the following query: - -[source, sql] ----- -.tables ----- - -Unfortunately, sqlite-specific functions can't be run in a Jupyter Lab cell like that. Instead, we need to use a different query. - -[source, sql] ----- -SELECT tbl_name FROM sqlite_master WHERE type='table'; ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_rd3rx3rx?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_2ckyo1pr?wid=_983291"></iframe> -++++ - -It's always a good idea to get an idea what your table(s) looks like. A good way to do this is to get the first 5 rows of data from the table. Write and run 6 queries that return the first 5 rows of data of each table. - -To get a better idea of the size of the data, you can use the `count` clause to get the number of rows in each table. Write an run 6 queries that returns the number of rows in each table. - -[TIP] -==== -Run each query in a separate cell, and remember to limit the query to return only 5 rows each. - -You can use the `limit` clause to limit the number of rows returned. -==== - -**Relevant topics:** xref:programming-languages:SQL:queries.adoc#examples[queries], xref:programming-languages:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-the-first-5-rows-of-the-employees-table[useful example] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_7b1r2arj?wid=_983291"></iframe> -++++ - -This dataset contains movie data from https://imdb.com (an Amazon company). As you can probably guess, it would be difficult to load the data from those tables into a nice, neat dataframe -- it would just take too much memory on most systems! - -Okay, let's dig into the `titles` table a little bit. Run the following query. 
- -[source, sql] ----- -SELECT * FROM titles LIMIT 5; ----- - -As you can see, every row has a `title_id` for the associated title of a movie or tv show (or other). What is this `title_id`? Check out the following link: - -https://www.imdb.com/title/tt0903747/ - -At this point, you may suspect that it is the id imdb uses to identify a movie or tv show. Well, let's see if that is true. Query our database to get any matching titles from the `titles` table matching the `title_id` provided in the link above. - -[TIP] -==== -The `WHERE` clause can be used to filter the results of a query. -==== - -**Relevant topics:** xref:programming-languages:SQL:queries.adoc#examples[queries], xref:programming-languages:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-only-employees-with-the-first-name-steve-or-last-name-laura[useful example] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_q95ke71x?wid=_983291"></iframe> -++++ - -That is pretty cool! Not only do you understand what the `title_id` means _inside_ the database -- but now you know that you can associate a web page with each `title_id` -- for example, if you run the following query, you will get a `title_id` for a "short" called "Carmencita". - -[source, sql] ----- -SELECT * FROM titles LIMIT 5; ----- - -.Output ----- -title_id, type, ... -tt0000001, short, ... ----- - -If you navigate to https://www.imdb.com/title/tt0000001/, sure enough, you'll see a neatly formatted page with data about the movie! - -Okay great. Now, if you take a look at the `episodes` table, you'll see that there are both an `episode_title_id` and `show_title_id` associated with each row. - -Let's try and make sense of this the same way we did before. Write a query using the `WHERE` clause to find all rows in the `episodes` table where `episode_title_id` is `tt0903747`. What did you get? - -Now, write a query using the `WHERE` clause to find all rows in the `episodes` table where `show_title_id` is `tt0903747`. What did you get? - -**Relevant topics:** xref:programming-languages:SQL:queries.adoc#examples[queries], xref:programming-languages:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-only-employees-with-the-first-name-steve-or-last-name-laura[useful example] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_hwj5ffz9?wid=_983291"></iframe> -++++ - -Very interesting! It looks like we didn't get any results when we queried for `episode_title_id` with an id of `tt0903747`, but we did for `show_title_id`. This must mean these ids can represent both a _show_ as well as the _episode_ of a show. By that logic, we should be able to find the _title_ of one of the Breaking Bad episodes, in the same way we found the title of the show itself, right? - -Okay, take a look at the results of your second query from question (4). Choose one of the `episode_title_id` values, and query the `titles` table to find the title of that episode. - -Finally, in a browser, verify that the title of the episode is correct. To verify this, take the `episode_title_id` and plug it into the following link. 
- -https://www.imdb.com/title/<episode_title_id>/ - -So, I used `tt1232248` for my query. I would check to make sure it matches this. - -https://www.imdb.com/title/tt1232248/ - -**Relevant topics:** xref:programming-languages:SQL:queries.adoc#examples[queries], xref:programming-languages:SQL:queries.adoc#using-the-sqlite-chinook-database-here-select-only-employees-with-the-first-name-steve-or-last-name-laura[useful example] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project09.adoc deleted file mode 100644 index 18f16b4f7..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project09.adoc +++ /dev/null @@ -1,157 +0,0 @@ -= TDM 20100: Project 9 -- 2022 - -**Motivation:** Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a "bread-and-butter" skill for anyone working with data. - -**Context:** We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like `AVG`, `MIN`, and `MAX`. - -**Scope:** SQL, sqlite - -.Learning Objectives -**** -- Explain the advantages and disadvantages of using a database over a tool like a spreadsheet. -- Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause. -- Basic clauses: select, order by, limit, desc, asc, count, where, from, etc. -- Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/taxi/taxi_sample.db` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_wxzavjdp?wid=_983291"></iframe> -++++ - -In previous projects, we used `awk` to parse through and summarize data. While `awk` is extremely convenient and can work well, but SQL is even better. - -Write a query that will return the `fare_amount`, `surcharge`, `tip_amount`, and `tolls_amount` as a percentage of `total_amount`. - -[IMPORTANT] -==== -Make sure to limit the output to only 100 rows! Use the `LIMIT` clause to do this. -==== - -[TIP] -==== -Use the `sum` aggregate function to calculate the totals, and division to figure out the percentages. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_nw3ug0qu?wid=_983291"></iframe> -++++ - -Check out the `payment_type` column. Write a query that counts the number of each type of `payment_type`. The end result should print something like the following. - -.Output sample ----- -payment_type, count -CASH, 123 ----- - -[TIP] -==== -You can use aliasing to control the output header names. -==== - -Write a query that sums the `total_amount` for `payment_type` of "CASH". What is the total amount of cash payments? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_xcd58b60?wid=_983291"></iframe> -++++ - -Write a query that gets the largest number of passengers in a single trip. How far was the trip? What was the total amount? Answer all of this in a single query. - -Whoa, there must be some erroneous data in the database! Not too surprising. Write a query that explores this more, explain what your query does and how it helps you understand what is going on. - -[IMPORTANT] -==== -Make sure all queries limit output to only 100 rows. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_i5jqphga?wid=_983291"></iframe> -++++ - -Write a query that gets the average `total_amount` for each year in the database. Which year has the largest average `total_amount`? Use the `pickup_datetime` column to determine the year. - -[TIP] -==== -Read https://www.sqlite.org/lang_datefunc.html[this] page and look at the strftime function. -==== - -[TIP] -==== -If you want the headers to be more descriptive, you can use aliases. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_tjope0og?wid=_983291"></iframe> -++++ - -What percent of data in our database has information on the _location_ of pickup and dropoff? Examine the data, to see if there is a pattern to the rows _with_ that information and _without_ that information. - -[TIP] -==== -There _is_ a distinct pattern. Pay attention to the date and time of the data. -==== - -Confirm your hypothesis with the original data set(s) (in `/anvil/projects/tdm/data/taxi/yellow/*.csv`), using bash. This doesn't have to be anything more thorough than running a simple `head` command with a 1-2 sentence explanation. - -[TIP] -==== -Of course, there will probably be some erroneous data for the latitude and longitude columns. However, you could use the `avg` function on a latitude or longitude column, by _year_ to maybe get a pattern. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project10.adoc deleted file mode 100644 index 97f821cfd..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project10.adoc +++ /dev/null @@ -1,275 +0,0 @@ -= TDM 20100: Project 10 -- 2022 - -**Motivation:** Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like `MIN`, `MAX`, and `AVG` in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values. - -**Context:** We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values! - -**Scope:** SQL, SQL in R - -.Learning Objectives -**** -- Demonstrate the ability to interact with popular database management systems within R. -- Solve data-driven problems using a combination of SQL and R. -- Basic clauses: SELECT, ORDER BY, LIMIT, DESC, ASC, COUNT, WHERE, FROM, etc. -- Showcase the ability to filter, alias, and write subqueries. -- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. Explain when to use having, and when to use where. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` - -In addition, the following is an illustration of the database to help you understand the data. - -image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"] - -For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above. - -To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database. - -[source,ipython] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells. - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_wi4b0jwc?wid=_983291"></iframe> -++++ - -Let's say we are interested in the Marvel Cinematic Universe (MCU). We could write the following query to get the titles of all the movies in the MCU (at least, available in our database). 
- -[source, sql] ----- -SELECT premiered, COUNT(*) FROM titles WHERE title_id IN ('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286') GROUP BY premiered; ----- - -The result would be a perfectly good-looking table. Now, with that being said, are the headers good-looking? Is it clear what data each column contains? I don't know about you, but `COUNT(*)` as a header is not very clear. xref:programming-languages:SQL:aliasing.adoc[Aliasing] is a great way to not only make the headers look good, but it can also be used to reduce the text in a query by giving some intermediate results a shorter name. - -Fix the query so that the headers are `year` and `movie count`, respectively. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_5qsrgrv8?wid=_983291"></iframe> -++++ - -Okay, let's say we are interested in modifying our query from question (1) to get the _percentage_ of MCU movies released in each year. Essentially, we want the count for each group, divided by the total count of all the movies in the MCU. - -We can achieve this using a _subquery_. A subquery is a query that is used to get a smaller result set from a larger result set. - -Write a query that returns the total count of the movies in the MCU, and then use it as a subquery to get the percentage of MCU movies released in each year. - -[TIP] -==== -You do _not_ need to change the query from question (1), rather, you just need to _add_ to the query. -==== - -[TIP] -==== -You can directly divide `COUNT(*)` from the original query by the subquery to get the result! -==== - -[WARNING] -==== -Your initial result may seem _very_ wrong (no fractions at all!) this is OK -- we will fix this in the next question. -==== - -[IMPORTANT] -==== -Use aliasing to rename the new column to `percentage`. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_xrh1s5a2?wid=_983291"></iframe> -++++ - -Okay, if you did question (2) correctly, you should have got a result that looks a lot like: - -.Output ----- -year,movie count,percentage -2008, 2, 0 -2010, 1, 0 -2011, 2, 0 -... ----- - -What is going on? - -The `AS` keyword can _also_ be used to _cast_ types. Some of you may or may not be familiar with a feature of many programming languages. Common in many programming languages is an "integer" type -- which is for numeric data _without_ a decimal place, and a "float" type -- which is for numeric data _with_ a decimal place. In _many_ languages, if you were to do the following, you'd get what _may_ be unexpected output. - -[source,c] ----- -9/4 ----- - -.Output ----- -2 ----- - -Since both of the values are integers, the result will truncate the decimal place. In other words, the result will be 2, instead of 2.25. - -In Python, they've made changes so this doesn't happen. 
- -[source,python] ----- -9/4 ----- - -.Output ----- -2.25 ----- - -However, if we want the "regular" functionality we can use the `//` operator. - -[source,python] ----- -9//4 ----- - -.Output ----- -2 ----- - -Okay, sqlite does this as well. - -[source, sql] ----- -SELECT 9/4 as result; ----- - -.Output ----- -result -2 ----- - -_This_ is why we are getting 0's for the percentage column! - -How do we fix this? The following is an example. - -[source, sql] ----- -SELECT CAST(9 AS real)/4 as result; ----- - -.Output ----- -result -2.25 ----- - -[NOTE] -==== -Here, "real" represents "float" or "double" -- it is another way of saying a number with a decimal place. -==== - -[IMPORTANT] -==== -When you do arithmetic with an integer and a real/float, the result will be a real/float. This is why our result is a real even though 50% of our values are integers. -==== - -Fix the query so that the results look something like: - -.Output ----- -year, movie count, percentage -2008, 2, 0.0689... -2010, 1, 0.034482... -2011, 2, 0.0689... ----- - -[NOTE] -==== -You can read more about `sqlite3` types https://www.sqlite.org/datatype3.html[here]. In a lot of ways, the `sqlite3` typing system is simpler than typical RDBMS systems, and it other ways it is more complex. `sqlite3` considers their flexible typing https://www.sqlite.org/flextypegood.html[a feature]. However, `sqlite3` does provide https://www.sqlite.org/stricttables.html[strict tables] for individuals who want a more stringent set of typing rules. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_1mozikhs?wid=_983291"></iframe> -++++ - -You now know 2 different applications of the `AS` keyword, and you also know how to use a query as a subquery, great! - -In the previous project, we were introduced to aggregate functions. We used the GROUP BY clause to group our results by the `premiered` column in this project too! We know we can use the `WHERE` clause to filter our results, but what if we wanted to filter our results based on an aggregated column? - -Modify our query from question (3) to print only the rows where the `movie count` is greater than 2. - -[TIP] -==== -See https://www.geeksforgeeks.org/having-vs-where-clause-in-sql/[this article] for more information on the `HAVING` and `WHERE` clauses. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_g0qo4yxu?wid=_983291"></iframe> -++++ - -Write a query that returns the average number of words in the `primary_title` column, by year, and only for years where the average number of words in the `primary_title` is less than 3. - -Look at the results. Which year had the lowest average number of words in the `primary_title` column (no need to write another query for this, just eyeball it)? - -[TIP] -==== -See https://stackoverflow.com/questions/3293790/query-to-count-words-sqlite-3[here]. Replace "@String" with the column you want to count the words in. -==== - -[TIP] -==== -If you got it right, there should be 15 rows in the output. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project11.adoc deleted file mode 100644 index 69c83325e..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project11.adoc +++ /dev/null @@ -1,149 +0,0 @@ -= TDM 20100: Project 11 -- 2022 - -**Motivation:** Databases are (usually) comprised of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so we perform "joins"! In this project we will explore learn about and practice using joins on our imdb database, as it has many tables where the benefit of joins is obvious. - -**Context:** We've introduced a variety of SQL commands that let you filter and extract information from a database in an systematic way. In this project we will introduce joins, a powerful method to combine data from different tables. - -**Scope:** SQL, sqlite, joins - -.Learning Objectives -**** -- Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem. -- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. -- Showcase the ability to filter, alias, and write subqueries. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` - -In addition, the following is an illustration of the database to help you understand the data. - -image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"] - -For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above. - -To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database. - -[source,ipython] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells. - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_tp3o35rt?wid=_983291"></iframe> -++++ - -In the previous project, we provided you with a query to get the number of MCU movies that premiered in each year. - -Now that we are learning about _joins_, we have the ability to make much more interesting queries! - -Use the provided list of `title_id` values to get a list of the MCU movie `primary_title` values, `premiered` values, and rating (from the provided list of MCU movies). - -Which movie had the highest rating? 
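If you are unsure how to start combining the two tables, the sketch below shows the general shape of a join between `titles` and `ratings`. It assumes, based on the database diagram above, that `ratings` also has a `title_id` column to join on -- it is only a starting skeleton, so you still need to add the MCU `title_id` filter and any ordering yourself.

[source, sql]
----
-- minimal join skeleton (assumes ratings.title_id exists, per the diagram above)
SELECT t.primary_title, t.premiered, r.rating
FROM titles AS t
INNER JOIN ratings AS r ON t.title_id = r.title_id
LIMIT 5;
----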
Modify your query to return only the 5 highest and 5 lowest rated movies (again, from the MCU list). - -.List of MCU title_ids ----- -('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286') ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_aspvz5jh?wid=_983291"></iframe> -++++ - -Run the following query. - -[source,ipython] ----- -%%sql - -SELECT * FROM titles WHERE title_id IN ('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286'); ----- - -Pay close attention to the movies in the output. You will notice there are movies presented in this query that are (likely) not in the query results you got for question (1). - -Write a query that returns the `primary_title` of those movies _not_ shown in the result of question (1) but that _are_ shown in the result of the query above. You can use the query in question (1) as a subquery to answer this. - -Can you notice a pattern to said movies? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_hqa1abza?wid=_983291"></iframe> -++++ - -In the previous questions we explored what is _actually_ the difference between an INNER JOIN, and a LEFT JOIN. It is likely you used an INNER JOIN/JOIN in your solution to question (1). As a result, the MCU movies that did not yet have a rating in IMDB are not shown in the output of question (1). - -Modify your query from question (1) so that it returns a list of _all_ MCU movies with their associated rating, regardless of whether or not the movie has a rating. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_di87hxgn?wid=_983291"></iframe> -++++ - -In the previous project, question (5) asked you to write a query that returns the average number of words in the `primary_title` column, by year, and only for years where the average number of words in the `primary_title` is less than 3. - -Okay, great. What would be more interesting would be to see the average number of words in the `primary_title` column for titles with a rating of 8.5 or higher. Write a query to do that. How many words on average does a title with 8.5 or higher rating have? - -Write another query that does the same for titles with < 8.5 rating. Is the average title length notably different? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_nhev4n5v?wid=_983291"></iframe> -++++ - -We have a fun database, and you've learned a new trick (joins). Use your newfound knowledge to write a query that uses joins to accomplish a task you couldn't previously (easily) tackle, and answers a question you are interested in. - -Explain what your query does, and talk about the results. Explain why you chose either a LEFT join or INNER join. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project12.adoc deleted file mode 100644 index 20d06752f..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project12.adoc +++ /dev/null @@ -1,342 +0,0 @@ -= TDM 20100: Project 12 -- 2022 - -**Motivation:** In the previous projects, you've gained experience writing all types of queries, touching on the majority of the main concepts. One critical concept that we _haven't_ yet done is creating your _own_ database. While typically database administrators and engineers will typically be in charge of large production databases, it is likely that you may need to prop up a small development database for your own use at some point in time (and _many_ of you have had to do so this year!). In this project, we will walk through all of the steps to prop up a simple sqlite database for one of our datasets. - -**Context:** This is the final project for the semester, and we will be walking through the useful skill of creating a database and populating it with data. We will (mostly) be using the https://www.sqlite.org/[sqlite3] command line tool to interact with the database. - -**Scope:** sql, sqlite, unix - -.Learning Objectives -**** -- Create a sqlite database schema. -- Populate the database with data using `INSERT` statements. -- Populate the database with data using the command line interface (CLI) for sqlite3. -- Run queries on a database. -- Create an index to speed up queries. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/2007.csv` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_7ctatp8n?wid=_983291"></iframe> -++++ - -[WARNING] -==== -For any questions requiring a screenshot be included in your notebook, follow the method described https://the-examples-book.com/projects/current-projects/templates#including-an-image-in-your-notebook[here] in order to add a screenshot to your notebook. 
-==== - -First thing is first, create a new Jupyter Notebook called `firstname-lastname-project12.ipynb`. You will put the text of your solutions in this notebook. Next, in Jupyter Lab, open a fresh terminal window. We will be able to run the `sqlite3` command line tool from the terminal window. - -Okay, once completed, the first step is schema creation. First, it is important to note. **The goal of this project is to put the data in `/anvil/projects/tdm/data/flights/subset/2007.csv` into a sqlite database we will call `firstname-lastname-project12.db`.** - -With that in mind, run the following (in your terminal) to get a sample of the data. - -[source,bash] ----- -head /anvil/projects/tdm/data/flights/subset/2007.csv ----- - -You _should_ receive a result like: - -.Output ----- -Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay -2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0 -2007,1,1,1,1918,1905,2043,2035,WN,462,N370,85,90,74,8,13,SMF,PDX,479,5,6,0,,0,0,0,0,0,0 -2007,1,1,1,2206,2130,2334,2300,WN,1229,N685,88,90,73,34,36,SMF,PDX,479,6,9,0,,0,3,0,0,0,31 -2007,1,1,1,1230,1200,1356,1330,WN,1355,N364,86,90,75,26,30,SMF,PDX,479,3,8,0,,0,23,0,0,0,3 -2007,1,1,1,831,830,957,1000,WN,2278,N480,86,90,74,-3,1,SMF,PDX,479,3,9,0,,0,0,0,0,0,0 -2007,1,1,1,1430,1420,1553,1550,WN,2386,N611SW,83,90,74,3,10,SMF,PDX,479,2,7,0,,0,0,0,0,0,0 -2007,1,1,1,1936,1840,2217,2130,WN,409,N482,101,110,89,47,56,SMF,PHX,647,5,7,0,,0,46,0,0,0,1 -2007,1,1,1,944,935,1223,1225,WN,1131,N749SW,99,110,86,-2,9,SMF,PHX,647,4,9,0,,0,0,0,0,0,0 -2007,1,1,1,1537,1450,1819,1735,WN,1212,N451,102,105,90,44,47,SMF,PHX,647,5,7,0,,0,20,0,0,0,24 ----- - -An SQL schema is a set of text or code that defines how the database is structured and how each piece of data is stored. In a lot of ways it is similar to how a data.frame has columns with different types -- just more "set in stone" than the very easily changed data.frame. - -Each database handles schemas slightly differently. In sqlite, the database will contain a single schema table that describes all included tables, indexes, triggers, views, etc. Specifically, each entry in the `sqlite_schema` table will contain the type, name, tbl_name, rootpage, and sql for the database object. - -[NOTE] -==== -For sqlite, the "database object" could refer to a table, index, view, or trigger. -==== - -This detail is more than is needed for right now. If you are interested in learning more, the sqlite documentation is very good, and the relevant page to read about this is https://www.sqlite.org/schematab.html[here]. - -For _our_ purposes, when I refer to "schema", what I _really_ mean is the set of commands that will build our tables, indexes, views, and triggers. sqlite makes it particularly easy to open up a sqlite database and get the _exact_ commands to build the database from scratch _without_ the data itself. For example, take a look at our `imdb.db` database by running the following in your terminal. - -[source,bash] ----- -sqlite3 /anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -This will open the command line interface (CLI) for sqlite3. It will look similar to: - -[source,bash] ----- -sqlite> ----- - -Type `.schema` to see the "schema" for the database. 
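To give you a sense of what to expect, `.schema` simply prints the SQL statements that define every object in the database. The output is a series of statements shaped roughly like the following -- this is a made-up illustration, not the actual `imdb.db` schema.

[source, sql]
----
-- hypothetical example of what .schema output looks like (not the real imdb.db schema)
CREATE TABLE example_table (
    id INTEGER PRIMARY KEY,
    name TEXT,
    score REAL
);
CREATE INDEX ix_example_table_name ON example_table(name);
----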
- -[NOTE] -==== -Any command you run in the sqlite CLI that starts with a dot (`.`) is called a "dot command". A dot command is exclusive to sqlite and the same functionality cannot be expected to be available in other SQL tools like Postgresql, MariaDB, or MS SQL. You can list all of the dot commands by typing `.help`. -==== - -After running `.schema`, you should see a variety of legitimate SQL commands that will create the structure of your database _without_ the data itself. This is an extremely useful self-documenting tool that is particularly useful. - -Okay, great. Now, let's study the sample of our `2007.csv` dataset. Create a markdown list of key:value pairs for each column in the dataset. Each _key_ should be the title of the column, and each _value_ should be the _type_ of data that is stored in that column. - -For example: - -- Year: INTEGER - -Where the _value_ is one of the 5 "affinity types" (INTEGER, TEXT, BLOB, REAL, NUMERIC) in sqlite. See section "3.1.1" https://www.sqlite.org/datatype3.html[here]. - -Okay, you may be asking, "what is the difference between INTEGER, REAL, and NUMERIC?". Great question. In general (for other SQL RDBMSs), there are _approximate_ numeric data types and _exact_ numeric data types. What you are most familiar with is the _approximate_ numeric data types. In R or Python for example, try running the following: - -[source,r] ----- -(3 - 2.9) <= 0.1 ----- - -.Output ----- -FALSE ----- - -[source,python] ----- -(3 - 2.9) <= 0.1 ----- - -.Output ----- -False ----- - -Under the hood, the values are stored as a very close approximation of the real value. This small amount of error is referred to as floating point error. There are some instances where it is _critical_ that values are stored as exact values (for example, in finance). In those cases, you would need to use special data types to handle it. In sqlite, this type is NUMERIC. So, for _our_ example, store text as TEXT, numbers _without_ decimal places as INTEGER, and numbers with decimal places as REAL -- our example dataset doesn't have a need for NUMERIC. - -.Items to submit -==== -- Screenshot showing the `sqlite3` output when running `.schema` on the `imdb.db` database. -- A markdown cell containing a list of key value pairs that describe a type for each column in the `2007.csv` dataset. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_jvyfouts?wid=_983291"></iframe> -++++ - -Okay, great! At this point in time you should have a list of key:value pairs with the column name and the data type, for each column. Now, let's put together our `CREATE TABLE` statement that will create our table in the database. - -See https://www.sqlitetutorial.net/sqlite-create-table/[here] for some good examples. Realize that the `CREATE TABLE` statement is not so different from any other query in SQL, and although it looks messy and complicated, it is not so bad. Name your table `flights`. - -Once you've written your `CREATE TABLE` statement, create a new, empty database by running the following in a terminal: `sqlite3 $HOME/flights.db`. Copy and paste the `CREATE TABLE` statement into the sqlite CLI. Upon success, you should see the statement printed when running the dot command `.schema`. Fantastic! You can also verify that the table exists by running the dot command `.tables`. - -Congratulations! 
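In case you got stuck along the way, here is a partial, hedged sketch of the general shape of the statement. It only spells out a handful of the columns, using types consistent with the sample rows above -- your statement needs to cover all of the columns in the header (29 of them), with the types you chose in question (1).

[source, sql]
----
-- partial sketch only -- fill in the remaining columns yourself
CREATE TABLE flights (
    Year INTEGER,
    Month INTEGER,
    DayofMonth INTEGER,
    UniqueCarrier TEXT,
    Origin TEXT,
    Dest TEXT,
    -- ... the rest of the columns go here ...
    LateAircraftDelay INTEGER
);
----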
To finish things off, please paste the `CREATE TABLE` statement into a markdown cell in your notebook. In addition, include a screenshot of your `.schema` output after your `CREATE TABLE` statement was run. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_7k8nx3e3?wid=_983291"></iframe> -++++ - -The next step in the project is to add the data! After all, it _is_ a _data_ base. - -To insert data into a table _is_ a bit cumbersome. For example, let's say we wanted to add the following row to our `flights` table. - -.Data to add ----- -Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay -2007,1,1,1,1232,1225,1341,1340,WN,2891,N351,69,75,54,1,7,SMF,ONT,389,4,11,0,,0,0,0,0,0,0 ----- - -The SQL way would be to run the following query. - -[source, sql] ----- -INSERT INTO flights (Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay) VALUES (2007,1,1,1,1232,1225,1341,1340,'WN',2891,'N351',69,75,54,1,7,'SMF','ONT',389,4,11,0,,0,0,0,0,0,0); ----- - -NOT ideal -- especially since we have over 7 million rows to add! You could programmatically generate a `.sql` file with the `INSERT INTO` statement, hook the database up with Python or R and insert the data that way, _or_ you could use the wonderful dot commands sqlite already provides. - -Insert the data from `2007.csv` into your `flights.db` database. You may find https://stackoverflow.com/questions/13587314/sqlite3-import-csv-exclude-skip-header[this post] very helpful. - -[WARNING] -==== -You want to make sure you _don't_ include the header line twice! If you included the header line twice, you can verify by running the following in the sqlite CLI. - -[source,sql] ----- -.header on -SELECT * FROM flights LIMIT 2; ----- - -The `.header on` dot command will print the header line for every query you run. If you have double entered the header line, it will appear twice. Once for the `.header on` and another time because that is the first row of your dataset. -==== - -Connect to your database in your Jupyter notebook and run a query to get the first 5 rows of your table. - -[TIP] -==== -To connect to your database: - -[source,ipython] ----- -%sql sqlite:///$HOME/flights.db ----- -==== - -.Items to submit -==== -- An `sql` cell in your notebook that connects to your database and runs a query to get the first 5 rows of your table. -- Output from running the code. -==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_ybwwym37?wid=_983291"></iframe> -++++ - -Woohoo! You've successfully created a database and populated it with data from a dataset -- pretty cool! Connect to your databse from inside a terminal. - -[source,bash] ----- -sqlite3 $HOME/flights.db ----- - -Now, run the following dot command in order to _time_ our queries: `.timer on`. 
This will print out the time it takes to run each query. For example, try the following: - -[source, sql] ----- -SELECT * FROM flights LIMIT 5; ----- - -Cool! Time the following query. - -[source, sql] ----- -SELECT * FROM flights ORDER BY DepTime LIMIT 1000; ----- - -.Output ----- -Run Time: real 1.824 user 0.836007 sys 0.605384 ----- - -That is pretty quick, but if (for some odd reason) there were going to be a lot of queries that searched on exact departure times, this could be a big waste of time when done at scale. What can we do to improve this? Add an index! - -Run the following query. - -[source, sql] ----- -EXPLAIN QUERY PLAN SELECT * FROM flights WHERE DepTime = 1232; ----- - -The output will indicate that the "plan" is to simply scan the entire table. This has a runtime of O(n), which means the time grows linearly with the number of rows in the table. If scanning 1 million rows takes 1 second, then scanning 1 billion rows will take roughly 16 minutes! An _index_ is a data structure that lets us reduce the runtime to O(log(n)). This means that if a lookup over 1 million rows takes 1 second, the same lookup over 1 billion rows would take only about 1.5 seconds, since the work grows with the logarithm of the table size. _Much_ more efficient! So what is the catch here? Space. - -Leave the sqlite CLI by running `.quit`. Now, see how much space your `flights.db` file is using. - -[source,bash] ----- -ls -lah $HOME/flights.db ----- - -.Output ----- -545M ----- - -Okay, _after_ I add an index on the `DepTime` column, the file is now `623M` -- while that isn't a _huge_ difference, it would certainly be significant if we scaled up the size of our database. In this case, another drawback would be the insert time. Inserting new data into the database would force the database to _update_ the indexes. This can add a _lot_ of time. These are just tradeoffs to consider when you're working with a database. - -In this case, we don't care about the extra bit of space -- create an index on the `DepTime` column. https://medium.com/@JasonWyatt/squeezing-performance-from-sqlite-indexes-indexes-c4e175f3c346[This article] is a nice easy read that covers this in more detail. - -Great! Once you've created your index, run the following query. - -[IMPORTANT] -==== -Make sure you turn on the timer first by running `.timer on`! -==== - -[source, sql] ----- -SELECT * FROM flights ORDER BY DepTime LIMIT 1000; ----- - -.Output ----- -Run Time: real 0.095 user 0.009746 sys 0.014301 ----- - -Wow! That is some _serious_ improvement. What does the "plan" look like? - -[source, sql] ----- -EXPLAIN QUERY PLAN SELECT * FROM flights WHERE DepTime = 1232; ----- - -You'll notice the "plan" shows it will utilize the index to speed the query up. Great! - -Finally, take a glance at how much space the database takes up now. Mine is now 623M! An increase of about 14%. Not bad! - -.Items to submit -==== -- Screenshots of your terminal output showing the following: - - The size of your database before adding the index. - - The size of your database after adding the index. - - The time it took to run the query before adding the index. - - The time it took to run the query after adding the index. - - The "plan" for the query before adding the index. - - The "plan" for the query after adding the index. -==== - -=== Question 5 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_gn75w8nj?wid=_983291"></iframe> -++++ - -We hope that this project has given you a small glimpse into the "other side" of databases.
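Before you tackle this last exercise, here is a quick reminder of the index-creation syntax. The index and column names below are placeholders for illustration only -- pick whichever column(s) your own query actually filters or sorts on.

[source, sql]
----
-- single-column index (placeholder name and column)
CREATE INDEX ix_flights_dest ON flights(Dest);

-- composite index across two columns, for the optional challenge (placeholders)
CREATE INDEX ix_flights_origin_dest ON flights(Origin, Dest);
----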
Now, write a query that uses one or more other columns. Time the query, then, create a _new_ index to speed the query up. Time the query _after_ creating the index. Did it work well? - -Document the steps of this problem just like you did for question (4). - -**Optional challenge:** Try to make your query utilize 2 columns and create an index on both columns to see if you can get a speedup. - -.Items to submit -==== -- Screenshots of your terminal output showing the following: - - The size of your database before adding the index. - - The size of your database after adding the index. - - The time it took to run the query before adding the index. - - The time it took to run the query after adding the index. - - The "plan" for the query before adding the index. - - The "plan" for the query after adding the index. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project13.adoc deleted file mode 100644 index 3a151b56a..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-project13.adoc +++ /dev/null @@ -1,224 +0,0 @@ -= TDM 20100: Project 13 -- 2022 - -**Motivation:** We've covered a lot about SQL in a relatively short amount of time, but we still haven't touched on some other important SQL topics. In this final project, we will touch on some other important SQL topics. - -**Context:** In the previous project, you had the opportunity to take the time to insert data into a `sqlite3` database. There are still many common tasks that you may need to perform using a database: triggers, views, transaction, and even a few `sqlite3`-specific functionalities that may prove useful. - -**Scope:** SQL - -.Learning Objectives -**** -- Create a trigger on your `sqlite3` database and demonstrate that it works. -- Create one or more views on your `sqlite3` database and demonstrate that they work. -- Describe and use a database transaction. Rollback a transaction. -- Optionally, use the `sqlite3` "savepoint", "rollback to", and "release" commands. -- Optionally, use the `sqlite3` "attach" and "detach" commands to execute queries across multiple databases. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/flights_sample.db` -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_sngu0vft?wid=_983291"></iframe> -++++ - -Begin by copying the database from the previous project to your `$HOME` directory. Open up a terminal and run the following. 
- -[source,bash] ----- -cp /anvil/projects/tdm/data/flights/subset/flights_sample.db $HOME ----- - -Go ahead and launch `sqlite3` and connect to the database. - -[source,bash] ----- -sqlite3 $HOME/flights_sample.db ----- - -From within `sqlite3`, test things out to make sure the data looks right. - -[source, sql] ----- -.header on -SELECT * FROM flights LIMIT 5; ----- - -.expected output ----- -Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay -2007|1|1|1|1232|1225|1341|1340|WN|2891|N351|69|75|54|1|7|SMF|ONT|389|4|11|0||0|0|0|0|0|0 -2007|1|1|1|1918|1905|2043|2035|WN|462|N370|85|90|74|8|13|SMF|PDX|479|5|6|0||0|0|0|0|0|0 -2007|1|1|1|2206|2130|2334|2300|WN|1229|N685|88|90|73|34|36|SMF|PDX|479|6|9|0||0|3|0|0|0|31 -2007|1|1|1|1230|1200|1356|1330|WN|1355|N364|86|90|75|26|30|SMF|PDX|479|3|8|0||0|23|0|0|0|3 -2007|1|1|1|831|830|957|1000|WN|2278|N480|86|90|74|-3|1|SMF|PDX|479|3|9|0||0|0|0|0|0|0 ----- - -With any luck, things should be working just fine. - -Let's go ahead and create a trigger. A trigger is what it sounds like, given a specific action, _do_ a specific action. This is a powerful tool. One of the most common uses of a trigger that you will see in the wild is the "updated_at" field. This is a field that stores a datetime value, and uses a _trigger_ to automatically update to the current date and time anytime a record in the database is updated. - -First, we need to create a new column called "updated_at", and set the default value to something. In our case, lets set it to January 1, 1970 at 00:00:00. - -[source, sql] ----- -ALTER TABLE flights ADD COLUMN updated_at DATETIME DEFAULT '1970-01-01 00:00:00'; ----- - -If you query the table now, you will see all of the values have been properly added, great! - -[source, sql] ----- -SELECT * FROM flights LIMIT 5; ----- - -Now add a trigger called "update_updated_at" that will update the "updated_at" column to the current date and time whenever a record is updated. Check out the official documentation https://www.sqlite.org/lang_createtrigger.html[here] for examples of triggers. - -Once your trigger has been written, go ahead and test it out by updating the following record. - -[source, sql] ----- -UPDATE flights SET Year = 5555 WHERE Year = 2007 AND Month = 1 AND DayofMonth = 1 AND DayOfWeek = 1 AND DepTime = 1225 AND Origin = 'SMF'; ----- - -[source, sql] ----- -SELECT * FROM flights WHERE Year = 5555; ----- - -If it worked right, your `updated_at` column should have been updated to the current date and time, cool! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Output from connecting to the database from inside your Jupyter notebook and running the `SELECT * FROM flights WHERE Year = 5555;` query. -==== - -=== Question 2 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_x73rgj7q?wid=_983291"></iframe> -++++ - -Next, we will touch on _views_. A view is essentially a virtual table that is created from some query and given a name. Why would you want to create such a thing? Well, there could be many reasons. - -Maybe you have a complex query that you need to run frequently, and it would just be easier to see the final result with a click? 
Maybe the database has horrible naming conventions and you want to rename things in a view to make it more readable and/or queryable? - -After some thought, it may occur to you that we've had such an instance where a view could be nice using our `imdb.db` database! - -Copy the `imdb.db` to your `$SCRATCH` directory, and navigate to your `$SCRATCH` directory. - -[source,bash] ----- -cp /anvil/projects/tdm/data/movies_and_tv/imdb.db $SCRATCH -cd $SCRATCH ----- - -Sometimes, it would be nice to have the `rating` and `votes` from the `ratings` table available directly from the titles table, wouldn't it? It has been a bit of a hassle to access that information and use a JOIN whenever we've had a need to see rating information. In fact, if you think about it, the rating information living in its own table doesn't really make that much sense. - -Create a _view_ called `titles_with_ratings` that has all of the information from the `titles` table along with the `rating` and `votes` from the `ratings` table. You can find the official documentation https://www.sqlite.org/lang_createview.html[here]. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Output from connecting to the database from inside your Jupyter notebook and running `SELECT * FROM titles_with_ratings LIMIT 5;` query. -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_vhkyl6df?wid=_983291"></iframe> -++++ - -Read the offical `sqlite3` documentation for transactions https://www.sqlite.org/lang_transaction.html[here]. As you will read, you've already been using transactions each time you run a query! What we will focus on is how to use transactions to _rollback_ changes, as this is probably the most useful use case you'll run into. - -Connect to our `flights_sample.db` database from question (1), start a _deferred_ transaction, and update a row, similar to what we did before, using the following query. - -[source, sql] ----- -UPDATE flights SET Year = 7777 WHERE Year = 5555; ----- - -Now, query the record to see what it looks like. - -[source, sql] ----- -SELECT * FROM flights WHERE Year = 7777; ----- - -[NOTE] -==== -You'll notice our _trigger_ from before is still working, cool! -==== - -This is pretty great, until you realized that the year should most definitely _not_ be 7777, but rather be 5555. Oh no! Well, at this stage you haven't committed your transaction yet, so you can just _rollback_ the changes and everything will be back to normal. Give it a try (again, following the official documentation). - -After rolling back, run the following query. - -[source, sql] ----- -SELECT * FROM flights WHERE Year = 7777; ----- - -As you can see, nothing appears! Let's try with the correct year. - -[source,sql] ----- -SELECT * FROM flights WHERE Year = 5555; ----- - -Nice! Note only was our `Year` field rolled back to the original values after question (1), but our `updated_at` field was too, excellent! As you can imagine, this is pretty powerful stuff, especially if you are writing to a database and want to make sure things look right before _committing_ the changes. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- A screenshot in your Jupyter notebook showing the series of queries that demonstrated your rollback worked as planned. 
-==== - -=== Question 4 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_8kuku754?wid=_983291"></iframe> -++++ - -SQL and `sqlite3` are powerful tools, and we've barely scratched the surface. Check out the https://www.sqlite.org/docs.html[offical documentation], and demonstrate another feature of `sqlite3` that we haven't yet covered. - -Some suggestions, if you aren't interested in browsing the documentation: https://www.sqlite.org/windowfunctions.html#biwinfunc[window functions], https://www.sqlite.org/lang_mathfunc.html[math functions], https://www.sqlite.org/lang_datefunc.html[date and time functions], and https://www.sqlite.org/lang_corefunc.html[core functions] (there are many we didn't use!) - -Please make sure the queries you run are run from an sql cell in your Jupyter notebook. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 (optional, 0 pts) - -There are two other interesting features of `sqlite3`: https://www.sqlite.org/lang_savepoint.html[savepoints] (kind of a named transaction) and https://www.sqlite.org/lang_attach.html[attach and detach]. Demonstrate one or both of these functionalities and write 1-2 sentences stating whether or not you think they are practical or useful features, and why or why not? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-projects.adoc deleted file mode 100644 index 64b081219..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/20100/20100-2022-projects.adoc +++ /dev/null @@ -1,41 +0,0 @@ -= TDM 20100 - -== Project links - -[NOTE] -==== -Only the best 10 of 13 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$20100-2022-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. 
**Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -=== Sign up - -https://piazza.com/purdue/fall2022/tdm20100[https://piazza.com/purdue/fall2022/tdm20100] - -=== Link - -https://piazza.com/purdue/fall2022/tdm20100/home[https://piazza.com/purdue/fall2022/tdm20100/home] - -== Syllabus - -See xref:fall2022/logistics/syllabus.adoc[here]. diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project01.adoc deleted file mode 100644 index 1e48cdfd2..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project01.adoc +++ /dev/null @@ -1,255 +0,0 @@ -= TDM 30100: Project 1 -- 2022 - -**Motivation:** It’s been a long summer! Last year, you got some exposure command line tools, SQL, Python, and other fun topics like web scraping. This semester, we will continue to work primarily using Python with data. Topics will include things like: documentation using tools like sphinx, or pdoc, writing tests, sharing Python code using tools like pipenv, poetry, and git, interacting with and writing APIs, as well as containerization. Of course, like nearly every other project, we will be be wrestling with data the entire time. - -We will start slowly, however, by learning about Jupyter Lab. This year, instead of using RStudio Server, we will be using Jupyter Lab. In this project we will become familiar with the new environment, review some, and prepare for the rest of the semester. - -**Context:** This is the first project of the semester! We will start with some review, and set the "scene" to learn about a variety of useful and exciting topics. - -**Scope:** Jupyter Lab, R, Python, Anvil, markdown - -.Learning Objectives -**** -- Read about and understand computational resources available to you. -- Learn how to run R code in Jupyter Lab on Anvil. -- Review. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/1991.csv` -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` - -== Questions - -=== Question 1 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_5vtofjko?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_1gf9pnt2?wid=_983291"></iframe> -++++ - -For this course, projects will be solved using the https://www.rcac.purdue.edu/compute/anvil[Anvil computing cluster]. - -Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (RAM). Use the information on the provided webpages to calculate how many cores and how much memory is available _in total_ for the Anvil "sub-clusters". - -Take a minute and figure out how many cores and how much memory is available on your own computer. If you do not have a computer of your own, work with a friend to see how many cores there are, and how much memory is available, on their computer. - -[NOTE] -==== -Last year, we used the https://www.rcac.purdue.edu/compute/brown[Brown computing cluster]. 
Compare the specs of https://www.rcac.purdue.edu/compute/anvil[Anvil] and https://www.rcac.purdue.edu/compute/brown[Brown] -- which one is more powerful? -==== - -.Items to submit -==== -- A sentence explaining how many cores and how much memory is available, in total, across all nodes in the sub-clusters on Anvil. -- A sentence explaining how many cores and how much memory is available, in total, for your own computer. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Like the previous year we will be using Jupyter Lab on the Anvil cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster. - -Navigate and login to https://ondemand.anvil.rcac.purdue.edu using your ACCESS credentials (and Duo). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward. - -[TIP] -==== -If you did not (yet) setup your 2-factor authentication credentials with Duo, you can go back to Step 9 and setup the credentials here: https://the-examples-book.com/starter-guides/anvil/access-setup -==== - -Towards the middle of the top menu, there will be an item labeled btn:[My Interactive Sessions], click on btn:[My Interactive Sessions]. On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Bioinformatics, Interactive Apps, and The Data Mine. Under "The Data Mine" section, you should see a button that says btn:[Jupyter Notebook], click on btn:[Jupyter Notebook]. - -If everything was successful, you should see a screen similar to the following. - -image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"] - -Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 2 CPU cores and 3800 Mb. - -[NOTE] -==== -It is OK to not understand what that means yet, we will learn more about this in TDM 30100. For the curious, however, if you were to open a terminal session in Anvil and run the following, you would see your job queued up. - -[source,bash] ----- -squeue -u username # replace 'username' with your username ----- -==== - -[NOTE] -==== -If you select 4000 Mb of memory instead of 3800 Mb, you will end up getting 3 CPU cores instead of 2. OnDemand tries to balance the memory to CPU ratio to be _about_ 1900 Mb per CPU core. -==== - -We use the Anvil cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine. - -After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on btn:[Connect to Jupyter] to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following. - -image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"] - -There are 2 primary options that you will need to know about. - -f2022-s2023:: -The course kernel where Python code is run without any extra work, and you have the ability to run R code or SQL queries in the same environment. 
- -[TIP] -==== -To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page]. -==== - -f2022-s2023-r:: -An alternative, native R kernel that you can use for projects with _just_ R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell. - -For now, let's focus on the f2022-s2023 kernel. Click on btn:[f2022-s2023], and a fresh notebook will be created for you. - -[NOTE] -==== -Soon, we'll have the f2022-s2023-r kernel available and ready to use! -==== - -Test it out! Run the following code in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on. What is the name of the node on Anvil that you are running on? - -[source,python] ----- -import socket -print(socket.gethostname()) ----- - -[TIP] -==== -To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu. -==== - -.Items to submit -==== -- Code used to solve this problem in a "code" cell. -- Output from running the code (the name of the node on Anvil that you are running on). -==== - -=== Question 3 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_6s6gsi1e?wid=_983291"></iframe> -++++ - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_708jtb6h?wid=_983291"></iframe> -++++ - -In the upper right-hand corner of your notebook, you will see the current kernel for the notebook, `f2022-s2023`. If you click on this name you will have the option to swap kernels out -- no need to do this yet, but it is good to know! - -Practice running the following examples. - -python:: -[source,python] ----- -my_list = [1, 2, 3] -print(f'My list is: {my_list}') ----- - -SQL:: -[source, sql] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -[source, ipython] ----- -%%sql - -SELECT * FROM titles LIMIT 5; ----- - -[NOTE] -==== -In a previous semester, you'd need to load the sql extension first -- this is no longer needed as we've made a few improvements! - -[source,ipython] ----- -%load_ext sql ----- -==== - -bash:: -[source,bash] ----- -%%bash - -awk -F, '{miles=miles+$19}END{print "Miles: " miles, "\nKilometers:" miles*1.609344}' /anvil/projects/tdm/data/flights/subset/1991.csv ----- - -[TIP] -==== -To learn more about how to run various types of code using this kernel, see https://the-examples-book.com/projects/templates[our template page]. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -This year, the first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`). - -Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`. - -There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain markdown text which you can render into nicely formatted text). How many cells of each type are there in this template by default? 
- -Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A. - -.Items to submit -==== -- How many of each types of cells are there in the default template? -==== - -=== Question 5 - -Make a markdown cell containing a list of every topic and/or tool you wish was taught in The Data Mine -- in order of _most_ interested to _least_ interested. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 6 - -++++ -<iframe class="video" src="https://cdnapisec.kaltura.com/html5/html5lib/v2.79.1/mwEmbedFrame.php/p/983291/uiconf_id/29134031/entry_id/1_dsk4jniu?wid=_983291"></iframe> -++++ - -Review your Python, R, and bash skills. For each language, choose at least 1 dataset from `/anvil/projects/tdm/data`, and analyze it. Both solutions should include at least 1 custom function, and at least 1 graphic output. - -[NOTE] -==== -Your `bash` solution can be both plotless and without a custom function. -==== - -Make sure your code is complete, and well-commented. Include a markdown cell with your short analysis (1 sentence is fine), for each language. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project02.adoc deleted file mode 100644 index 403edbcd6..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project02.adoc +++ /dev/null @@ -1,275 +0,0 @@ -= TDM 30100: Project 2 -- 2022 - -**Motivation:** Documentation is one of the most critical parts of a project. https://notion.so[There] https://guides.github.com/features/issues/[are] https://confluence.atlassian.com/alldoc/atlassian-documentation-32243719.html[so] https://docs.github.com/en/communities/documenting-your-project-with-wikis/about-wikis[many] https://www.gitbook.com/[tools] https://readthedocs.org/[that] https://bit.ai/[are] https://clickhelp.com[specifically] https://www.doxygen.nl/index.html[designed] https://www.sphinx-doc.org/en/master/[to] https://docs.python.org/3/library/pydoc.html[help] https://pdoc.dev[document] https://github.com/twisted/pydoctor[a] https://swagger.io/[project], and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can't go wrong with tools like https://www.sphinx-doc.org/en/master/[Sphinx], or https://pdoc.dev[pdoc]. - -**Context:** This is the first project in a 3-project series where we explore thoroughly documenting Python code, while solving data-driven problems. - -**Scope:** Python, documentation - -.Learning Objectives -**** -- Use Sphinx to document a set of Python code. 
-- Use pdoc to document a set of Python code. -- Write and use code that serializes and deserializes data. -- Learn the pros and cons of various serialization formats. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/apple/health/watch_dump.xml` - -== Questions - -In this project we will work with `pdoc` to build some simple documentation, review some Python skills that may be rusty, and learn about a serialization and deserialization of data -- a common component to many data science and computer science projects, and a key topics to understand when working with APIs. - -For the sake of clarity, this project will have more deliverables than the "standard" `.ipynb` notebook, `.py` file containing Python code, and PDF. In this project, we will ask you to submit an additional PDF showing the documentation webpage that you will have built by the end of the project. How to do this will be made clear in the given question. - -[WARNING] -==== -Make sure to select 4096 MB of RAM for this project. Otherwise you may get an issue reading the dataset in question 3. -==== - -=== Question 1 - -Let's start by navigating to https://ondemand.anvil.rcac.purdue.edu, and launching a Jupyter Lab instance. In the previous project, you learned how to run various types of code in a Jupyter notebook (the `.ipynb` file). Jupyter Lab is actually _much_ more useful. You can open terminals on Anvil (the cluster), as well as open a an editor for `.R` files, `.py` files, or any other text-based file. - -Give it a try. In the "Other" category in the Jupyter Lab home page, where you would normally select the "f2022-s2023" kernel, instead select the "Python File" option. Upon clicking the square, you will be presented with a file called `untitled.py`. Rename this file to `firstname-lastname-project02.py` (where `firstname` and `lastname` are your first and last name, respectively). - -[TIP] -==== -Make sure you are in your `$HOME` directory when clicking the "Python File" square. Otherwise you may get an error stating you do not have permissions to create the file. -==== - -Read the https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings["3.8.2 Modules" section] of Google's Python Style Guide. Each individual `.py` file is called a Python "module". It is good practice to include a module-level docstring at the top of each module. Create a module-level docstring for your new module. Rather than giving an explanation of the module, and usage examples, instead include a short description (in your own words, 3-4 sentences) of the terms "serialization" and "deserialization". In addition, list a few (at least 2) examples of different serialization formats, and include a brief description of the format, and some advantages and disadvantages of each. Lastly, if you could break all serialization formats into 2 broad categories, what would those categories be, and why? - -[TIP] -==== -Any good answer for the "2 broad categories" will be accepted. With that being said, a hint would be to think of what the **serialized** data _looks_ like (if you tried to open it in a text editor, for example), or how it is _read_. -==== - -Save your module. 
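If you have not written a module-level docstring before, the sketch below shows the general shape of one. The prose inside is placeholder text only -- your own descriptions of serialization, deserialization, the formats you chose, and the two broad categories go there.

[source,python]
----
"""Notes on serialization and deserialization.

Serialization is ... (your 3-4 sentence description goes here), and
deserialization is ... .

Example formats:

* JSON -- a human-readable text format; advantages ..., disadvantages ...
* ... (at least one more format, with a brief description and pros/cons) ...
"""
----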
- -**Relevant topics:** xref:programming-languages:python:pdoc.adoc[pdoc], xref:programming-languages:python:sphinx.adoc[Sphinx], xref:programming-languages:python:docstrings-and-comments.adoc[Docstrings & Comments] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Now, in Jupyter Lab, open a new notebook using the "f2022-s2023" kernel. - -[TIP] -==== -You can have _both_ the Python file _and_ the notebook open in separate Jupyter Lab tabs for easier navigation. -==== - -Fill in a code cell for question 1 with a Python comment. - -[source,python] ----- -# See firstname-lastname-project02.py ----- - -For this question, read the xref:programming-languages:python:pdoc.adoc[pdoc section], and run a `bash` command to generate the documentation for your module that you created in the previous question, `firstname-lastname-project02.py`. To do this, look at the example provided in the book. Everywhere in the example in the pdoc section of the book where you see "mymodule.py" replace it with _your_ module's name -- `firstname-lastname-project02.py`. - -[CAUTION] -==== -Use `python3` **not** `python` in your command. - -We are expecting you to run the command in a `bash` cell, however, if you decide to run it in a terminal, please make sure to document your command. In addition, you'll need to run the following in order for `pdoc` to be recognized as a module. - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 ----- - -Then you can run your command. - -[source,bash] ----- -python3 -m pdoc other commands here ----- -==== - -[TIP] -==== -Use the `-o` flag to specify the output directory -- I would _suggest_ making it somewhere in your `$HOME` directory to avoid permissions issues. - -For example, I used `$HOME/output`. -==== - -Once complete, on the left-hand side of the Jupyter Lab interface, navigate to your output directory. You should see something called `firstname-lastname-project02.html`. To view this file in your browser, right click on the file, and select btn:[Open in New Browser Tab]. A new browser tab should open with your freshly made documentation. Pretty cool! - -[IMPORTANT] -==== -Ignore the `index.html` file -- we are looking for the `firstname-lastname-project02.html` file. -==== - -[TIP] -==== -You _may_ have noticed that the docstrings are (partially) markdown-friendly. Try introducing some markdown formatting in your docstring for more appealing documentation. -==== - -**Relevant topics:** xref:programming-languages:python:pdoc.adoc[pdoc], xref:programming-languages:python:sphinx.adoc[Sphinx], xref:programming-languages:python:docstrings-and-comments.adoc[Docstrings & Comments] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -[NOTE] -==== -When I refer to "watch data" I just mean the dataset for this project. -==== - -Write a function to called `get_records_for_date` that accepts an `lxml` etree (of our watch data, via `etree.parse`), and a `datetime.date`, and returns a list of Record Elements, for a given date. Raise a `TypeError` if the date is not a `datetime.date`, or if the etree is not an `lxml.etree`. - -Use the https://google.github.io/styleguide/pyguide.html#383-functions-and-methods[Google Python Style Guide's "Functions and Methods" section] to write the docstring for this function. Be sure to include type annotations for the parameters and return value. 
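If the Google docstring layout is new to you, the rough shape is sketched below. The section names (`Args`, `Raises`, `Returns`) and the annotated signature are the important parts; the descriptions are placeholders and the function body is intentionally omitted, since writing it is the point of the question.

[source,python]
----
from datetime import date

import lxml.etree


def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list[lxml.etree._Element]:
    """Return the Record elements whose startDate falls on for_date.

    Args:
        tree: The parsed watch data, as an lxml ElementTree.
        for_date: The date the returned records should match.

    Raises:
        TypeError: If tree or for_date is not of the expected type.

    Returns:
        A list of matching Record elements.
    """
    ...
----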
- -Re-generate your documentation. How does the updated documentation look? You may notice that the formatting is pretty ugly and things like "Args" or "Returns" are not really formatted in a way that makes it easy to read. - -Use the `-d` flag to specify the format as "google", and re-generate your documentation. How does the updated documentation look? - -[TIP] -==== -The following code should help get you started. - -[source,python] ----- -import lxml -import lxml.etree -from datetime import datetime, date - -def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list[lxml.etree._Element]: - # docstring goes here - - # test if `tree` is an `lxml.etree._ElementTree`, and raise TypeError if not - - # test if `for_date` is a `datetime.date`, and raise TypeError if not - - # loop through the records in the watch data using the xpath expression `/HealthData/Record` - # how to see a record, in case you want to - print(lxml.etree.tostring(record)) - - # test if the record's `startDate` is the same as `for_date`, and append to a list if it is - - # return the list of records - -# how to test this function -tree = etree.parse('/anvil/projects/tdm/data/apple/health/watch_dump.xml') -chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date() -my_records = get_records_for_date(tree, chosen_date) -my_records ----- - -.output ----- -[<Element Record at 0x7ffb7c27a440>, - <Element Record at 0x7ffb7c27a480>, - <Element Record at 0x7ffb7c27a4c0>, - <Element Record at 0x7ffb7c27a500>, - <Element Record at 0x7ffb7c27a540>, - <Element Record at 0x7ffb7c27a580>, - <Element Record at 0x7ffb7c27a5c0>, - <Element Record at 0x7ffb7c27a600>, - <Element Record at 0x7ffb7764e3c0>, - <Element Record at 0x7ffb7764e400>, - <Element Record at 0x7ffb7764e440>, - <Element Record at 0x7ffb7764e480>, - .... ----- -==== - -[TIP] -==== -The following is some code that will be helpful to test the types. - -[source,python] ----- -from datetime import datetime, date - -isinstance(some_date_object, date) # test if some_date_object is a date -isinstance(some_xml_tree_object, lxml.etree._ElementTree) # test if some_xml_tree_object is an lxml.etree._ElementTree ----- -==== - -[TIP] -==== -To loop through records, you can use the `xpath` method. - -[source,python] ----- -for record in tree.xpath('/HealthData/Record'): - # do something with record ----- -==== - -**Relevant topics:** xref:programming-languages:python:pdoc.adoc[pdoc], xref:programming-languages:python:sphinx.adoc[Sphinx], xref:programming-languages:python:docstrings-and-comments.adoc[Docstrings & Comments] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -This was _hopefully_ a not-too-difficult project that gave you some exposure to tools in the Python ecosystem, as well as chipped away at any rust you may have had with writing Python code. - -Finally, investigate the https://pdoc.dev/docs/pdoc.html[official pdoc documentation], and make at least 2 changes/customizations to your module. Some examples are below -- feel free to get creative and do something with pdoc outside of this list of options: - -- Modify the module so you do not need to pass the `-d` flag in order to let pdoc know that you are using Google-style docstrings. -- Change the logo of the documentation to your own logo (or any logo you'd like). -- Add some math formulas and change the output accordingly. -- Edit and customize pdoc's jinja2 template (or CSS). 
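For the first option in the list above, the pdoc documentation describes a module-level `__docformat__` variable that pdoc reads to decide how to parse docstrings; if that holds for the version available on Anvil, adding it to your module should make the `-d google` flag unnecessary. A minimal sketch (treat the details as something to verify against the pdoc docs):

[source,python]
----
"""Module docstring here."""

# pdoc should pick this up and parse docstrings as Google style,
# so `-d google` would no longer be needed on the command line.
__docformat__ = "google"
----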
- -[CAUTION] -==== -For this project, please submit the following files: - -- The `.ipynb` file with: - - a simple comment for question 1, - - a `bash` cell for question 2 with code that generates your `pdoc` html documentation, - - a code cell with your `get_records_for_date` function (for question 3) - - a code cell with the results of running - + -[source,python] ----- -# read in the watch data -tree = lxml.etree.parse('/anvil/projects/tdm/data/apple/health/watch_dump.xml') - -chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date() -my_records = get_records_for_date(tree, chosen_date) -my_records ----- - - a `bash` code cell with the code that generates your `pdoc` html documentation (using the google styles) - - a markdown cell describing the changes you made for question 4. -- An `.html` file with your newest set of documention (including your question 4 modifications) -==== - -**Relevant topics:** xref:programming-languages:python:pdoc.adoc[pdoc], xref:programming-languages:python:sphinx.adoc[Sphinx], xref:programming-languages:python:docstrings-and-comments.adoc[Docstrings & Comments] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project03.adoc deleted file mode 100644 index 26634e5aa..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project03.adoc +++ /dev/null @@ -1,459 +0,0 @@ -= TDM 30100: Project 3 -- 2022 - -**Motivation:** Documentation is one of the most critical parts of a project. https://notion.so[There] https://guides.github.com/features/issues/[are] https://confluence.atlassian.com/alldoc/atlassian-documentation-32243719.html[so] https://docs.github.com/en/communities/documenting-your-project-with-wikis/about-wikis[many] https://www.gitbook.com/[tools] https://readthedocs.org/[that] https://bit.ai/[are] https://clickhelp.com[specifically] https://www.doxygen.nl/index.html[designed] https://www.sphinx-doc.org/en/master/[to] https://docs.python.org/3/library/pydoc.html[help] https://pdoc.dev[document] https://github.com/twisted/pydoctor[a] https://swagger.io/[project], and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can't go wrong with tools like https://www.sphinx-doc.org/en/master/[Sphinx], or https://pdoc.dev[pdoc]. - -**Context:** This is the second project in a 2-project series where we explore thoroughly documenting Python code, while solving data-driven problems. - -**Scope:** Python, documentation - -.Learning Objectives -**** -- Use Sphinx to document a set of Python code. -- Use pdoc to document a set of Python code. -- Write and use code that serializes and deserializes data. -- Learn the pros and cons of various serialization formats. 
-**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/apple/health/watch_dump.xml` - -== Questions - -[WARNING] -==== -Please use Firefox for this project. We can't guarantee good results if you do not. - -Before you begin, open Firefox, and where you would normally put a URL, type the following, followed by enter/return. - -``` -about:config -``` - -Search for "network.cookie.sameSite.laxByDefault", and change the value to `false`, and close the tab. -==== - -=== Question 1 - -. Create a new directory in your `$HOME` directory called `project03`: `$HOME/project03` -. Create a new Jupyter notebook in that folder called project03.ipynb, based on the normal project template: `$HOME/project03/project03.ipynb` -+ -[IMPORTANT] -==== -The majority of this notebook will just contain a single bash cell with the commands used to re-generate the documentation. This is okay, and by design. The main deliverable for this project will end up being some output from the documentation generator -- this will be explicitly specified as we go along and at the end of the project. -==== -. Create a module called `firstname_lastname_project03.py` in your `$HOME/project03` directory, with the following contents. -+ -[source,python] ----- -"""This module is for project 3 for TDM 30100. - -**Serialization:** Serialization is the process of taking a set or subset of data and transforming it into a specific file format that is designed for transmission over a network, storage, or some other specific use-case. - -**Deserialization:** Deserialization is the opposite process from serialization where the serialized data is reverted back into its original form. - -The following are some common serialization formats: - -- JSON -- Bincode -- MessagePack -- YAML -- TOML -- Pickle -- BSON -- CBOR -- Parquet -- XML -- Protobuf - -**JSON:** One of the more wide-spread serialization formats, JSON has the advantages that it is human readable, and has a excellent set of optimized tools written to serialize and deserialize. In addition, it has first-rate support in browsers. A disadvantage is that it is not a fantastic format storage-wise (it takes up lots of space), and parsing large JSON files can use a lot of memory. - -**MessagePack:** MessagePack is a non-human-readable file format (binary) that is extremely fast to serialize and deserialize, and is extremely efficient space-wise. It has excellent tooling in many different languages. It is still not the *most* space efficient, or *fastest* to serialize/deserialize, and remains impossible to work with in its serialized form. - -Generally, each format is either *human-readable* or *not*. Human readable formats are able to be read by a human when opened up in a text editor, for example. Non human-readable formats are typically in some binary format and will look like random nonsense when opened in a text editor. -""" - - -import lxml -import lxml.etree -from datetime import datetime, date - - -def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list: - """ - Given an `lxml.etree` object and a `datetime.date` object, return a list of records - with the startDate equal to `for_date`. - Args: - tree (lxml.etree): The watch_dump.xml file as an `lxml.etree` object. 
- for_date (datetime.date): The date for which returned records should have a startDate equal to. - Raises: - TypeError: If `tree` is not an `lxml.etree` object. - TypeError: If `for_date` is not a `datetime.date` object. - Returns: - list: A list of records with the startDate equal to `for_date`. - """ - - if not isinstance(tree, lxml.etree._ElementTree): - raise TypeError('tree must be an lxml.etree') - - if not isinstance(for_date, date): - raise TypeError('for_date must be a datetime.date') - - results = [] - for record in tree.xpath('/HealthData/Record'): - if for_date == datetime.strptime(record.attrib.get('startDate'), '%Y-%m-%d %X %z').date(): - results.append(record) - - return results ----- -+ -[IMPORTANT] -==== -Make sure you change "firstname" and "lastname" to _your_ first and last name. -==== -+ -. In a `bash` cell in your `project03.ipynb` notebook, run the following. -+ -[source,ipython] ----- -%%bash - -cd $HOME/project03 -python3 -m sphinx.cmd.quickstart ./docs -q -p project03 -a "Firstname Lastname" -v 1.0.0 --sep ----- -+ -[IMPORTANT] -==== -Please replace "Firstname" and "Lastname" with your own name. -==== -+ -[NOTE] -==== -What do all of these arguments do? Check out https://www.sphinx-doc.org/en/master/man/sphinx-quickstart.html[this page of the official documentation]. -==== - -You should be left with a newly created `docs` directory within your `project03` directory: `$HOME/project03/docs`. The directory structure should look similar to the following. - -.contents ----- -project03<1> -├── 39000_f2021_project03_solutions.ipynb<2> -├── docs<3> -│   ├── build <4> -│   ├── make.bat -│   ├── Makefile <5> -│   └── source <6> -│   ├── conf.py <7> -│   ├── index.rst <8> -│   ├── _static -│   └── _templates -└── kevin_amstutz_project03.py<9> - -5 directories, 6 files ----- - -<1> Our module (named `project03`) folder -<2> Your project notebook (probably named something like `firstname_lastname_project03.ipynb`) -<3> Your documentation folder -<4> Your empty build folder where generated documentation will be stored -<5> The Makefile used to run the commands that generate your documentation -<6> Your source folder. This folder contains all hand-typed documentation -<7> Your conf.py file. This file contains the configuration for your documentation. -<8> Your index.rst file. This file (and all files ending in `.rst`) is written in https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html[reStructuredText] -- a Markdown-like syntax. -<9> Your module. This is the module containing the code from the previous project, with nice, clean docstrings. - -Please make the following modifications: - -. To Makefile: -+ -[source,bash] ----- -# replace -SPHINXOPTS ?= -SPHINXBUILD ?= sphinx-build -SOURCEDIR = source -BUILDDIR = build - -# with the following -SPHINXOPTS ?= -SPHINXBUILD ?= python3 -m sphinx.cmd.build -SOURCEDIR = source -BUILDDIR = build ----- -+ -. To conf.py: -+ -[source,python] ----- -# CHANGE THE FOLLOWING CONTENT FROM: - -# -- Path setup -------------------------------------------------------------- - -# If extensions (or modules to document with autodoc) are in another directory, -# add these directories to sys.path here. If the directory is relative to the -# documentation root, use os.path.abspath to make it absolute, like shown here. 
-# -# import os -# import sys -# sys.path.insert(0, os.path.abspath('.') - -# TO: - -# -- Path setup -------------------------------------------------------------- - -# If extensions (or modules to document with autodoc) are in another directory, -# add these directories to sys.path here. If the directory is relative to the -# documentation root, use os.path.abspath to make it absolute, like shown here. -# -import os -import sys -sys.path.insert(0, os.path.abspath('../..')) ----- - -Finally, with the modifications above having been made, run the following command in a `bash` cell in Jupyter notebook to generate your documentation. - -[source,bash] ----- -cd $HOME/project03/docs -make html ----- - -After complete, your module folders structure should look something like the following. - -.structure ----- -project03 -├── 39000_f2021_project03_solutions.ipynb -├── docs -│   ├── build -│   │   ├── doctrees -│   │   │   ├── environment.pickle -│   │   │   └── index.doctree -│   │   └── html -│   │   ├── genindex.html -│   │   ├── index.html -│   │   ├── objects.inv -│   │   ├── search.html -│   │   ├── searchindex.js -│   │   ├── _sources -│   │   │   └── index.rst.txt -│   │   └── _static -│   │   ├── alabaster.css -│   │   ├── basic.css -│   │   ├── custom.css -│   │   ├── doctools.js -│   │   ├── documentation_options.js -│   │   ├── file.png -│   │   ├── jquery-3.5.1.js -│   │   ├── jquery.js -│   │   ├── language_data.js -│   │   ├── minus.png -│   │   ├── plus.png -│   │   ├── pygments.css -│   │   ├── searchtools.js -│   │   ├── underscore-1.13.1.js -│   │   └── underscore.js -│   ├── make.bat -│   ├── Makefile -│   └── source -│   ├── conf.py -│   ├── index.rst -│   ├── _static -│   └── _templates -└── kevin_amstutz_project03.py - -9 directories, 29 files ----- - -Finally, let's take a look at the results! In the left-hand pane in the Jupyter Lab interface, navigate to `$HOME/project03/docs/build/html/`, and right click on the `index.html` file and choose btn:[Open in New Browser Tab]. You should now be able to see your documentation in a new tab. It should look something like the following. - -image::figure34.webp[Resulting Sphinx output, width=792, height=500, loading=lazy, title="Resulting Sphinx output"] - -[IMPORTANT] -==== -Make sure you are able to generate the documentation before you proceed, otherwise, you will not be able to continue to modify, regenerate, and view your documentation. -==== - -.Items to submit -==== -- Code used to solve this problem (in 2 Jupyter `bash` cells). -==== - -=== Question 2 - -One of the most important documents in any package or project is the `README.md` file. This file is so important that version control companies like GitHub and GitLab will automatically display it below the repositories contents. This file contains things like instructions on how to install the packages, usage examples, lists of dependencies, license links, etc. Check out some popular GitHub repositories for projects like `numpy`, `pytorch`, or any other repository you've come across that you believe does a good job explaining the project. - -In the `docs/source` folder, create a new file called `README.rst`. Choose 3-5 of the following "types" of reStruturedText from the https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html[this webpage], and create a fake README. The content can be https://www.lipsum.com/[Lorem Ipsum] type of content as long as it demonstrates 3-5 of the types of reStruturedText. 
- -- Inline markup -- Lists and quote-like blocks -- Literal blocks -- Doctest blocks -- Tables -- Hyperlinks -- Sections -- Field lists -- Roles -- Images -- Footnotes -- Citations -- Etc. - -[IMPORTANT] -==== -Make sure to include at least 1 https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#sections[section]. This counts as 1 of your 3-5. -==== - -Once complete, add a reference to your README to the `index.rst` file. To add a reference to your `README.rst` file, open the `index.rst` file in an editor and add "README" as follows. - -.index.rst -[source,rst] ----- -.. project3 documentation master file, created by - sphinx-quickstart on Wed Sep 1 09:38:12 2021. - You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. - -Welcome to project3's documentation! -==================================== - -.. toctree:: - :maxdepth: 2 - :caption: Contents: - - README - -Indices and tables -================== - -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` ----- - -[IMPORTANT] -==== -Make sure "README" is aligned with ":caption:" -- it should be 3 spaces from the left before the "R" in "README". -==== - -In a new `bash` cell in your notebook, regenerate your documentation. - -[source,ipython] ----- -%%bash - -cd $HOME/project03/docs -make html ----- - -Check out the resulting `index.html` page, and click on the links. Pretty great! - -[TIP] -==== -Things should look similar to the following images. - -image::figure35.webp[Sphinx output, width=792, height=500, loading=lazy, title="Sphinx output"] - -image::figure36.webp[Sphinx output, width=792, height=500, loading=lazy, title="Sphinx output"] -==== - -.Items to submit -==== -- Screenshot labeled "question02_results". Make sure you https://the-examples-book.com/projects/templates#including-an-image-in-your-notebook[include your screenshot correctly]. -- OR a PDF created by exporting the webpage. -==== - -=== Question 3 - -The `pdoc` package was specifically designed to generate documentation for Python modules using the docstrings _in_ the module. As you may have noticed, this is not "native" to Sphinx. - -Sphinx has https://www.sphinx-doc.org/en/master/usage/extensions/index.html[extensions]. One such extension is the https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html[autodoc] extension. This extension provides the same sort of functionality that `pdoc` provides natively. - -To use this extension, modify the `conf.py` file in the `docs/source` folder. - -[source,python] ----- -# -- General configuration --------------------------------------------------- - -# Add any Sphinx extension module names here, as strings. They can be -# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom -# ones. -extensions = [ - 'sphinx.ext.autodoc' -] ----- - -Next, update your `index.rst` file so autodoc knows which modules to extract data from. - -[source,rst] ----- -.. project3 documentation master file, created by - sphinx-quickstart on Wed Sep 1 09:38:12 2021. - You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. - -Welcome to project3's documentation! -==================================== - -.. automodule:: firstname_lastname_project03 - :members: - -.. toctree:: - :maxdepth: 2 - :caption: Contents: - - README - -Indices and tables -================== - -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` ----- - -In a new `bash` cell in your notebook, regenerate your documentation. 
Check out the resulting `index.html` page, and click on the links. Not too bad! - -.Items to submit -==== -- Screenshot labeled "question03_results". Make sure you https://the-examples-book.com/projects/templates#including-an-image-in-your-notebook[include your screenshot correctly]. -- OR a PDF created by exporting the webpage. -==== - -=== Question 4 - -Okay, while the documentation looks pretty good, clearly, Sphinx does _not_ recognize Google style docstrings. As you may have guessed, there is an extension for that. - -Add the `napoleon` extension to your `conf.py` file. - -[source,python] ----- -# -- General configuration --------------------------------------------------- - -# Add any Sphinx extension module names here, as strings. They can be -# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom -# ones. -extensions = [ - 'sphinx.ext.autodoc', - 'sphinx.ext.napoleon' -] ----- - -In a new `bash` cell in your notebook, regenerate your documentation. Check out the resulting `index.html` page, and click on the links. Much better! - -.Items to submit -==== -- Screenshot labeled "question04_results". Make sure you https://the-examples-book.com/projects/templates#including-an-image-in-your-notebook[include your screenshot correctly]. -- OR a PDF created by exporting the webpage. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project04.adoc deleted file mode 100644 index 051d464e4..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project04.adoc +++ /dev/null @@ -1,205 +0,0 @@ -= TDM 30100: Project 4 -- 2022 - -**Motivation:** Documentation is one of the most critical parts of a project. https://notion.so[There] https://guides.github.com/features/issues/[are] https://confluence.atlassian.com/alldoc/atlassian-documentation-32243719.html[so] https://docs.github.com/en/communities/documenting-your-project-with-wikis/about-wikis[many] https://www.gitbook.com/[tools] https://readthedocs.org/[that] https://bit.ai/[are] https://clickhelp.com[specifically] https://www.doxygen.nl/index.html[designed] https://www.sphinx-doc.org/en/master/[to] https://docs.python.org/3/library/pydoc.html[help] https://pdoc.dev[document] https://github.com/twisted/pydoctor[a] https://swagger.io/[project], and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can't go wrong with tools like https://www.sphinx-doc.org/en/master/[Sphinx], or https://pdoc.dev[pdoc]. - -**Context:** This is the third project in a 3-project series where we explore thoroughly documenting Python code, while solving data-driven problems. - -**Scope:** Python, documentation - -.Learning Objectives -**** -- Use Sphinx to document a set of Python code. -- Use pdoc to document a set of Python code. 
-**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_book_series.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_books.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json` - -== Questions - -=== Question 1 - -The listed datasets are fairly large, and interesting! They are `json` formatted data. Each _row_ of a single `json` file can be individually read in and processed. Take a look at a single row. - -[source,ipython] ----- -%%bash - -head -n 1 /anvil/projects/tdm/data/goodreads/goodreads_books.json ----- - -This is nice, because you can individually process a single row. Anytime you can do something like this, it is easy to break a problem into smaller pieces and speed up processing. The following demonstrates how you can read in a single line and process it. - -[source,python] ----- -import json - -with open("/anvil/projects/tdm/data/goodreads/goodreads_books.json") as f: - for line in f: - print(line) - parsed = json.loads(line) - print(f"{parsed['isbn']=}") - print(f"{parsed['num_pages']=}") - break ----- - -In this project, the overall goal will be to implement functions that perform certain operations, write the best docstrings you can, and use your choice of `pdoc` or `sphinx` to generate a pretty set of documentation. - -Begin this project by choosing a tool, `pdoc` or `sphinx`, and setting up a `firstname-lastname-project04.py` module that will host your Python functions. In addition, create a Jupyter Notebook that will be used to test out your functions, and generate your documentation. At the end of this project, your deliverable will be your `.ipynb` notebook and either a series of screenshots that captures your documentation, or a PDF created by exporting the resulting webpage of documentation. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Write a function called `scrape_image_from_url` that accepts a URL (as a string) and returns a `bytes` object of the data. Make sure `scrape_image_from_url` cleans up after itself and doesn't leave any image files on the filesystem. - -. Create a variable with a temporary file name using the `uuid` package. -. Use the `requests` package to get the response. -+ -[TIP] -==== -[source,python] ----- -import requests - -response = requests.get(url, stream=True) - -# then the first argument to copyfileobj will be response.raw ----- -==== -+ -. Open the file and use the `shutil` packages `copyfileobj` method to copy the `response.raw` to the file. -. Open the file and read the contents into a `bytes` object. -+ -[TIP] -==== -You can verify a bytes object by: - -[source,python] ----- -type(my_object) ----- - -.output ----- -bytes ----- -==== -+ -. Use `os.remove` to remove the image file. -. Return the bytes object. 
- - -You can verify your function works by running the following: - -[source,python] ----- -import shutil -import requests -import os -import uuid -import hashlib - -url = 'https://images.gr-assets.com/books/1310220028m/5333265.jpg' -my_bytes = scrape_image_from_url(url) -m = hashlib.sha256() -m.update(my_bytes) -m.hexdigest() ----- - -.output ----- -ca2d4506088796d401f0ba0a72dda441bf63ca6cc1370d0d2d1d2ab949b00d02 ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Write a function called `json_to_sql` that accepts a single row of the `goodreads_books.json` file (as a string), a table name (as a string), as well as a `set` of values to "skip". This function should then return a string that is a valid `INSERT INTO` SQL statement. See https://www.sqlitetutorial.net/sqlite-insert/[here] for an example of an `INSERT INTO` statement. - -The following is a real example you can test out. - -[source,python] ----- -with open("/anvil/projects/tdm/data/goodreads/goodreads_books.json") as f: - for line in f: - first_line = str(line) - break - -first_line ----- - -[source,python] ----- -json_to_sql(first_line, 'books', {'series', 'popular_shelves', 'authors', 'similar_books'}) ----- - -.output ----- -"INSERT INTO books (isbn,text_reviews_count,country_code,language_code,asin,is_ebook,average_rating,kindle_asin,description,format,link,publisher,num_pages,publication_day,isbn13,publication_month,edition_information,publication_year,url,image_url,book_id,ratings_count,work_id,title,title_without_series) VALUES ('0312853122','1','US','','','false','4.00','','','Paperback','https://www.goodreads.com/book/show/5333265-w-c-fields','St. Martin's Press','256','1','9780312853129','9','','1984','https://www.goodreads.com/book/show/5333265-w-c-fields','https://images.gr-assets.com/books/1310220028m/5333265.jpg','5333265','3','5400751','W.C. Fields: A Life on Film','W.C. Fields: A Life on Film');" ----- - -[TIP] -==== -Here is some (maybe) helpful logic: - -. Use the `loads` to convert json to a dict. -. Remove all key:value pairs from the dict where the key is in the `skip` set. -. Form a string of comma separated keys. -. Form a string of comma separated, single-quoted values. -. Assemble the `INSERT INTO` statement. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Create a new function, that does something interesting with one or more of these datasets. Just like _all_ the previous functions, make sure to include detailed and clear docstrings. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Generate your final documentation, and assemble and submit your deliverables: - -- `.ipynb` file testing out your functions. -- `firstname-lastname-project04.py` module that includes all of your functions, and associated docstrings. -- Screenshots and/or a PDF exported from your resulting documentation web page. Basically, something that shows us your resulting documentation. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project05.adoc deleted file mode 100644 index 3053a6b40..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project05.adoc +++ /dev/null @@ -1,149 +0,0 @@ -= TDM 30100: Project 5 -- 2022 - -**Motivation:** Code, especially newly written code, is refactored, updated, and improved frequently. It is for these reasons that testing code is imperative. Testing code is a good way to ensure that code is working as intended. When a change is made to code, you can run a suite a tests, and feel confident (or at least more confident) that the changes you made are not introducing new bugs. While methods of programming like TDD (test-driven development) are popular in some circles, and unpopular in others, what is agreed upon is that writing good tests is a useful skill and a good habit to have. - -**Context:** This is the first of a series of two projects that explore writing unit tests, and doc tests. In The Data Mine, we will focus on using `pytest`, doc tests, and `mypy`, while writing code to manipulate and work with data. - -**Scope:** Python, testing, pytest, mypy, doc tests - -.Learning Objectives -**** -- Write and run unit tests using `pytest`. -- Include and run doc tests in your docstrings, using `pytest`. -- Gain familiarity with `mypy`, and explain why static type checking can be useful. -- Comprehend what a function is, and the components of a function in Python. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_book_series.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_books.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json` - -== Questions - -=== Question 1 - -There are a variety of different testing packages: `doctest`, `unittest`, `nose`, `pytest`, etc. In addition, you can write actual tests, or even include tests in your documentation! - -For the sake of simplicity, we will stick to using two packages: `pytest` and `mypy`. - -Create a new working directory in your `$HOME` directory. - -[source,bash] ----- -mkdir $HOME/project05 ----- - -Copy the following, provided Python module to your working directory. - -[source,bash] ----- -cp /anvil/projects/tdm/data/goodreads/goodreads.py $HOME/project05 ----- - -Look at the module. Use `pytest` to run the doctests in the module. - -[TIP] -==== -See https://docs.pytest.org/en/7.1.x/how-to/doctest.html[here] for instructions on how to run the doctests using `pytest`. -==== - -[NOTE] -==== -One of the tests will fail. This is okay! We will take care of that later. -==== - -[NOTE] -==== -Run the doctests from within a `bash` cell, so the output shows in the Jupyter Notebook. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 2 - -One of the doctests failed. Why? Go ahead and fix it so the test passes. - -[WARNING] -==== -This does _not_ mean modifiy the test itself -- the test is written exactly as intended. Fix the _code_ to handle that scenario. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Add 1 more doctest to `split_json_to_n_parts`, 3 to `get_book_with_isbn`, and 2 more to `get_books_by_author_name`. In a bash cell, re-run your tests, and make sure they all pass. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Doctests are great, but a bit clunky. It is likely better to have 1 or 2 doctests for a function that documents _how_ to use the function with a concrete example, rather than putting all your tests as doctests. Think of doctests more along the lines of documenting usage, and as a bonus you get a couple extra tests to run. - -For example, the first `split_json_to_n_parts` doctest, would be much better suited as a unit test, so it doesn't crowd the readability of the docstring. Create a `test_goodreads.py` module in the same directory as your `goodreads.py` module. Move the first doctest from `split_json_to_n_parts` into a `pytest` unit test. - -In a bash cell, run the following in order to make sure the test passes. - -[source,ipython] ----- -%%bash - -cd ~/project05 -python3 -m pytest ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Include your `scrape_image_from_url` function from the previous project in your `goodreads.py`. Write at least 1 doctest and at least 1 unit test for this function. Make sure the tests pass. Run the tests from a bash cell so the graders can see the output. - -[NOTE] -==== -For this question, it is okay if the doctest and unit test test the same thing. This is all just for practice. -==== - -[WARNING] -==== -Make sure you submit the following files: - -- the `.ipynb` notebook with all cells executed and output displayed (including the output of the tests). -- the `goodreads.py` file containing all of your code. -- the `test_goodreads.py` file containing all of your unit tests (should be 2 unit tests total). -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project06.adoc deleted file mode 100644 index 7f3e13ea8..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project06.adoc +++ /dev/null @@ -1,78 +0,0 @@ -= TDM 30100: Project 6 -- 2022 - -**Motivation:** Code, especially newly written code, is refactored, updated, and improved frequently. It is for these reasons that testing code is imperative. Testing code is a good way to ensure that code is working as intended. 
When a change is made to code, you can run a suite a tests, and feel confident (or at least more confident) that the changes you made are not introducing new bugs. While methods of programming like TDD (test-driven development) are popular in some circles, and unpopular in others, what is agreed upon is that writing good tests is a useful skill and a good habit to have. - -**Context:** This is the first of a series of two projects that explore writing unit tests, and doc tests. In The Data Mine, we will focus on using `pytest`, doc tests, and `mypy`, while writing code to manipulate and work with data. - -**Scope:** Python, testing, pytest, mypy, doc tests - -.Learning Objectives -**** -- Write and run unit tests using `pytest`. -- Include and run doc tests in your docstrings, using `pytest`. -- Gain familiarity with `mypy`, and explain why static type checking can be useful. -- Comprehend what a function is, and the components of a function in Python. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data` - -== Questions - -[NOTE] -==== -We will dig in a little deeper in the next project, however, this project is designed to give you a little bit of a rest before October break. -==== - -We've provided you with two files for this project: - -. `/anvil/projects/tdm/etc/project06.py` -. `/anvil/projects/tdm/etc/test_project06.py` - -Start by copying these files to your own working directory. - -[source,ipython] ----- -%%bash - -rm -rf $HOME/project06 || true -mkdir $HOME/project06 -cp /anvil/projects/tdm/etc/project06.py $HOME/project06 -cp /anvil/projects/tdm/etc/test_project06.py $HOME/project06 ----- - -The first file, `project06.py` is a module with a bunch of functions. The second file, `test_project06.py` is the set of tests for the `project06.py` module. You can run the tests as follows. - -[source,ipython] ----- -%%bash - -cd $HOME/project06 -python3 -m pytest ----- - -The goal of this project is to fix all of the code in `project06.py` so that all of the unit tests pass. Do _not_ modify the tests in `test_project06.py`, _only_ modify the code in `project06.py`. - -. Fix `find_longest_timegap`. -. Fix `space_in_dir`. -. Fix `event_plotter`. -. Fix `player_info`. - -.Items to submit -==== -- Your modified `project06.py` file. -- Your `.ipynb` notebook file with a `bash` cell showing 100 percent of your tests passing. -- Your `.ipynb` notebook file with a markdown cell for each question, and an explanation of what was wrong, and how you fixed it. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project07.adoc deleted file mode 100644 index 7dc078a50..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project07.adoc +++ /dev/null @@ -1,145 +0,0 @@ -= TDM 30100: Project 7 -- 2022 - -**Motivation:** Code, especially newly written code, is refactored, updated, and improved frequently. It is for these reasons that testing code is imperative. Testing code is a good way to ensure that code is working as intended. When a change is made to code, you can run a suite of tests, and feel confident (or at least more confident) that the changes you made are not introducing new bugs. While methods of programming like TDD (test-driven development) are popular in some circles, and unpopular in others, what is agreed upon is that writing good tests is a useful skill and a good habit to have. - -**Context:** This is the first of a series of two projects that explore writing unit tests, and doc tests. In The Data Mine, we will focus on using `pytest`, doc tests, and `mypy`, while writing code to manipulate and work with data. - -**Scope:** Python, testing, pytest, mypy, doc tests - -.Learning Objectives -**** -- Write and run unit tests using `pytest`. -- Include and run doc tests in your docstrings, using `pytest`. -- Gain familiarity with `mypy`, and explain why static type checking can be useful. -- Comprehend what a function is, and the components of a function in Python. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/*` - -== Questions - -=== Question 1 - -In the previous project, you were given a Python module and a test module. You found bugs in the code and fixed the functions to pass the provided tests. This is good practice for fixing code based on tests. In this project, you will get an opportunity to try _test driven development_ (TDD). This is (roughly) where you write tests first, _then_ write your code to pass your tests. - -There are some good discussions on TDD https://buttondown.email/hillelwayne/archive/i-have-complicated-feelings-about-tdd-8403/[here] and https://news.ycombinator.com/item?id=32509268[here]. - -Start by choosing 1 dataset from our data directory: `/anvil/projects/tdm/data`. This will be the dataset which you operate on for the remainder of the project. Alternatively, you may scrape data from online as your "data source". - -In a markdown cell, describe 3 functions that you will write, and what those functions should do. - -Create two files: `$HOME/project07/project07.py` and `$HOME/project07/test_project07.py`. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Expand on question (1). In `$HOME/project07/project07.py` create the 3 functions, and write detailed, https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html[google style] docstrings for each function. Leave each function blank, with just a `pass` keyword. For example: - -[source,python] ----- -def my_super_awesome_function(some_parameter: str) -> str: - """
- """ - pass ----- - -[WARNING] -==== -- Make sure the reader can understand what your functions do based on the docstrings. -- Your functions should not be anything trivial like splitting a string, summing data, or anything that could be easily accomplished in a single line of code using other built-in methods. This is your chance to get creative! -==== - -[TIP] -==== -`pass` is a keyword that you can use to "pass" or just not perform any operation. Without `pass` in this function, you would get an error. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Write _at least_ 2 unit tests (using `pytest`) for _each_ function. Each `assert` counts as a unit test. - -Write an additional test for each function (again, using `pytest`). For this set of tests, experiment with features from `pytest` that you haven't tried before! https://docs.pytest.org/en/7.1.x/index.html[This] is the official `pytest` documentation. Some options could be: https://docs.pytest.org/en/7.1.x/how-to/fixtures.html[fixtures], https://docs.pytest.org/en/7.1.x/how-to/parametrize.html[parametrizing], or https://docs.pytest.org/en/7.1.x/how-to/tmp_path.html[using temporary directories and files]. - -In a bash cell, run your `pytest` tests, which should all fail. - -[source,ipython] ----- -%%bash - -cd $HOME/project07 -python3 -m pytest ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Begin writing your functions by filling in `$HOME/project07/project07.py`. Record each time your re-run your tests in a new bash cell, by running the following. - -[source,ipython] ----- -%%bash - -cd $HOME/project07 -python3 -m pytest ----- - -[WARNING] -==== -Please record each re-run of your test in a **new** bash cell -- you could end up with 10 or more cells where you've run your tests. We want to see the progression as you write the functions and how the failures change as you fix your code. You _don't_ need to record the changes you make to `project07.py`, but we _do_ want to see the results of running the tests each time you run them. -==== - -Continue to re-run tests until they all pass and your functions work as intended. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -This is _perhaps_ a very different style of writing code for you, depending on your past experiences! After giving it a try, how do you like TDD? If you haven't yet, read some article online about TDD. Do you think you should always use TDD? What are your opinions. Write a few sentences about your thoughts, and where you stand on TDD. - -[WARNING] -==== -At this end of this project you should submit the following. - -- A `.ipynb` file with your results from running your tests initially in question (3), and repeatedly, until they pass, in question (4). -- Your `project07.py` file with your passing functions, and beautiful docstrings. -- Your `test_project07.py` file with at least 9 total (passing) tests, 3 of which should explore previously mentioned "new" features of `pytest`. -==== - -.Items to submit -==== -- A few sentences in a markdown cell describing what you think about TDD. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project08.adoc deleted file mode 100644 index 646529411..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project08.adoc +++ /dev/null @@ -1,414 +0,0 @@ -= TDM 30100: Project 8 -- 2022 - -**Motivation:** Python is an incredible language that enables users to write code quickly and efficiently. When using Python to write code, you are _typically_ making a tradeoff between developer speed (the time it takes to write a functioning program) and program speed (how fast your code runs). Python code does _not_ have the advantage of easily being compiled to machine code and shared. In Python, you need to learn how to use virtual environments, and it is good to have an understanding of how to build and push a package to pypi. - -**Context:** This is the first in a series of 3 projects that focuses on setting up and using virtual environments, and creating a package. This is not intended to teach you everything, but rather, give you some exposure to the topics. - -**Scope:** Python, virtual environments, pypi - -.Learning Objectives -**** -- Explain what a virtual environment is and why it is important. -- Create, update, and use a virtual environment to run somebody else's Python code. -- Create a Python package and publish it on pypi. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` - -== Questions - -=== Question 1 - -This project will be focused on creating, updating, and understanding Python virtual environments. Since this is The Data Mine, we will pepper in some small data-related tasks, like writing functions to operate on data, but the focus is on virtual environments. - -Let's get started. - -Use https://realpython.com/python-virtual-environments-a-primer/#how-can-you-work-with-a-python-virtual-environment[this article] as your reference. First thing is first. We have a Jupyter notebook that we tend to work in, running our `bash` code in `bash` cells. This is very different than your typical environment. For this reason, let's start by popping open a terminal, and working in the terminal. - -You can open a terminal in JupyterLab by clicking on the blue "+" button in the upper left-hand corner of the Jupyter interface. Scroll down to the last row and click on the button that says "Terminal". - -Start by taking a look at which `python3` you are running. Run the following in the terminal. - -[source,bash] ----- -which python3 ----- - -Take a look at the available packages as follows. - -[source,bash] ----- -python3 -m pip list ----- - -This doesn't look right, it doesn't look like our f2022-s2023 environment, does it? It doesn't even have `pandas` installed. This is because we don't have JupyterLab configured to have our f2022-s2023 version of Python pre-loaded in a fresh terminal session. 
In fact, with this project, we aren't going to use that environment! - -[NOTE] -==== -The `f2022-s2023` environment runs inside a container. You will learn more about this later on, but suffice it to say it makes it much more difficult to do what we want to do for this project. -==== - -Instead, we are going to use the non-containerized version of Python that is running the JupyterLab instance itself! To load up this environment, run the following. - -[source,bash] ----- -module load python/jupyterlab ----- - -Then, check out how things have changed. - -[source,bash] ----- -which python3 ----- - -[source,bash] ----- -python3 -m pip list ----- - -Looks like we are getting there! Let's back up a bit and explain some things. - -What does `which python3` do? `which` will print out the absolute path to the command which would be executed. In this case, running `python3` would be the same as executing `/anvil/projects/tdm/apps/python/3.10.5/bin/python3`. - -What does the `python3 -m pip` mean? The `-m` stands for https://docs.python.org/3.8/using/cmdline.html#cmdoption-m[module-name]. In a nutshell, this ensures that the correct `pip` -- the `pip` associated with the current `python3` is used! This is important, because, if you have many versions of Python installed on your system, if environment variables aren't correctly set, it could be possible to use a completely different `pip` associated with a completely different version of Python, which could cause all sorts of errors! To prevent this, it is safer to do `python3 -m pip` instead of just `pip`. - -What does `python3 -m pip list` do? The `python3 -m pip` is the same as before. The `list` command is an argument you can pass to `pip` that lists the packages installed in the current environment. - -Perform the following operations. - -. Use `venv` to create a new virtual environment called `question01`. -. Confirm that the virtual environment has been created by running the following. -+ -[source,bash] ----- -source question01/bin/activate ----- -+ -. This should _activate_ your virtual environment. You will notice that `python3` now points to an interpreter in your virtual environment directory. -+ -[source,bash] ----- -which python3 ----- -+ -.output ----- -/path/to/question01/bin/python3 ----- -+ -. In addition, you can see the blank slate when it comes to installed Python packages. -+ -[source,bash] ----- -python3 -m pip list ----- -+ -.output ----- -Package Version ----------- ------- -pip 22.0.4 -setuptools 58.1.0 -WARNING: You are using pip version 22.0.4; however, version 22.2.2 is available. -You should consider upgrading via the '/home/x-kamstut/question01/bin/python3 -m pip install --upgrade pip' command. ----- - -.Items to submit -==== -See https://the-examples-book.com/projects/current-projects/templates#including-an-image-in-your-notebook[here] on how to properly include a screenshot in your Jupyter notebook. If you do _not_ properly submit the screenshot, you will likely lose points, so _please_ take a minute to read it. - -- Screenshot showing `source question01/bin/activate` output. -- Screenshot showing `which python3` output _after_ activating the virtual environment. -- Screenshow showing `python3 -m pip list` output _after_ activating the virtual environment. -==== - -=== Question 2 - -Okay, in question (1) you ran some commands and supposedly created your own virtual environment. You are possibly still confused on what you did or why -- that is okay! Things will _hopefully_ become more clear as you progress. 
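As an optional sanity check from the Python side, you can ask the interpreter itself where it lives. Inside an activated venv, `sys.prefix` points at the environment directory while `sys.base_prefix` still points at the interpreter the environment was created from, so comparing the two tells you whether a virtual environment is active:

[source,python]
----
import sys

print(sys.executable)                  # the python3 binary currently running
print(sys.prefix)                      # root of the active environment
print(sys.base_prefix)                 # the base interpreter the venv was built from
print(sys.prefix != sys.base_prefix)   # True when a virtual environment is active
----

Running this once before and once after `source question01/bin/activate` should make the difference obvious.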
- -Read https://realpython.com/python-virtual-environments-a-primer/#why-do-you-need-virtual-environments[this] section of the article provided in question (1). In your own words, explain 2 good reasons why virtual environments are important when using Python. Place these explanations in a markdown cell in your notebook. - -[NOTE] -==== -We are going to create and modify and destroy environments quite a bit! Don't be intimidated by messing around with your environment. -==== - -Okay, now that you've grokked why virtual environments are important, let's try to see a virtual environment in action. - -Activate your empty virtual environment from question (1) (if it is not already active). If you were to try and import the `requests` package, what do you expect would happen? If you were to deactivate your virtual environment and then try and import the `requests` package, what would you expect would happen? - -Test out both! First activate your virtual environment from question (1), and then run `python3` and try to `import requests`. Next, run `deactivate` to deactivate your virtual environment. Run `python3` and try to `import requests`. Were the results what you expected? Please include 2 screenshots -- 1 for each attempt at importing `requests`. - -[NOTE] -==== -As you should _hopefully_ see -- the virtual environments _do_ work! When a certain environment is active, only a certain set of packages is made available! Pretty cool! -==== - -.Items to submit -==== -- 1-2 sentences, _per reason_, on why virtual environments are important when using Python. -- 1 screenshot showing the attempt to import the `requests` library from within your question01 virtual environment. -- 1 screenshot showing the attempt to import the `requests` library from outside the question01 virtual environment. -==== - -=== Question 3 - -Create a Python script called `imdb.py` that accepts a single argument, `id`, and prints out the following. - -[source,bash] ----- -python3 imdb.py imdb tt4236770 ----- - -.output ----- -Title: Yellowstone -Rating: 8.6 ----- - -You can use the following as your skeleton. - -[source,python] ----- -#!/usr/bin/env python3 - -import argparse -import sqlite3 -import sys -from rich import print - -def get_info(iid: str) -> None: - """ - Given an imdb id, print out some basic info about the title. - """ - - conn = sqlite3.connect("/anvil/projects/tdm/data/movies_and_tv/imdb.db") - cur = conn.cursor() - - # make a query (fill in code here) - - # print results - print(f"Title: [bold blue]{title}[/bold blue]\nRating: [bold green]{rating}[/bold green]") - - -def main(): - parser = argparse.ArgumentParser() - subparsers = parser.add_subparsers(help="possible commands", dest="command") - some_parser = subparsers.add_parser("imdb", help="") - some_parser.add_argument("id", help="id to get info about") - - if len(sys.argv) == 1: - parser.print_help() - sys.exit(1) - - args = parser.parse_args() - - if args.command == "imdb": - get_info(args.id) - -if __name__ == "__main__": - main() ----- - -Deactivate any environment you may have active. - -[source,bash] ----- -deactivate ----- - -Confirm that the proper `python3` is active. - -[source,bash] ----- -which python3 ----- - -.output ----- -/anvil/projects/tdm/apps/python/3.10.5/bin/python3 ----- - -Now test out your script by running the following. - -[source,bash] ----- -python3 imdb.py imdb tt4236770 ----- - -What happens? Well, the package `rich` should not be installed to our current environment. Easy enough to fix, right? 
After all, we know how to make our own virtual environments now! - -Create a virtual environment called `question03`. This time, when creating your virtual environment, add an additional flag `--copies` to the very end of the command. Activate your virtual environment and confirm that we are using the correct environment. - -[source,bash] ----- -source question03/bin/activate -which python3 ----- - -Immediately trying the script again should fail, since we _still_ don't have the `rich` package installed. - -[source,bash] ----- -python3 imdb.py imdb tt4236770 ----- - -.output ----- -ModuleNotFoundError: No module named 'rich' ----- - -Okay! Use `pip` (using our `python3 -m pip` trick) to install `rich` and try to run the script again! - -Not only should the script now work, but, if you take a look at the packages installed in your environment, there should be some new additions. - -[source,bash] ----- -python3 -m pip list ----- - -.output ----- -Package Version ----------- ------- -commonmark 0.9.1 -pip 22.0.4 -Pygments 2.13.0 -rich 12.6.0 -setuptools 58.1.0 ----- - -That is awesome! You just solved the issue of not being able to run some Python code because a package was not installed for you. You did this by first creating your own custom Python virtual environment, installing the required package to your virtual environment, and then executing the code that wasn't previously working! - -.Items to submit -==== -- Screenshot showing the activation of the `question03` virtual environment, the `pip` install, and successful output of the script. -- Screenshot showing the resulting set of packages, `python3 -m pip list`, for the `question03` virtual environment. -==== - -=== Question 4 - -Okay, let's take a tiny step back to peek at a few underlying details of our `question01` and `question03` virtual environments. - -Specifically, start with the `question01` environment. The entire environment lives within that `question01` directory doesn't it? Or _does it!? - -[source,bash] ----- -ls -la question01/bin ----- - -Notice anything about the contents of the `question01` bin directory? They are symbolic links! `python3` actually points to the same interpreter that was active when we created the virtual environment, the `/anvil/projects/tdm/apps/python/3.10.5/bin/python3` interpreter! But wait, how do we have a different set of packages then, if we are using the same Python interpreter? The answer is, your Python interpreter will look in a variety of locations for your packages. By activating your virtual environment, we've altered our `PYTHONPATH`. - -If you run the following, you will see the list of directories that Python searches for packages, when importing. - -[source,python] ----- -import sys - -sys.path ----- - -.example output ----- -['', '/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10/site-packages', '/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10', '/anvil/projects/tdm/apps/python/3.10.5/lib/python310.zip', '/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10', '/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10/lib-dynload', '/home/x-kamstut/question01/lib/python3.10/site-packages'] ----- - -`sys.path` is initialized from the `PYTHONPATH` environment variable, plus some additional installation-dependent defaults. If you take a peek in `question01/lib/python3.10/site-packages`, you will see where `rich` is located. 
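A handy way to confirm exactly which copy of a package won that search is to ask the module itself. A quick sketch (run it inside whichever environment you are curious about, after `rich` has been installed there):

[source,python]
----
import sys
import rich

# the file rich was actually loaded from -- this tells you which
# site-packages directory "won" the search
print(rich.__file__)

# every location Python would have searched, in order
for p in sys.path:
    print(p)
----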
So, even if you look `/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10/site-packages`, and see that `rich` is _not_ installed in that location, because Python searches _all_ of those locations for `rich` and `rich` _is_ installed in `question01/lib/python3.10/site-packages`, it will be successfully imported! - -This begs the question, what if `/anvil/projects/tdm/apps/python/3.10.5/lib/python3.10/site-packages` has an _older_ version of `rich` installed -- which version will be imported? Well, let's test this out! - -If you look at `plotly` in the jupyterlab environment, you will see it is version 5.8.2. - -[source,python] ----- -import plotly -plotly.__version__ ----- - -.output ----- -5.8.2 ----- - -Activate your `question03` environment and install `plotly==5.10.0`. Re-run the following code. - -[source,python] ----- -import plotly -plotly.__version__ ----- - -What is your output? Is that expected? - -[WARNING] -==== -We modified this question Thursday, October 27 due to a mistake by your instructor (Kevin). If you previously did this problem, no worries, you will get credit either way. -==== - -[NOTE] -==== -If you take a look at `question03/bin/python` you will notice that they are _not_ symbolic links, but actual copies of the original interpreter! This is what the `--copies` argument did earlier on! In general, you'll likely be fine using `venv` without the `--copies` flag. -==== - -.Items to submit -==== -- Screenshots of your operations performed from start to finish for this question. -- 1-2 sentences explaining where Python looks for packages. -==== - -=== Question 5 - -Last, but certainly not least, is the important topic of _pinning_ dependencies. This practice will allow someone else to replicate the exact set of packages needed to run your Python application. - -By default, `python3 -m pip install numpy` will install the newest compatible version of numpy to your current environment. Sometimes, that version could be too new and create issues with old code. This is why pinning is important. - -You can choose to install an exact version of a package by specifying the version. For example, you could install `numpy` version 1.16, even though the newest version is (as of writing) 1.23. Just run `python3 -m pip install numpy==1.16`. - -This is great, but is there an easy way to pass an entire list of all of the packages in your current virtual environment? Yes! Yes there is! Try it out. - -[source,bash] ----- -python3 -m pip freeze > requirements.txt -cat requirements.txt ----- - -That's pretty cool! That is a specially formatted list containing a pinned set of packages. You could do the reverse as well. Create a new file called `requirements.txt` with the following contents copied and pasted. - -.requirements.txt contents ----- -commonmark==0.9.1 -plotly==5.10.0 -Pygments==2.13.0 -requests==2.2.1 -rich==12.6.0 -tenacity==8.1.0 -thedatamine==0.1.3 ----- - -You can use the `-r` option of `pip` to install all of those pinned packages to an environment. Test it out! Create another new virtual environment called `question05`, activate the environment, and use the `-r` option and the `requirements.txt` file to install all of the packages, with the exact same versions. Double check that the results are the same, and that the installed packages are identical to the `requirements.txt` file. - -Great job! Now, with some Python code, and a `requirements.txt` file, you should be able to setup a virtual environment and run your friend or co-workers code! Very cool! 
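If you would rather double check programmatically than eyeball `pip list`, a small sketch like the following compares the pinned file against what is actually installed. It assumes you run it from the directory containing `requirements.txt`, with the `question05` environment active.

[source,python]
----
# compare pinned requirements against what is actually installed
from importlib.metadata import PackageNotFoundError, version

with open("requirements.txt") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, pinned = line.partition("==")
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = "MISSING"
        status = "ok" if installed == pinned else "MISMATCH"
        print(f"{name}: pinned {pinned}, installed {installed} -> {status}")
----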
- -[NOTE] -==== -Unfortunately, there is more to this mess than meets the eye, and a _lot_ more that can go wrong. But these basics will serve you well and help you solve lots and lots of problems! -==== - -.Items to submit -==== -- Screenshots showing the results of running the bash commands from the start of this question to the end. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project09.adoc deleted file mode 100644 index d436c0566..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project09.adoc +++ /dev/null @@ -1,481 +0,0 @@ -= TDM 30100: Project 9 -- 2022 - -**Motivation:** Python is an incredible language that enables users to write code quickly and efficiently. When using Python to write code, you are _typically_ making a tradeoff between developer speed (the time it takes to write a functioning program) and program speed (how fast your code runs). Python code does _not_ have the advantage of easily being compiled to machine code and shared. In Python, you need to learn how to use virtual environments, and it is good to have an understanding of how to build and push a package to pypi. - -**Context:** This is the second in a series of 3 projects that focuses on setting up and using virtual environments, and creating a package. This is not intended to teach you everything, but rather, give you some exposure to the topics. - -**Scope:** Python, virtual environments, pypi - -.Learning Objectives -**** -- Explain what a virtual environment is and why it is important. -- Create, update, and use a virtual environment to run somebody else's Python code. -- Create a Python package and publish it on pypi. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data` - -== Questions - -=== Question 1 - -In the previous project, the author made a mistake that _may_ have caused some confusion. Many apologies! Therefore, this first question is going to be a review of what was accomplished in the previous project. - -Like the previous project, this project will consist primarily of your Jupyter notebook, with screenshots of your terminal after running commands. `bash` cells will _not_ work properly due to the way our environment is setup. For this reason, it is important to first open a terminal from within Jupyter Lab, to run commands, take screenshots, and display the screenshots in your Jupyter Notebook (your `.ipynb` file). - -**Activate our "base" image, so we start with Python 3.10 instead of Python 3.6:** - -[source,bash] ----- -module load python/jupyterlab - -# which interpreter is active? 
-which python3 - -# list our current set of python packages using pip -python3 -m pip list ----- - -**Create a virtual environment:** - -[source,bash] ----- -# create the virtual environment named p9q1 in your $HOME directory -python3 -m venv $HOME/p9q1 - -# check out what files and folders the environment consist of -ls -la $HOME/p9q1 - -# which python are we currently using? -which python3 - -# which packages? -python3 -m pip list - -# activate our newly created virtual environment -source $HOME/p9q1/bin/activate - -# which python are we currently using? -which python3 - -# what packages do we have available? -python3 -m pip list ----- - -**Activate a virtual environment:** - -[source,bash] ----- -source /path/to/my/virtual/environment/bin/activate - -# for example, if our virtual environment was called myvenv in the $HOME directory -source $HOME/myvenv/bin/activate ----- - -**Deactivate a virtual environment:** - -[source,bash] ----- -deactivate -which python3 # will no longer point to interpreter inside the virtual environment folder ----- - -**Install a single package to your virtual environment:** - -[source,bash] ----- -# first activate the virtual environment -source $HOME/p9q1/bin/activate - -# install the requests package -python3 -m pip install requests ----- - -**Pin dependencies, or make the current virtual environment rebuildable by others:** - -[source,bash] ----- -# first activate the virtual environment you'd like to pin the dependencies for -source $HOME/p9q1/bin/activate - -# next, create a requirements.txt file with all of the packages, pinned -python3 -m pip freeze > $HOME/requirements.txt ----- - -**Build a fresh virtual environment using someone else's pinned dependencies:** - -[source,bash] ----- -# create the blank environment -python3 -m venv $HOME/friendsenv - -# activate the environment -source $HOME/friendsenv/bin/activate - -# install the _exact_ packages your friend had, using their requirements.txt -python3 -m pip install -r $HOME/requirements.txt - -# verify the packages are installed -python3 -m pip list -python3 -m pip freeze ----- - -**Delete a virtual environment you don't use anymore:** - -[source,bash] ----- -# IMPORTANT: Ensure you do NOT have a typo -rm -rf $HOME/p9q1 ----- - -Run through some of those commands, until it "pretty much" clicks. Take and include at least a couple screenshots -- no need to include everything if you feel comfortable with everything shown above. - -[WARNING] -==== -Make sure to take screenshots showing your input and output from the terminal throughout this project. You final submission should show all of the steps as you walk through the project. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Wow! When you look at all of that information from question (1), virtual environments aren't really all that much work to use! - -Okay, if you _haven't_ already done something similar (some of you may have), I imagine this next statement is going to be pretty exciting. By the end of this project you will create a new virtual environment and `pip install` your very own package, from https://pypi.org/! - -Let's start by writing the heart and soul of your package -- a function that, given an imdb id, scrapes and returns the rating. - -. Create a new virtual environment to work in called `question02`. -. 
Install one or more of the following packages to your environment (you will probably want at least 2 of these to write this function): `requests`, `beautifulsoup4`, `lxml`. -. Write and test out your function, `get_rating`. - -Please include screenshots of the above steps, all the way until the end where a rating should print for an imdb title. - -[TIP] -==== -For example, https://www.imdb.com/title/tt4236770/?ref_=nv_sr_srsg_0 would have an imdb title id of tt4236770. We want the functionality to look like the following. - -[source,python] ----- -get_rating("tt4236770") ----- - -.output ----- -8.7 ----- -==== - -[TIP] -==== -You can use the following as a skeleton -- just fill in part of the xpath expression. - -[source,python] ----- -import requests -import lxml.html - -def get_rating(tid: str) -> float: - """ - Given an imdb title id, return the title's rating. - """ - resp = requests.get(f"https://www.imdb.com/title/{tid}", stream=True) - resp.raw.decode_content = True - tree = lxml.html.parse(resp.raw) - element = tree.xpath("//div[@data-testid='FILL THIS IN']/span")[0] - return float(element.text) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -The next step in this process is to organize your files. Let's make this a simple, barebones setup. - -First, let's decide on the package name. Choose a package name starting with `tdm-`. For example, `tdm-drward`. Create a project directory with the same name as you package. For example, mine would be the following. - -[source,bash] ----- -mkdir $HOME/tdm-drward ----- - -Great! This will be the name you use to install via `pip`. So in my case, it would be `python3 -m pip install tdm-drward`. - -Next, create 3 new files inside the `tdm-drward` (or equivalent) folder. - -- `LICENSE` -- `pyproject.toml` -- `README.md` - -The first file is a simple text file containing the text of your license. You can use https://choosealicense.com/ to choose a license and paste the text of the license in your `LICENSE` file. - -The third file is a `README.md` -- a simple markdown file where you will eventually keep the important instructions for your package. For now, go ahead and just leave it blank. - -The second file is a critical file that will be used to specify various bits of information about you package. For now, you can leave it blank. - -Next, create a new directory _inside_ the `$HOME/tdm-drward` package directory. Name the directory whatever you want. This will be the name that is used when importing your package. For example, I made `$HOME/tdm-drward/imdb`. For my package, I will do something like: - -[source,python] ----- -import imdb - -# or - -from imdb import get_rating ----- - -Finally, copy and paste your `get_rating` function into a new file called `imdb.py`, and drop `imdb.py` into `$HOME/tdm-drward/imdb` (or your equivalent package path). In addition, create another new file called `\\__init__.py` in the same directory. Leave it blank for now. - -[TIP] -==== -Your directory structure should look something like the following. - -[source,bash] ----- -tree $HOME/tdm-drward ----- - -.directory structure ----- -tdm-drward -├── imdb -│ ├── imdb.py -│ └── __init__.py -├── LICENSE -├── pyproject.toml -└── README.md - -1 directory, 5 files ----- -==== - -Fantastic! Now, let's create a new virtual environment called `p9q3`, activate the environment, and run the following. 
- -[source,bash] ----- -python3 -m pip install -e $HOME/tdm-drward ----- - -This will install the package to your `p9q3` virtual environment so you can test it out and see if it is working as intended! Let's go ahead and test it to see if it is doing what we want. Run `python3` to launch a Python interpreter for our virtual environment. Run the following Python code from within the interpreter. - -[source,python] ----- -import imdb # works -print(imdb.__version__) # error -imdb.get_rating("tt4236770") # error -imdb.imdb.get_rating("tt4236770") # works -from imdb import get_rating # error -get_rating("tt4236770") # error ----- - -What happens? Well, it isn't behaving exactly like we want, but we _can_ import things. - -[source,python] ----- -import imdb.imdb -imdb.imdb.get_rating("tt4236770") # will work - -from imdb.imdb import get_rating -get_rating("tt4236770") # will also work ----- - -Here is the critial part, the `\\__init__.py` file. Any directory containing a `\\__init__.py` file is the indicator that forces Python to treat the directory as a package. If you have a complex or different directory structure, you can add code to `\\__init__.py` that will clean up your imports. When a package is imported, the code in `\\__init__.py` is executed. You can read more about this https://docs.python.org/3/tutorial/modules.html[here]. - -Go ahead and add code to `\\__init__.py`. - -[source,python] ----- -from .imdb import * - -__version__ = "0.0.1" - -print("Hi! You must have imported me!") ----- - -Re-install the package. - -[source,bash] ----- -python3 -m pip install -e $HOME/tdm-drward ----- - -Now, launch a Python interpreter again and try out our original code. - -[source,python] ----- -import imdb # works, prints your message -print(imdb.__version__) # prints 0.0.1 -imdb.get_rating("tt4236770") # works -imdb.imdb.get_rating("tt4236770") # still works -from imdb import get_rating # works -get_rating("tt4236770") # works ----- - -Wow! Okay, this should start to make a bit more sense now. Go ahead and remove the silly print statement in your `\\__init__.py` -- we don't want that anymore! - -Finally, let's take a look at the `pyproject.toml` file and fill is some info about our package. - -.pyproject.toml ----- -[build-system] -requires = ["setuptools>=61.0.0", "wheel"] -build-backend = "setuptools.build_meta" - -[project] -name = "FILL IN" -version = "0.0.1" -description = "FILL IN" -readme = "README.md" -authors = [{ name = "FILL IN", email = "FILLIN@purdue.edu" }] -license = { file = "LICENSE" } -classifiers = [ - "License :: OSI Approved :: MIT License", - "Programming Language :: Python", - "Programming Language :: Python :: 3", -] -keywords = ["example", "imdb", "tutorial", "FILL IN"] -dependencies = [ - "lxml >= 4.9.1", - "requests >= 2.28.1", -] -requires-python = ">=3.10" ----- - -Be sure to fill in the "FILL IN" parts with your information! Lastly, make sure to specify any other Python packages that _your_ package depends on in the "dependencies" section. In the provided example, I require the package "lxml" of at least version 4.9.1, as well as the"requests" package with at least version 2.28.1. This makes it so when we `pip install` our package, that these other packages and _their_ dependencies are _also_ installed -- pretty cool! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Okay, to the best of our knowledge our package is ready to go and we want to make it publicly available to `pip install`. 
The next step in the process is to register an account with https://test.pypi.org https://test.pypi.org/account/register/[here]. Take note of your username and password. - -Next, confirm your email address. Open up the email you used to register and click on the link that was sent to you. - -Finally, its time to publish your package to the test package repository! - -In order to build and publish your package, we need two packages: `build` and `twine`. Let's setup a virtual environment and install those packages so we can use them! - -. Deactivate any environment that may already be active by running: `deactivate`. -. Create a new virtual environment called `p9q4`. -. Activate your `p9q4` virtual environment. -. Use `pip` to install `build` and `twine`: `python3 -m pip install build twine`. -. Build your package. -+ -[TIP] -==== -[source,bash] ----- -python3 -m build $HOME/tdm-drward ----- -==== -+ -. Check your package. -+ -[TIP] -==== -[source,bash] ----- -python3 -m twine check $HOME/tdm-drward/dist/* ----- - -You may get a warning, that is ok. -==== -+ -. Upload your package. -+ -[TIP] -==== -[source,bash] ----- -python3 -m twine upload -r testpypi $HOME/tdm-drward/dist/* ----- - -You will be prompted to enter your username and password. Enter the credentials associated with your newly created account. -==== - -Congrats! You can search for your package at https://test.pypi.org. You are ready to publish the real thing! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Okay, register for a Pypi account https://pypi.org/account/register/[here]. - -Next, verify your account by checking your associated email account and clicking on the provided link. - -At this stage, you already built your package using `python3 -m build`, so you are ready to simply upload your package! - -. Deactivate any currently active virtual environment by running: `deactivate`. -. Create a new virtual environment called `p9q5`. -. Activate your `p9q5` virtual environment. -. Use `pip` to install `twine`: `python3 -m pip install twine`. -. Upload your package: `python3 -m twine upload $HOME/tdm-drward/dist/*` -+ -[TIP] -==== -You will be prompted to enter your username and password. Enter the credentials associated with your newly created account. -==== -+ -. Fantastic! Take a look at https://pypi.org and search for your package! Even better, let's test it out! -. Your `p9q5` virtual environment should still be active, let's pip install your package! -+ -[source,python] ----- -python3 -m pip install tdm-drward ----- -+ -[TIP] -==== -Of course, replace `tdm-drward` with your package name! -==== -+ -. Finally, test it out! Launch a Python interpreter and run the following. -+ -[source,python] ----- -import imdb -imdb.get_rating("tt4236770") # success! ----- - -Congratulations! I hope you all feel empowered to create your own packages! - -[WARNING] -==== -Make sure to take screenshots showing your input and output from the terminal throughout this project. You final submission should show all of the steps as you walk through the project. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project10.adoc deleted file mode 100644 index 5b2ad07ec..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project10.adoc +++ /dev/null @@ -1,394 +0,0 @@ -= TDM 30100: Project 10 -- 2022 - -**Motivation:** Python is an incredible language that enables users to write code quickly and efficiently. When using Python to write code, you are _typically_ making a tradeoff between developer speed (the time it takes to write a functioning program) and program speed (how fast your code runs). Python code does _not_ have the advantage of easily being compiled to machine code and shared. In Python, you need to learn how to use virtual environments, and it is good to have an understanding of how to build and push a package to pypi. - -**Context:** This is the third in a series of 3 projects that focuses on setting up and using virtual environments, and creating a package. This is not intended to teach you everything, but rather, give you some exposure to the topics. - -**Scope:** Python, virtual environments, pypi - -.Learning Objectives -**** -- Explain what a virtual environment is and why it is important. -- Create, update, and use a virtual environment to run somebody else's Python code. -- Create a Python package and publish it on pypi. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -In the previous project, you had the opportunity to create a single-function package and publish it to pypi.org! While pretty exciting, we did gloss over some good-to-know tidbits of information. In this project, we will update the package from the previous project and cover some of these missing bits of information. Lastly, we will make modifications to the project to prime it for the API we will begin to build in the remaining projects! - -For simplicity, we are going to assume your package is called `tdm-kevin`, and lives in your home directory, with the following structure. - -[source,bash] ----- -tree $HOME/tdm-kevin ----- - -.directory structure ----- -tdm-kevin -├── imdb -│ ├── imdb.py -│ └── __init__.py -├── LICENSE -├── pyproject.toml -└── README.md - -1 directory, 5 files ----- - -The following are the starting contents of the author's `pyproject.toml`. - -.pyproject.toml ----- -[build-system] -requires = ["setuptools>=61.0.0", "wheel"] -build-backend = "setuptools.build_meta" - -[project] -name = "tdm-kevin" -version = "0.0.1" -description = "Get imdb ratings." 
-readme = "README.md" -authors = [{ name = "Kevin Amstutz", email = "kamstut@purdue.edu" }] -license = { file = "LICENSE" } -classifiers = [ - "License :: OSI Approved :: MIT License", - "Programming Language :: Python", - "Programming Language :: Python :: 3", -] -keywords = ["example", "imdb", "tutorial"] -dependencies = [ - "lxml >= 4.9.1", - "requests >= 2.28.1", -] -requires-python = ">=3.10" ----- - -If you look on Pypi, you will see that these bits of information directly correlate to different parts of https://pypi.org/project/tdm-kevin/0.0.1/[the associated project page]. For example, the `description` field shows up in a grey banner across the middle of the page. The `authors` appear in the meta section, etc. - -We want to take our package and go a new direction with it. We want it to end up being an API where we can query information about IMDB. Let's make the following modifications to our `pyproject.toml` file to reflect the new purpose of our package. - -. Update the `description` to describe the general idea of what our updated package will do. -. Update the contents of our `LICENSE` file to be "MIT License (MIT)" -- the text of our license was just too much on the rendered https://pypi.org/project/tdm-kevin/0.0.1/[project page]. -. Our API will use the https://fastapi.tiangolo.com/[FastAPI] package. Check out https://pypi.org/classifiers/[the pypi classifiers] and see if there is an appropriate "FastAPI" classifier. If so, please add it to our `classifiers`. -. Update the `keywords` to be any set of keywords you think is appropriate. No change is required. -. Update the `README.md` file to list the package name and a short description of the project. Could be anything, for now. - -Now, go ahead and test out our changes by building and publishing our package on https://test.pypi.org. - -. Open a terminal from within Jupyter Lab and run: `module load python/jupyterlab`. -. Create a virtual environment called `p10q01`. -. Activate the newly created virtual environment. -. Install `twine` and `build`: `python3 -m pip install twine build`. -. Build the package: `python3 -m build $HOME/tdm-kevin`. -. Check the package: `python3 -m twine check $HOME/tdm-kevin/dist/*`. -. Upload to https://test.pypi.org: `python3 -m twine upload -r testpypi $HOME/tdm-kevin/dist/* --verbose`. - -What happens in the very last step? Any ideas why this happens? The reason is that you can only upload a given version of your package only once! You already uploaded version 0.0.1 in the previous project, so this gives an error! Let's fix this in the following question. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Of course, you _could_ simply change the `version` section of your `pyproject.toml` file, as well as the `__version__` part of your `imdb.py` file, however, it makes more sense to do this programmatically. - -Install `bumpver` to your `p10q01` virtual environment. - -For this package, we will use https://semver.org/[semantic versioning]. Read the "Summary" section so you can get a quick overview. - -. Navigate to your project directory, for example, `cd $HOME/tdm-kevin`. -. Initiate `bumpver`: `python3 -m bumpver init`. -+ -This will add a new section to your `pyproject.toml` update the values to look similar to the following. 
-+ -.pyproject.toml ----- -[tool.bumpver] -current_version = "0.0.1" -version_pattern = "MAJOR.MINOR.PATCH" -commit_message = "Bump version {old_version} -> {new_version}" -commit = true -tag = true -push = false - -[tool.bumpver.file_patterns] -"pyproject.toml" = [ - 'current_version = "{version}"', - 'version = "{version}"', -] -"imdb/__init__.py" = [ - "{version}", -] ----- -+ -. Use `bumpver` to bump the version a patch number: `python3 -m bumpver update --patch`. -. Check out `pyproject.toml` and `__init__.py` and see how the version was increased -- cool! - -Finally, use `twine` to push your updates up to https://test.pypi.org followed by https://pypi.org. - -. Remove your old `dist` directory: `rm -rf $HOME/tdm-kevin/dist`. -. Build your package: `python3 -m build $HOME/tdm-kevin`. -. Upload to https://test.pypi.org: `python3 -m twine upload -r testpypi $HOME/tdm-kevin/dist/*` -. Check out your package on https://test.pypi.org to make sure it looks good. -. Once satisfied, use `twine` to upload to https://pypi.org: `python3 -m twine upload $HOME/tdm-kevin/dist/*`. -. Check the page out at https://pypi.org. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Okay! You now have version 0.0.2 of your package published. Cool beans. Let's add a barebones https://fastapi.tiangolo.com/[FastAPI API] that we will build on in future projects. - -In your `tdm-kevin/imdb` directory add the following two files. - -.\\__main__.py ----- -import argparse -import sys -import uvicorn - -def start_api(port: int): - uvicorn.run("imdb.api:app", port=port, log_level="info") - -def main(): - parser = argparse.ArgumentParser() - subparsers = parser.add_subparsers(help="possible commands", dest="command") - some_parser = subparsers.add_parser("imdb", help="") - some_parser.add_argument("-p", "--port", help="port to run on", type=int) - - if len(sys.argv) == 1: - parser.print_help() - sys.exit(1) - - args = parser.parse_args() - - if args.command == "imdb": - start_api(port = args.port) - -if __name__ == "__main__": - main() ----- - -.api.py ----- -from fastapi import FastAPI -from fastapi.templating import Jinja2Templates - - -app = FastAPI() -templates = Jinja2Templates(directory='templates/') - - -@app.get("/") -async def root(): - """ - Returns a simple message, "Hello World!" - Returns: - dict: The response JSON. - """ - return {"message": "Hello World"} ----- - -Next, install the required packages to your `p10q01` virtual environment. - -[source,bash] ----- -module load libffi/3.3 -python3 -m pip install jinja2 lxml fastapi "uvicorn[standard]" ----- - -You are now ready to _run_ your API. First, navigate to your project directory. - -[source,bash] ----- -cd $HOME/tdm-kevin ----- - -Next, run the API. - -[source,bash] ----- -python3 -m uvicorn imdb.api:app --reload --port 7777 ----- - -[IMPORTANT] -==== -If that command fails with an error stating "ERROR: Address already in use", this means that port 7777 is already in use. - -To easily find an available port that you can use, simply run the following. - -[source,bash] ----- -find_port ----- - -This will print out a port number that is available and ready to use. For example, if I got "50377" as the output, I would run the following. - -[source,bash] ----- -python3 -m uvicorn imdb.api:app --reload --port 50377 ----- - -And, unless someone started using port 50377 in the time it took to find a port and execute that line, it should work. 
-==== - -Alright, if it is working and running, open a new terminal and test it out! - -[source,bash] ----- -curl http://127.0.0.1:7777 - -# or if you are using a different port -curl http://127.0.0.1:50377 ----- - -Great! Let's kill our API by holding Ctrl on your keyboard and then pressing "c". - -Once killed, let's call this a minor upgrade and bump our version by a minor version bump. Use `bumpver` to increase our version by a minor release. - -[source,bash] ----- -cd $HOME/tdm-kevin -python3 -m bumpver update --minor ----- - -Next, let's build and push up our new package version 0.1.0! - -[source,bash] ----- -cd $HOME -rm -rf $HOME/tdm-kevin/dist -python3 -m build $HOME/tdm-kevin -python3 -m twine upload -r testpypi $HOME/tdm-kevin/dist/* - -# if all looks well at test.pypi.org -python3 -m twine upload $HOME/tdm-kevin/dist/* ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Create a new virtual environment called `p10q04`, activate the new environment, and install your package. For example, I would run the following. - -[source,bash] ----- -deactivate -module load python/jupyterlab -cd $HOME -python3 -m venv $HOME/p10q04 -source $HOME/p10q04/bin/activate -python3 -m pip install tdm-kevin ----- - -Now, let's try to run our API. - -[source,bash] ----- -python3 -m imdb ----- - -Uh oh! You probably got an error that `uvicorn` was not found! We forgot to list those extra packages as dependencies! In addition to all of that, let's make it so we can run a simple command to run our API. One thing at a time. - -First, open up your `pyproject.toml` file and update your `dependencies` to include: `fastapi>=0.85.2`, `Jinja2>=3.1.2`, `lxml>=4.9.1`, `uvicorn[standard]`. This should make it so that all of the required packages are installed into your virtual environment upon installing `tdm-kevin` (or your equivalent `tdm-` package). - -Next, add the following to your `pyproject.toml`. - ----- -[project.scripts] -run_api = "imdb.__main__:main" ----- - -This _should_ make it so after you've installed the package you can simply run something like the following in order to run the API. - -[source,bash] ----- -run_api imdb --port=7777 ----- - -Let's test it all out! - -[source,bash] ----- -cd $HOME -deactivate -source $HOME/p10q01/bin/activate -rm -rf $HOME/tdm-kevin/dist -cd $HOME/tdm-kevin -python3 -m bumpver update --patch -cd $HOME -python3 -m build $HOME/tdm-kevin -python3 -m twine upload -r $HOME/tdm-kevin/dist/* - -# if https://test.pypi.org looks good -python3 -m twine upload $HOME/tdm-kevin/dist/* ----- - -Excellent! You've just published version 0.1.1 of your package! Let's see if things worked out. - -Deactivate your virtual environment, create a new environment called `p10`, activate the environment, and install your package. For example, I would run the following. - -[source,bash] ----- -deactivate -module load python/jupyterlab -python3 -m venv $HOME/p10 -source $HOME/p10/bin/activate -module load libffi/3.3 -python3 -m pip install tdm-kevin ----- - -[WARNING] -==== -The `module load libffi/3.3` command is critical, otherwise you will likely run into an error installing your package. -==== - -Now, go ahead and give things a shot! - -[source,bash] ----- -run_api imdb --port=7777 ----- - -Very cool! Congratulations! You can use this package as a template for any other packages you may want to write! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 5 - -We've covered a _lot_ in a very short amount of time. Which parts of the last 3 projects would you want more instruction on? What lingering questions do you have? Please write at least 1 question that you'd like to have answered about the previous few projects. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project11.adoc deleted file mode 100644 index 9ca18f6d0..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project11.adoc +++ /dev/null @@ -1,294 +0,0 @@ -= TDM 30100: Project 11 -- 2022 - -**Motivation:** One of the primary ways to get and interact with data today is via APIs. APIs provide a way to access data and functionality from other applications. There are 3 very popular types of APIs that you will likely encounter in your work: RESTful APIs, GraphQL APIs, and gRPC APIs. Our focus for the remainder of the semester will be on RESTful APIs. - -**Context:** We are working on a series of projects that focus on building and using RESTful APIs. We will learn some basics about interacting and using APIs, and even build our own API. - -**Scope:** Python, APIs, requests, fastapi - -.Learning Objectives -**** -- Understand and use the HTTP methods with the `requests` library. -- Write REST APIs using the `fastapi` library to deliver data and functionality to a client. -- Identify the various components of a URL. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -[WARNING] -==== -We updated this project on Tuesday. This update includes changes to the commands run on a windows machine in question (3). We updated the commands as well as the shell (powershell instead of cmd). As noted in the updated question, please make sure to run powershell as an administrator before running the provided commands. -==== - -=== Question 1 - -[WARNING] -==== -If at any time you get stuck, please make a Piazza post and we will help you out! -==== - -For this project, we will be doing something a little different in order to _try_ to make the API development experience on Anvil more pleasant. In addition, I imagine many of you will enjoy what we are going to setup and use it for other projects (or maybe even corporate partners projects). - -Typically, when developing an API, you will have a set of code that you will update and modify. To see the results, you will run your API on a certain _port_ (for example 7777), and then interact with the API using a _client_. The most typical client is probably a web browser. So if we had an API running on port 7777, we could interact with it by navigating to `http://localhost:7777` in our browser. - -This is not so simple to do on Anvil, or at least not very enjoyable. 
While there are a variety of ways, the easiest is to use the "Desktop" app on https://ondemand.anvil.rcac.purdue.edu and use the provided editor and browser on the slow and clunky web interface. This is not ideal, and is what we want to avoid. - -Don't just take our word for it, try it out. Navigate to https://ondemand.anvil.rcac.purdue.edu and click on "Desktop" under "Interactive Apps". Choose the following: - -- Allocation: "cis220051" -- Queue: "shared" -- Wall Time in Hours: 1 -- Cores: 1 - -Then, click on the "Launch" button. Wait a minute and click on the "Launch Desktop" button when it appears. - -Now, lets copy over our example API and run it. - -. Click on Applications > Terminal Emulator -. Run the following commands: -+ -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 -cp -a /anvil/projects/tdm/etc/hithere $HOME -cd $HOME/hithere ----- -+ -. Then, find an unused port by running the following: -+ -[source,bash] ----- -find_port # 50087 ----- -+ -. In our example the output was 50087. Now run the API using that port (the port _you_ found). -+ -[source,bash] ----- -python3 -m uvicorn imdb.api:app --reload --port 50087 ----- - -Finally, the last step is to open a browser and check out the API. - -. Click on Applications > Web Browser -. First navigate to `localhost:50087` -. Next navigate to `localhost:50087/hithere/yourname` - -From here, your development process would be to modify the Python files, let the API reload with the changes, and interact with the API using the browser. This is all pretty clunky due to the slowness of the desktop-in-browser experience. In the remainder of this project we will setup something more pleasant. - -For this question, submit a screenshot of your work environment on https://ondemand.anvil.rcac.purdue.edu using the "Desktop" app. It would be best to include both the browser and terminal in the screenshot. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -The first step in this process is an easy one. Install https://code.visualstudio.com/[VS Code] on your local machine. This is a free, open source, and cross-platform editor. It is very popular and has a lot of great features that make it easy and enjoyable to use. - -For this question, submit a screenshot of your local machine with a VS Code window open. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -You may be wondering how we are going to use VS Code on your _local_ machine to develop on Anvil. The answer is we are going to use a tool called `ssh` along with a VSCode extension to make this process seamless. - -Read through https://the-examples-book.com/starter-guides/unix/ssh[this] page in order to gain a cursory knowledge of `ssh` and how to create public/private key pairs. Generate a public/private key pair on your local machine and add your public key to Anvil. For convenience, we've highlighted the steps below for both Mac and Windows. - -**Mac** - -. Open a terminal window on your local machine. -. Run the following command to generate a public/private key pair: -+ -[source,bash] ----- -ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519 ----- -+ -. Click enter twice to _not_ enter a passphrase (for convenience, if you want to follow the other instructions, feel free). -. Display the public key contents: -+ -[source,bash] ----- -cat ~/.ssh/id_ed25519.pub ----- -+ -. 
Highlight the contents of the public key and copy it to your clipboard. -. Navigate to https://ondemand.anvil.rcac.purdue.edu and click on "Clusters" > "Anvil Shell Access". -. Once presented with a terminal, run the following. -+ -[source,bash] ----- -mkdir ~/.ssh -vim ~/.ssh/authorized_keys - -# press "i" (for insert) then paste the contents of your public key on a newline -# then press Ctrl+c, and type ":wq" to save and quit - -# set the permissions -chmod 700 ~/.ssh -chmod 644 ~/.ssh/authorized_keys -chmod 644 ~/.ssh/known_hosts -chmod 644 ~/.ssh/config -chmod 600 ~/.ssh/id_ed25519 -chmod 644 ~/.ssh/id_ed25519.pub ----- -. Now, confirm that it works by opening a terminal on your local machine and typing the following. -+ -[source,bash] ----- -ssh username@anvil.rcac.purdue.edu ----- -+ -. Be sure to replace "username" with your _Anvil_ username, for example "x-kamstut". -. Upon success, you should be immediately connected to Anvil _without_ typing a password -- cool! - -**Windows** - -https://learn.microsoft.com/en-us/windows-server/administration/openssh/openssh_keymanagement[This] article may be useful. - -. Open a powershell by right clicking on the powershell app and choosing "Run as administrator". -. Run the following command to generate a public/private key pair: -+ -[source,bash] ----- -ssh-keygen -a 100 -t ed25519 ----- -+ -. Click enter twice to _not_ enter a passphrase (for convenience, if you want to follow the other instructions, feel free). -. We need to make sure the permissions are correct for your `.ssh` directory and the files therein, otherwise `ssh` will not work properly. Run the following commands from a powershell (again, make sure powershell is running as administrator by right clicking and choosing "Run as administrator"): -+ -[source,bash] ----- -# from inside a powershell -# taken from: https://superuser.com/a/1329702 -New-Variable -Name Key -Value "$env:UserProfile\.ssh\id_ed25519" -Icacls $Key /c /t /Inheritance:d -Icacls $Key /c /t /Grant ${env:UserName}:F -TakeOwn /F $Key -Icacls $Key /c /t /Grant:r ${env:UserName}:F -Icacls $Key /c /t /Remove:g Administrator "Authenticated Users" BUILTIN\Administrators BUILTIN Everyone System Users -# verify -Icacls $Key -Remove-Variable -Name Key ----- -+ -. Display the public key contents: -+ -[source,bash] ----- -type %USERPROFILE%\.ssh\id_ed25519.pub ----- -+ -. Highlight the contents of the public key and copy it to your clipboard. -. Navigate to https://ondemand.anvil.rcac.purdue.edu and click on "Clusters" > "Anvil Shell Access". -. Once presented with a terminal, run the following. -+ -[source,bash] ----- -mkdir ~/.ssh -vim ~/.ssh/authorized_keys - -# press "i" (for insert) then paste the contents of your public key on a newline -# then press Ctrl+c, and type ":wq" to save and quit - -# set the permissions -chmod 700 ~/.ssh -chmod 644 ~/.ssh/authorized_keys -chmod 644 ~/.ssh/known_hosts -chmod 644 ~/.ssh/config -chmod 600 ~/.ssh/id_ed25519 -chmod 644 ~/.ssh/id_ed25519.pub ----- -. Now, confirm that it works by opening a command prompt on your local machine and typing the following. -+ -[source,bash] ----- -ssh username@anvil.rcac.purdue.edu ----- -+ -. Be sure to replace "username" with your _Anvil_ username, for example "x-kamstut". -. Upon success, you should be immediately connected to Anvil _without_ typing a password -- cool! - -For this question, just include a sentence in a markdown cell stating whether or not you were able to get this working. 
If it is not working, the next question won't work either, so please post in Piazza for someone to help! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Finally, let's install the "Remote Explorer" or "Remote SSH" extension in VS Code. This extension will allow us to connect to Anvil from VS Code and develop on Anvil from our local machine. Once installed, click on the icon on the left-hand side of VS Code that looks like a computer screen. - -In the new menu on the left, click the little settings cog. Select the first option, which should be either `/Users/username/.ssh/config` (if on a mac) or `C:\Users\username\.ssh\config` (if on windows). This will open a file in VS Code. Add the following to the file: - -.mac config ----- -Host anvil - HostName anvil.rcac.purdue.edu - User username - IdentityFile ~/.ssh/id_ed25519 ----- - -.windows config ----- -Host anvil - HostName anvil.rcac.purdue.edu - User username - IdentityFile C:\Users\username\.ssh\id_ed25519 ----- - -Save the file and close out of it. Now, if all is well, you will see an "anvil" option under the "SSH TARGETS" menu. Right click on "anvil" and click "Connect to Host in Current Window". Wow! You will now be connected to Anvil! Try opening a file -- notice how the files are the files you have on Anvil -- that is super cool! - -Open a terminal in VS Code by pressing `Cmd+Shift+P` (or `Ctrl+Shift+P` on Windows) and typing "terminal". You should see a "Terminal: Create new terminal" option appear. Select it and you should notice a terminal opening at the bottom of your vscode window. That terminal is on Anvil too! Way cool! Run the api by running the following in the new terminal: - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 -cd $HOME/hithere -python3 -m uvicorn imdb.api:app --reload --port 50087 ----- - -If you are prompted something about port forwarding allow it. In addition open up a browser on your own computer and test out the following links: `localhost:50087` and `localhost:50087/hithere/bob`. Wow! VS Code even takes care of forwarding ports so you can access the API from the comfort of your own computer and browser! This will be extremely useful for the rest of the semester! - -For this question, submit a couple of screenshots demonstrating opening code on Anvil from VS Code on your local computer, and accessing the API from your local browser. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -There are tons of cool extensions and themes in VS Code. Go ahead and apply a new theme you like and download some extensions. - -For this question, submit a screenshot of your tricked out VS Code setup with some Python code open. Have some fun! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project12.adoc deleted file mode 100644 index 0762af19a..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project12.adoc +++ /dev/null @@ -1,263 +0,0 @@ -= TDM 30100: Project 12 -- 2022 - -**Motivation:** RESTful APIs are everywhere! At some point in time, it will likely be the source of some data that you will want to use. What better way to understand how to interact with APIs than building your own? - -**Context:** This is the second to last project in a series around APIs. In this project, we will build a minimal API that does some basic operations, and in the following project we will build on top of that API and use _templates_ to build a "frontend" for our API. - -**Scope:** Python, fastapi, VSCode - -.Learning Objectives -**** -- Understand and use the HTTP methods with the `requests` library. -- Differentiate between graphql, REST APIs, and gRPC. -- Write REST APIs using the `fastapi` library to deliver data and functionality to a client. -- Identify the various components of a URL. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` - -In addition, the following is an illustration of the database to help you understand the data. - -image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"] - -For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above. - -== Questions - -=== Question 1 - -Let's start by setting up our API, and getting a few things configured. This project will assume that you were able to connect and setup VSCode in the previous project. If you didn't do this, please go back and do that now. That project (aside from some initially incorrect windows commands) is pretty straightforward and at the end you have a super cool setup and easy way to work on Anvil using VSCode! - -. Open VSCode and connect to Anvil. -. Hold Cmd+Shift+P (or Ctrl+Shift+P) to open the command palette, search for "Terminal: Create new terminal" and hit enter. This will open a terminal in VSCode that is connected to Anvil. -. Copy over our project template into your `$HOME` directory. -+ -[source,bash] ----- -cp -r /anvil/projects/tdm/etc/imdb $HOME ----- -+ -. Open the `$HOME/imdb` directory in VSCode. -. Load up our `f2022-s2023` Python environment by running the following in the terminal in VSCode. -+ -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 ----- -+ -. Go ahead and test out the provided, minimal API by running the following in the terminal in VSCode. -+ -[source,bash] ----- -find_port # returns a port, like 7777 ----- -+ -[source,bash] ----- -python3 -m uvicorn imdb.api:app --reload --port 7777 # replace 7777 with the port you got from find_port ----- -+ -. Open a browser on your computer and navigate to `localhost:7777`, but be sure to replace 7777 with the port you got from `find_port`. You should see a message that says "Hello World!". 
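You can also poke at the running API from a second terminal instead of the browser. A tiny sketch, assuming the `requests` package is available there and substituting the port you got from `find_port`:

[source,python]
----
# run from a second terminal on the same node while the API is running;
# replace 7777 with the port you got from find_port
import requests

resp = requests.get("http://localhost:7777")
print(resp.status_code)  # 200 if the API is up
print(resp.json())       # {'message': 'Hello World'}
----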
This is a JSON response, which is why your browser is showing it in a nice format. - -No need to turn anything in for this question -- it is integral for all of the remaining questions. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -If you check out `api.py` you will find two functions. The `root` function is responsible for the "Hello World" message you received in the previous question. As you can see, we returned the JSON response, which caused the data to be rendered that way. JSON is a data format that is _very_ common in RESTful APIs. For simiplicity, we will be using JSON for all of our responses, today. - -The other function, `read_item`, is responsible for the "hi there" stuff from the previous project. If you navigate to `localhost:7777/hithere/alice` you will see a webpage that displays the name "alice". The url path parameter `{name}` turns into a variable. So if you changed `alice` to `joe`, you would see "hi there joe" instead. This is a very common pattern in RESTful APIs. - -We are going to keep things as "simple" as possible, because there are so many components to this sort of project that it is easy to get confused and have something go wrong. Our goal is to wire our API up to the database, `imdb.db`, and create an endpoint that returns some structured data (as JSON). - -It is _highly_ recommended to go through the https://fastapi.tiangolo.com/tutorial/[official documentation tour]. It is well written and may provide examples that help you understand something we do for this project, better. - -Let's start with our problem statement. We want links like the following to display data about the given title: `localhost:7777/title/tt8946378`. Where `tt8946378` is the imdb.com title id for "Knives Out". Specifically, we want to start by displaying the: `primary_title`, `premiered`, and `runtime_minutes` from the `titles` table. - -Before we can even think about displaying the data, we need to wire up our database. Create a new file in the `imdb` directory called `database.py`. Include the following content. - -[source,python] ----- -import os -import aiosql -import sqlite3 -from dotenv import load_dotenv -from pathlib import Path - -load_dotenv() - -database_path = Path(os.getenv("DATABASE_PATH")) -queries = aiosql.from_path(Path(__file__).parents[0] / "queries.sql", "sqlite3") ----- - -We are going to use the https://nackjicholson.github.io/aiosql/[`aiosql`] package to make queries to our database. This package is extremely simple (compared to other packages) and has (in my opinion) the best separation of SQL and Python code. It is also very easy to use (compared to other packages, at least). Let's walk through the code. - -. `load_dotenv()` load the environment variables from a `.env` file. Classically, the `.env` file is used to store sensitive credentials, like database passwords. In our case, our database has no password, so to demonstrate we are going to put our database path in an environment variable instead. -+ -[IMPORTANT] -==== -We haven't created a `.env` file yet, let's do that now! Create a text file named `.env` in your root directory (the outer `imdb` folder) and add the following contents: - -.env ----- -DATABASE_PATH=/anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -Now, after `load_dotenv()` is called, the `os.getenv("DATABASE_PATH")` will return the path to our database, `/anvil/projects/tdm/data/movies_and_tv/imdb.db`. -==== -+ -. 
`database_path` is simply the path loaded into a variable. -. `queries` is an object that load up all of our SQL queries from a future `queries.sql` file, and allows us to easily make SQL queries from inside Python. We will give example of this later. - -Thats it! We can then import the `queries` object in our other Python modules in order to make queries, cool! - -No need to submit anything for this question either. The `database.py` file will be submitted at the end. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Okay, we "wired" our database up, but we need to actually make a query that returns all of the information we want to display, right? - -Create a new file called `queries.sql` in the inner `imdb` directory. This file will contain all of our SQL queries. The "comments" inside this file are critical for our `aiosql` package to identify the queries and load them into our `queries` object. The following is an example of a `queries.sql` file and Python code that uses it to make queries on a fake database. - -.queries.sql ----- --- name: get_name_and_age --- Get name and age of object. -SELECT name, age FROM my_table WHERE myid = :myid; ----- - -[source,python] ----- -conn = sqlite3.connect("fake.db") -queries = aiosql.from_path("queries.sql", "sqlite3") -results = queries.get_name(conn, myid=1) -conn.close() -print(results) - -# or, the following, which automatically closes the connection - -queries = aiosql.from_path("queries.sql", "sqlite3") -conn = sqlite3.connect("fake.db") -with conn as c: - results = queries.get_name(c, myid=1) - -print(results) ----- - -.output ----- -[("bob", 42), ("alice", 37)] ----- - -Add a query called `get_title` to your `queries.sql` file. This query should return the `primary_title`, `premiered`, and `runtime_minutes` from the `titles` table. - -In your `api.py` file, add a new function that will be used to eventually return a JSON with the title information. Call this function `get_title`. - -For now, just use the `queries` object to make a query to the database, and have the function return whatever the query returns. Once implemented, test it out by navigating to `localhost:7777/title/tt8946378` in your browser. You should see a (incorrectly) rendered response, with the info we wanted to display. We are getting there! - -For this question, include a screenshot like the following, but for a different title. - -image::figure37.webp[Example output, width=792, height=500, loading=lazy, title="Example output"] - -[NOTE] -==== -If you use Chrome, your screenshot may look a bit different, that is OK. -==== - -[TIP] -==== -The `read_item` is very similar, just more complicated than our `get_title` function. You can use it as a reference. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Okay! We were able to display our data, but it is not formatted correctly, and without any context, it is hard to say what 130 represents (runtime in minutes). Let's fix that by using the `pydantic` package to create a `Title` model. This model will be used to format our data before it is returned to the user. It is good practice to have all _responses_ be formatted using `pydantic` -- that way data is always returned in a consistent, expected format. - -Read https://fastapi.tiangolo.com/tutorial/sql-databases/?h=pydantic#create-pydantic-models-schemas-for-reading-returning[this] section of the offical documentation. 
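[TIP]
====
Not sure what a `pydantic` model looks like? The following is a minimal sketch of the kind of model `schemas.py` could contain. The field names mirror the three columns from Question 3; the types are only a reasonable guess and should be adjusted to match what your query actually returns.

[source,python]
----
from typing import Optional

from pydantic import BaseModel


class Title(BaseModel):
    # assumed types -- tweak these to match the columns in the titles table
    primary_title: str
    premiered: Optional[int] = None
    runtime_minutes: Optional[int] = None
----
====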
- -Create a new file called `schemas.py` in the `imdb` directory. In this file, create a `Title` model that has all of the fields we want to display. - -In your `api.py` file, update your `get_title` function to return a `Title` object instead of the raw data from the database. - -[TIP] -==== -To take a query result and convert it to a `pydantic` model, do the following (for example). - -[source,python] ----- -queries = aiosql.from_path("queries.sql", "sqlite3") -conn = sqlite3.connect("fake.db") -with conn as c: - results = queries.get_name(c, myid=1) - -results = {key: result[0][i] for i, key in enumerate(MyModel.__fields__.keys())} -my_model = MyModel(**results) ----- -==== - -Navigate to `localhost:7777/title/tt8946378` in your browser. You should see a correctly formatted response, with the info we wanted to display. Your result should look like the following image, but for a different title. - -image::figure38.webp[Example output, width=792, height=500, loading=lazy, title="Example output"] - -Please submit the following things for this project. - -- A `.ipynb` file with a screenshot for question 3 and 4 added. -- Your `api.py` file. -- Your `database.py` file. -- Your `queries.sql` file. -- Your `schemas.py` file. - -Congratulations! You should feel accomplished! While it may not _feel_ like you did much, you wired together a database and backend API, made SQL queries from within Python, and formatted your data using `pydantic` models. That is a lot of work! Great job! Happy thanksgiving! - -[IMPORTANT] -==== -If you have any questions, please post in Piazza and we will do our best to help you out! -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 (optional, 0 points) - -Read https://fastapi.tiangolo.com/tutorial/sql-databases/?h=pydantic#__tabbed_2_3[the documentation] and update your API to include the `genres` in your response! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:projects:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project13.adoc deleted file mode 100644 index 50107bdc6..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-project13.adoc +++ /dev/null @@ -1,196 +0,0 @@ -= TDM 30100: Project 13 -- 2022 - -**Motivation:** RESTful APIs are everywhere! At some point in time, it will likely be the source of some data that you will want to use. What better way to understand how to interact with APIs than building your own? - -**Context:** This is the last project in a series around APIs. In this project, we will use templates and `jinja2` to build a "frontend" for our API. - -**Scope:** Python, fastapi, VSCode - -.Learning Objectives -**** -- Understand and use the HTTP methods with the `requests` library. -- Differentiate between graphql, REST APIs, and gRPC. 
-- Write REST APIs using the `fastapi` library to deliver data and functionality to a client. -- Identify the various components of a URL. -- Use a templating engine and HTML to display data from our API. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` - -In addition, the following is an illustration of the database to help you understand the data. - -image::figure14.webp[Database diagram from https://dbdiagram.io, width=792, height=500, loading=lazy, title="Database diagram from https://dbdiagram.io"] - -For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above. - -== Questions - -=== Question 1 - -For this project, we've provided you with a ready-made API, very similar to the results of the previous project, but with a bit more to work with. You'll be relieved to hear, however, that you will be primarily working with a discrete set of HTML template files, and not much else. - -Start this project just like you did in the previous project. - -. Open VSCode and connect to Anvil. -. Hold Cmd+Shift+P (or Ctrl+Shift+P) to open the command palette, search for "Terminal: Create new terminal" and hit enter. This will open a terminal in VSCode that is connected to Anvil. -. Copy over our project template into your `$HOME` directory. -+ -[source,bash] ----- -cp -r /anvil/projects/tdm/etc/imdb2 $HOME ----- -+ -. Open the `$HOME/imdb` directory in VSCode. -. Load up our `f2022-s2023` Python environment by running the following in the terminal in VSCode. -+ -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 ----- -+ -. Go ahead and test out the provided, minimal API by running the following in the terminal in VSCode. -+ -[source,bash] ----- -find_port # returns a port, like 7777 ----- -+ -[source,bash] ----- -python3 -m uvicorn imdb.api:app --reload --port 7777 # replace 7777 with the port you got from find_port ----- -+ -. Open a browser on your computer and navigate to `localhost:7777`, but be sure to replace 7777 with the port you got from `find_port`. You should see a message that says "Hello World!". This is a JSON response, which is why your browser is showing it in a nice format. - -Finally, check out the new code base and the following new endpoints. - -- `localhost:7777/api/titles/tt4236770` -- `localhost:7777/api/cast/tt4236770` -- `localhost:7777//api/person/nm0000126` - -Like before, these endpoints all return appropriately formatted JSON objects. Within our Python code, we have nice `pydantic` objects to work with. However, we want to display all of this data in a nice, human-readable format. This is often referred to as a _frontend_. Often times a frontend will use a completely different set of technologies, and simply use the API to fetch specially structure data. In this case, we are going to use `fastapi` to build our frontend. We can do this by using a templating engine (built into `fastapi`), and in this case, we will be using `jinja2`. - -For this question, you don't need to submit anything, as you'll need to have all of it working to continue. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 2 - -Check out the one and only template provided in the `templates` directory, `hithere.html`. When navigating to `localhost:7777/hithere/alice` you'll be greeted with a message saying "Hi there: alice!". - -[IMPORTANT] -==== -You will need to update the `hithere` function URL to use the port you are using instead of port 7777. -==== - -The content of the template is simple. - -[source,html] ----- -<html> - <head> - <title>Hi there! - - -

    </title>
    </head>
    <body>
        Hi there: {{ name.my_name }}!
    </body>
</html>

- - ----- - -When you navigate to `localhost:7777/hithere/alice` `fastapi` sends a request to our api endpoint `localhost:7777/api/hithere/alice` and send the response to our template, `hithere.html`. The template can then access the name by surrounding the variable with double curly braces and dot notation. - -This whole process emulates what a regular frontend would do. First make a request to get the data (in our case, in JSON format), then pass the response to some sort of frontend system (in our case a template engine that chooses how to display the data). - -Let's start by creating a single new HTML template called `title.html`. This template will be used to display the information about a single title. The template should be located in the `templates` directory. Let's start the template with a basic HTML skeleton. - -[source,html] ----- - - - Title - - -
<html>
    <head>
        <title>Title</title>
    </head>
    <body>
    </body>
</html>
- - ----- - -Create a new endpoint in `api.py`: `localhost:7777/titles/{some_title_id}`. This endpoint should behave similarly to the `hithere` function. It should first make a request to our api, `localhost:7777/api/titles/{some_title_id}`, and then pass the response along to the `title.html` template. - -Once complete, go back to your `title.html` template, and modify it so it displays the `primary_title` in an `h1` tag. In addition, display the rest of the data _except_ the `genres`. You can choose how to display, or rather, what HTML tags to use to display the remaining data. - -Test it out by navigating to: `localhost:7777/titles/tt4236770`. - -[TIP] -==== -Check out this post for examples on accessing data, using conditionals (if/else), and loops in `jinja2`. - -https://realpython.com/primer-on-jinja-templating/#get-started-with-jinja -==== - -.Items to submit -==== -- A screenshot displaying the webpage for `localhost:7777/titles/tt4236770`. -==== - -=== Question 3 - -In the previous question, you learned how to take a request and modify the template to display the structured data returned from the request (the response) using `jinja2` templating. - -In the previous question, you displayed data for a title _except_ for the genre data. The genre data is a list of strings. To access the genres from within a `jinja2` template, you will need to loop through the genres and display them. See https://realpython.com/primer-on-jinja-templating/#leverage-for-loops[this] article for an example. _How_ you decide to display the data (what HTML tags to use) is up to you! - -.Items to submit -==== -- A screenshot displaying the webpage for `localhost:7777/titles/tt4236770`. -==== - -=== Question 4 - -Practice makes perfect. Create a new template called `person.html`. As you may guess, we want this template to display the name of the person of interest, and a list of the `primary_title` for all of their works. Create a new endpoint at `localhost:7777/person/{some_person_id}`. This endpoint should first make a request to our api at `localhost:7777/api/person/{some_person_id}` and then pass the response along to the `person.html` template. - -How you display the data is up to you. I displayed the name of the person in a big h1 tag and listed all of the `primary_title` data in a list of p tags. It doesn't need to be pretty! - -.Items to submit -==== -- A screenshot displaying the webpage for `localhost:7777/person/nm0000126`. -==== - -=== Question 5 - -Create a new template called `cast.html`. As you may guess, we want this template to display the cast for a given a title. Create a new endpoint at `localhost:7777/cast/{some_title_id}`. This endpoint should first make a request to our api at `localhost:7777/api/cast/{some_title_id}` and then pass the response along to the `cast.html` template. - -This should be _extremely_ similar to question (3)! Please have a nice h1 header with the name of the title, and a list of cast members. We are only going to include 1 small twist. For every cast member name you display, make the cast member name itself be a link that links back to the person's page (created in the previous question). This way, when you navigate to `localhost:7777/cast/tt4236770`, you can click on any of the cast member names and be taken to their page. Very cool! - -.Items to submit -==== -- A screenshot displaying the webpage for `http://localhost:7777/cast/tt4236770`. -- A screenshot displaying the webpage for one of the cast members (someone other than Kevin Costner). 
-==== - -=== Question 6 (optional, 0 points) - -Update the `title.html` template so that the primary title is displayed in green if the rating of the title is 8.0 or higher, and red otherwise. - -.Items to submit -==== -- A screenshot displaying an instance where the page is displayed in green and an instance where the page is displayed in red. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:projects:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-projects.adoc deleted file mode 100644 index 7050efaa8..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/30100/30100-2022-projects.adoc +++ /dev/null @@ -1,41 +0,0 @@ -= TDM 30100 - -== Project links - -[NOTE] -==== -Only the best 10 of 13 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$30100-2022-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -=== Sign up - -https://piazza.com/purdue/fall2022/tdm30100[https://piazza.com/purdue/fall2022/tdm30100] - -=== Link - -https://piazza.com/purdue/fall2022/tdm30100/home[https://piazza.com/purdue/fall2022/tdm30100/home] - -== Syllabus - -See xref:fall2022/logistics/syllabus.adoc[here]. diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project01.adoc deleted file mode 100644 index 98e34cbef..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project01.adoc +++ /dev/null @@ -1,311 +0,0 @@ -= TDM 40100: Project 1 -- 2022 - -**Motivation:** It's been a long summer! Last year, you got some exposure command line tools, SQL, Python, and other fun topics like web scraping. This semester, we will continue to work primarily using Python _with_ data. 
Topics will include things like: documentation using tools like sphinx, or pdoc, writing tests, sharing Python code using tools like pipenv, poetry, and git, interacting with and writing APIs, as well as containerization. Of course, like nearly every other project, we will be be wrestling with data the entire time. - -We will start slowly, however, by learning about Jupyter Lab. This year, instead of using RStudio Server, we will be using Jupyter Lab. In this project we will become familiar with the new environment, review some, and prepare for the rest of the semester. - -**Context:** This is the first project of the semester! We will start with some review, and set the "scene" to learn about a variety of useful and exciting topics. - -**Scope:** Jupyter Lab, R, Python, Anvil, markdown, lmod - -.Learning Objectives -**** -- Read about and understand computational resources available to you. -- Learn how to run R code in Jupyter Lab on Anvil. -- Review, mess around with `lmod`. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/1991.csv` -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` - -== Questions - -=== Question 1 - -++++ - -++++ - -++++ - -++++ - -For this course, projects will be solved using the https://www.rcac.purdue.edu/compute/anvil[Anvil computing cluster]. - -Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (RAM). Use the information on the provided webpages to calculate how many cores and how much memory is available _in total_ for the Anvil "sub-clusters". - -Take a minute and figure out how many cores and how much memory is available on your own computer. If you do not have a computer of your own, work with a friend to see how many cores there are, and how much memory is available, on their computer. - -[NOTE] -==== -Last year, we used the https://www.rcac.purdue.edu/compute/brown[Brown computing cluster]. Compare the specs of https://www.rcac.purdue.edu/compute/anvil[Anvil] and https://www.rcac.purdue.edu/compute/brown[Brown] -- which one is more powerful? -==== - -.Items to submit -==== -- A sentence explaining how many cores and how much memory is available, in total, across all nodes in the sub-clusters on Anvil. -- A sentence explaining how many cores and how much memory is available, in total, for your own computer. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Like the previous year we will be using Jupyter Lab on the Anvil cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster. - -Navigate and login to https://ondemand.anvil.rcac.purdue.edu using your ACCESS credentials (and Duo). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward. - -[TIP] -==== -If you did not (yet) setup your 2-factor authentication credentials with Duo, you can go back to Step 9 and setup the credentials here: https://the-examples-book.com/starter-guides/anvil/access-setup -==== - -Towards the middle of the top menu, there will be an item labeled btn:[My Interactive Sessions], click on btn:[My Interactive Sessions]. 
On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Bioinformatics, Interactive Apps, and The Data Mine. Under "The Data Mine" section, you should see a button that says btn:[Jupyter Notebook], click on btn:[Jupyter Notebook]. - -If everything was successful, you should see a screen similar to the following. - -image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"] - -Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 2 CPU cores and 3800 Mb. - -[NOTE] -==== -It is OK to not understand what that means yet, we will learn more about this in TDM 30100. For the curious, however, if you were to open a terminal session in Anvil and run the following, you would see your job queued up. - -[source,bash] ----- -squeue -u username # replace 'username' with your username ----- -==== - -[NOTE] -==== -If you select 4000 Mb of memory instead of 3800 Mb, you will end up getting 3 CPU cores instead of 2. OnDemand tries to balance the memory to CPU ratio to be _about_ 1900 Mb per CPU core. -==== - -We use the Anvil cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine. - -After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on btn:[Connect to Jupyter] to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following. - -image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"] - -There are 2 primary options that you will need to know about. - -f2022-s2023:: -The course kernel where Python code is run without any extra work, and you have the ability to run R code or SQL queries in the same environment. - -[TIP] -==== -To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page]. -==== - -f2022-s2023-r:: -An alternative, native R kernel that you can use for projects with _just_ R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell. - -For now, let's focus on the f2022-s2023 kernel. Click on btn:[f2022-s2023], and a fresh notebook will be created for you. - -[NOTE] -==== -Soon, we'll have the f2022-s2023-r kernel available and ready to use! -==== - -Test it out! Run the following code in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on. What is the name of the node on Anvil that you are running on? - -[source,python] ----- -import socket -print(socket.gethostname()) ----- - -[TIP] -==== -To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu. -==== - -.Items to submit -==== -- Code used to solve this problem in a "code" cell. -- Output from running the code (the name of the node on Anvil that you are running on). -==== - -=== Question 3 - -++++ - -++++ - -++++ - -++++ - -In the upper right-hand corner of your notebook, you will see the current kernel for the notebook, `f2022-s2023`. 
If you click on this name you will have the option to swap kernels out -- no need to do this yet, but it is good to know! - -Practice running the following examples. - -python:: -[source,python] ----- -my_list = [1, 2, 3] -print(f'My list is: {my_list}') ----- - -SQL:: -[source, sql] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -[source, ipython] ----- -%%sql - -SELECT * FROM titles LIMIT 5; ----- - -[NOTE] -==== -In a previous semester, you'd need to load the sql extension first -- this is no longer needed as we've made a few improvements! - -[source,ipython] ----- -%load_ext sql ----- -==== - -bash:: -[source,bash] ----- -%%bash - -awk -F, '{miles=miles+$19}END{print "Miles: " miles, "\nKilometers:" miles*1.609344}' /anvil/projects/tdm/data/flights/subset/1991.csv ----- - -[TIP] -==== -To learn more about how to run various types of code using this kernel, see https://the-examples-book.com/projects/templates[our template page]. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -This year, the first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`). - -Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`. - -There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain markdown text which you can render into nicely formatted text). How many cells of each type are there in this template by default? - -Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A. - -.Items to submit -==== -- How many of each types of cells are there in the default template? -==== - -=== Question 5 - -Make a markdown cell containing a list of every topic and/or tool you wish was taught in The Data Mine -- in order of _most_ interested to _least_ interested. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 6 - -Review your Python, R, and bash skills. For each language, choose at least 1 dataset from `/anvil/projects/tdm/data`, and analyze it. Both solutions should include at least 1 custom function, and at least 1 graphic output. - -[NOTE] -==== -Your `bash` solution can be both plotless and without a custom function. -==== - -Make sure your code is complete, and well-commented. Include a markdown cell with your short analysis (1 sentence is fine), for each language. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 7 - -++++ - -++++ - -The module system, `lmod`, is extremely popular on HPC (high performance computing) systems. Anvil is no exception! - -In a terminal, take a look at the modules available to you by default. - -[source,bash] ----- -module avail ----- - -Notice that at the very top, you'll have a list named: `/anvil/projects/tdm/opt/lmod`. - -Now run the following. - -[source,bash] ----- -module reset -module avail ----- - -Notice how the set of available modules changes! 
By default, we have it loaded up with some Datamine-specific modules. To manually load up those modules, run the following. - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module avail ----- - -Notice how at the very top, there is a new section named `/anvil/projects/tdm/opt/core` with a single option, `tdm/default`. - -Go ahead and load up `tdm/default`. - -[source,bash] ----- -module load tdm -module avail ----- - -It looks like we are (pretty much) back to where we started off! This is useful to know in case there is ever a situation where you'd like to SSH into Anvil and load up our version of Python with the packages we have ready-made for you to use. - -To finish off this "question", run the following and make a note in your notebook what the result is. - -[source,bash] ----- -which python3 ----- - -Okay, now, load up our `python/f2022-s2023` module and run `which python3` once again. What is the result? Surprised by the result? Any ideas what this is doing? If you are curious, feel free to ask in Piazza! Otherwise, congratulations, you've made it through the first project! - -.Items to submit -==== -- `which python3` before and after loading the `python/f2022-s2023` module. -- Any other commentary you'd like to include. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project02.adoc deleted file mode 100644 index 06dbde988..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project02.adoc +++ /dev/null @@ -1,231 +0,0 @@ -= TDM 40100: Project 2 -- 2022 - -**Motivation:** The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a `sqlite3` database, design a schema, insert data, create indexes, etc. This series of projects is focused around SQL, `sqlite3`, with the opportunity to use other skills you've built throughout the previous years. - -**Context:** In TDM 20100 (formerly STAT 29000), you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with `sqlite3` -- a powerful database engine. In this project (and following projects), we will branch into SQL and `sqlite3`-specific topics and techniques that you haven't yet had exposure to in The Data Mine. - -**Scope:** `sqlite3`, lmod, SQL - -.Learning Objectives -**** -- Create your own `sqlite3` database file. -- Analyze a large dataset and formulate `CREATE TABLE` statements designed to store the data. -- Insert data into your database. -- Run one or more queries to test out the end result. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_book_series.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_books.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json` - -== Questions - -=== Question 1 - -The goodreads dataset has a variety of files: `/anvil/projects/tdm/data/goodreads/original`. With that being said there are 4 files which hold the bulk of the data. The rest is _mostly_ derivitives of those 4 files. - -- `/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_book_series.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_books.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json` - -Take a look at the 4 files included in this dataset. How many bytes of data total do the 4 files take up on the filesystem? - -[TIP] -==== -You can use `du` in a `bash` cell to get this information. -==== - -_Approximately_ how many books and how many reviews are included in the datasets? - -Finally, take a look at the first book. - ----- -{"isbn": "0312853122", "text_reviews_count": "1", "series": [], "country_code": "US", "language_code": "", "popular_shelves": [{"count": "3", "name": "to-read"}, {"count": "1", "name": "p"}, {"count": "1", "name": "collection"}, {"count": "1", "name": "w-c-fields"}, {"count": "1", "name": "biography"}], "asin": "", "is_ebook": "false", "average_rating": "4.00", "kindle_asin": "", "similar_books": [], "description": "", "format": "Paperback", "link": "https://www.goodreads.com/book/show/5333265-w-c-fields", "authors": [{"author_id": "604031", "role": ""}], "publisher": "St. Martin's Press", "num_pages": "256", "publication_day": "1", "isbn13": "9780312853129", "publication_month": "9", "edition_information": "", "publication_year": "1984", "url": "https://www.goodreads.com/book/show/5333265-w-c-fields", "image_url": "https://images.gr-assets.com/books/1310220028m/5333265.jpg", "book_id": "5333265", "ratings_count": "3", "work_id": "5400751", "title": "W.C. Fields: A Life on Film", "title_without_series": "W.C. Fields: A Life on Film"} ----- - -As you can see, there is an `image_url` included for each book. Use `bash` tools to download one of the images to `$HOME/p02output`. How much space does it take up (in bytes)? - -[TIP] -==== -Use `wget` to download the image. Rather than using `cd` to first navigate to `$HOME/p02output` before using `wget` to download the image, instead, use a `wget` _option_ to specify the directory to download to. -==== - -[NOTE] -==== -It is okay to manually copy/paste the link from the json. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -We decided we want to download more than 1 image in order to approximate how much space the images take on average. - -[IMPORTANT] -==== -In the previous question we said it was okay to manually copy/paste the `image_url` -- this time, you _probably_ won't want to do that. You can use a `bash` tool called `jq` to extract the links automatically. `jq` is located `/anvil/projects/tdm/bin/jq`. - -The `--raw-output` option to `jq` will be useful as well. -==== - -Use `bash` tools (and only `bash` tools, from within a `bash` cell) to download 25 **random** book images to `$HOME/p02output`, and calculate the average amount of space that each image takes up. 
Use that information to estimate how much space it would take to store the images for all of the book in the dataset. - -[TIP] -==== -Take a look at the `shuf` command in `bash`: `man shuf`. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Okay, so _roughly_, in total, we are looking at around 34 gb of data. With that size it will _definitely_ be useful for us to create a database. After all, answering questions like: - -- What is the average rating of Brandon Sandersons books? -- What are the titles of the 5 books with the most number of ratings? -- Etc. - -Is not very straightforward if we handed you this data and said "get that info please", _but_, if we had a nice `sqlite` database -- it would be easy! So let's start planning this out. - -First, before we do that, it would make sense to get a sample of each of the datasets. Working with samples just makes it a lot easier to load the data up and parse through it. - -Use `shuf` to get a random sample of the `goodreads_books.json` and `goodreads_reviews_dedup.json` datasets. Approximate how many rows you'd need in order to get the datasets down to around 100 mb each, and do so. Put the samples, and copies of `goodreads_book_authors.json` and `goodreads_book_series.json` in `$HOME/goodreads_samples`. - -[NOTE] -==== -It just needs to be approximately 100mb -- no need to fuss, as long as it is within 50mb we are good. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Check out the 5 storage classes (which you can think of as types) that `sqlite3` uses: https://www.sqlite.org/datatype3.html - -In a markdown cell, write out each of the keys in each of the json files, and list the appropriate storage class to use. For example, I've provided you an example of what we are looking for, for the `goodreads_reviews_dedup.json`. - -- user_id: TEXT -- book_id: INTEGER -- review_id: TEXT -- rating: INTEGER -- review_text: TEXT -- date_added: TEXT -- date_updated: TEXT -- read_at: TEXT -- started_at: TEXT -- n_votes: INTEGER -- n_comments: INTEGER - -[NOTE] -==== -You don't need to copy/paste the solution for `goodreads_reviews_dedup.json` since we provided it for you. -==== - -[IMPORTANT] -==== -You do not need to assign a type to the following keys in `goodreads_books.json`: `series`, `popular_shelves`, `similar_books`, and `authors`. -==== - -[TIP] -==== -- Assume `isbn`, `asin`, `kindle_asin`, `isbn13` columns _could_ start with a leading 0. -- Assume any column ending in `_id` could _not_ start with a leading 0. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -[WARNING] -==== -Please include the `CREATE TABLE` statements in code cells for this question, but realize that you will have to pop open a terminal and launch `sqlite3` to complete this problem. - -To do so run the following in the new terminal. - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load sqlite/3.39.2 - -sqlite3 my.db # this will create an empty database ----- - -You will then be inside a `sqlite3` session and able to run `sqlite`-specific dot functions (which you can see after running `.help`), or SQL queries. -==== - -For now, let's ignore the "problematic" columns in the `goodreads_books.json` dataset (`series`, `popular_shelves`, `similar_books`, and `authors`). 
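[NOTE]
====
If you want to sanity-check the storage classes you chose in the previous question, one quick, purely optional approach is to peek at how Python parses a single record. The sketch below uses the authors file from the dataset list above; it is illustrative only and not required for this question.

[source,python]
----
import json

# Peek at one record from the authors file: print each key next to its raw
# value so you can eyeball what the data looks like (note that many values
# arrive as quoted strings in the raw JSON even when they look numeric).
with open("/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json") as f:
    record = json.loads(f.readline())

for key, value in record.items():
    print(f"{key}: {value!r}")
----
====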
- -Translate the work you did in the previous question to 4 `CREATE TABLE` statements that will be used to create your `sqlite3` database tables. Check out some examples https://www.sqlitetutorial.net/sqlite-create-table/[here]. For now, let's keep it straightforward -- ignore primary and foreign keys, and just focus on building the 4 tables with the correct types. Similarly, don't worry about any of the restrictions like `NOT NULL` or `UNIQUE`. Name your tables: `reviews`, `books`, `series`, and `authors`. - -Once you've created your `CREATE TABLE` statements, create a database called `my.db` in your `$HOME` directory -- so `$HOME/my.db`. Run your `CREATE TABLE` statements, and, in your notebook, verify the database has been created properly by running the following. - -[source,ipython] ----- -%sql sqlite:////home/x-kamstut/my.db # change x-kamstut to your username ----- - -[source,ipython] ----- -%%sql - -SELECT sql FROM sqlite_master WHERE name='reviews'; ----- - -[source,ipython] ----- -%%sql - -SELECT sql FROM sqlite_master WHERE name='books'; ----- - -[source,ipython] ----- -%%sql - -SELECT sql FROM sqlite_master WHERE name='series'; ----- - -[source,ipython] ----- -%%sql - -SELECT sql FROM sqlite_master WHERE name='authors'; ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project03.adoc deleted file mode 100644 index 3c2908bb7..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project03.adoc +++ /dev/null @@ -1,356 +0,0 @@ -= TDM 40100: Project 3 -- 2022 - -**Motivation:** The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a `sqlite3` database, design a schema, insert data, create indexes, etc. This series of projects is focused around SQL, `sqlite3`, with the opportunity to use other skills you've built throughout the previous years. - -**Context:** In TDM 20100 (formerly STAT 29000), you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with `sqlite3` -- a powerful database engine. In this project (and following projects), we will branch into SQL and `sqlite3`-specific topics and techniques that you haven't yet had exposure to in The Data Mine. - -**Scope:** `sqlite3`, lmod, SQL - -.Learning Objectives -**** -- Create your own `sqlite3` database file. -- Analyze a large dataset and formulate `CREATE TABLE` statements designed to store the data. -- Run one or more queries to test out the end result. -- Demonstrate the ability to normalize a series of database tables. -- Wrangle and insert data into database. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_book_series.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_books.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json` - -== Questions - -In case you skipped the previous project, let's all get on the same page. You run the following code in a Jupyter notebook to create a `sqlite3` database called `my.db` in your `$HOME` directory. - -[source,ipython] ----- -%%bash - -rm $HOME/my.db -sqlite3 $HOME/my.db "CREATE TABLE reviews ( - user_id TEXT, - book_id INTEGER, - review_id TEXT, - rating INTEGER, - review_text TEXT, - date_added TEXT, - date_updated TEXT, - read_at TEXT, - started_at TEXT, - n_votes INTEGER, - n_comments INTEGER -);" - -sqlite3 $HOME/my.db "CREATE TABLE books ( - isbn TEXT, - text_reviews_count INTEGER, - country_code TEXT, - language_code TEXT, - asin TEXT, - is_ebook INTEGER, - average_rating REAL, - kindle_asin TEXT, - description TEXT, - format TEXT, - link TEXT, - publisher TEXT, - num_pages INTEGER, - publication_day INTEGER, - isbn13 TEXT, - publication_month INTEGER, - edition_information TEXT, - publication_year INTEGER, - url TEXT, - image_url TEXT, - book_id TEXT, - ratings_count INTEGER, - work_id TEXT, - title TEXT, - title_without_series TEXT -);" - -sqlite3 $HOME/my.db "CREATE TABLE authors ( - average_rating REAL, - author_id INTEGER, - text_reviews_count INTEGER, - name TEXT, - ratings_count INTEGER -);" - -sqlite3 $HOME/my.db "CREATE TABLE series ( - numbered INTEGER, - note TEXT, - description TEXT, - title TEXT, - series_works_count INTEGER, - series_id INTEGER, - primary_work_count INTEGER -);" ----- - -[source,ipython] ----- -%sql sqlite:////home/x-myalias/my.db ----- - -[source,ipython] ----- -%%sql - -SELECT * FROM reviews limit 5; ----- - -[source,ipython] ----- -%%sql - -SELECT * FROM books limit 5; ----- - -[source,ipython] ----- -%%sql - -SELECT * FROM authors limit 5; ----- - -[source,ipython] ----- -%%sql - -SELECT * FROM series limit 5; ----- - -[source,ipython] ----- -%%bash - -rm -rf $HOME/goodreads_samples -mkdir $HOME/goodreads_samples -cp /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json $HOME/goodreads_samples/ -cp /anvil/projects/tdm/data/goodreads/goodreads_book_series.json $HOME/goodreads_samples/ -shuf -n 27450 /anvil/projects/tdm/data/goodreads/goodreads_books.json > $HOME/goodreads_samples/goodreads_books.json -shuf -n 98375 /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json > $HOME/goodreads_samples/goodreads_reviews_dedup.json ----- - -=== Question 1 - -Update your original `CREATE TABLE` statement for the `books` table to include a field that will be used to store the actual book cover images from the `image_url` field in the `books` table. Call this new field `book_cover`. Which one of the `sqlite` types did you use? - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Check out a line of the `goodreads_books.json` data: - -[source,ipython] ----- -%%bash - -head -n 1 $HOME/goodreads_samples/goodreads_books.json ----- - -[IMPORTANT] -==== -Don't have a `goddreads_samples` directory? Make sure you run the following. 
- -[source,ipython] ----- -%%bash - -rm -rf $HOME/goodreads_samples -mkdir $HOME/goodreads_samples -cp /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json $HOME/goodreads_samples/ -cp /anvil/projects/tdm/data/goodreads/goodreads_book_series.json $HOME/goodreads_samples/ -shuf -n 27450 /anvil/projects/tdm/data/goodreads/goodreads_books.json > $HOME/goodreads_samples/goodreads_books.json -shuf -n 98375 /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json > $HOME/goodreads_samples/goodreads_reviews_dedup.json ----- -==== - -Recall that in the previous project, we just ignored the following fields from the `books` table: `series`, `similar_books`, `popular_shelves`, and `authors`. We did this because those fields are more complicated to deal with. - -Read https://docs.microsoft.com/en-us/office/troubleshoot/access/database-normalization-description[this] article on database normalization from Microsoft. We are going to do our best to _normalize_ our tables with these previously ignored fields taken into consideration. - -Let's start by setting some practical naming conventions. Note that these are not critical by any stretch, but can help remove some guess work when navigating a database with many tables and ids. - -. Every table's primary key should be named `id`, unless it is a composite key. For example, instead of `book_id` in the `books` table, it would make sense to call that column `id` -- "book" is implied from the table name. -. Every table's foreign key should reference the `id` column of the foreign table and be named "foreign_table_name_id". For example, if we had a foreign key in the `books` table that referenced an author in the `authors` table, we should name that column `author_id`. -. Keep table names plural, when possible -- for example, not the `book` table, but the `books` table. -. Link tables or junction tables should be named by the two tables which you are trying to represent the many-to-many relationship for. (We will go over this one specifically when needed, no worries) - -Make the appropriate changes to the following `CREATE TABLE` statements that reflect these conventions as much as possible (for now). - -[source,ipython] ----- -%%bash - -rm $HOME/my.db -sqlite3 $HOME/my.db "CREATE TABLE reviews ( - user_id TEXT, - book_id INTEGER, - review_id TEXT, - rating INTEGER, - review_text TEXT, - date_added TEXT, - date_updated TEXT, - read_at TEXT, - started_at TEXT, - n_votes INTEGER, - n_comments INTEGER -);" - -sqlite3 $HOME/my.db "CREATE TABLE books ( - isbn TEXT, - text_reviews_count INTEGER, - country_code TEXT, - language_code TEXT, - asin TEXT, - is_ebook INTEGER, - average_rating REAL, - kindle_asin TEXT, - description TEXT, - format TEXT, - link TEXT, - publisher TEXT, - num_pages INTEGER, - publication_day INTEGER, - isbn13 TEXT, - publication_month INTEGER, - edition_information TEXT, - publication_year INTEGER, - url TEXT, - image_url TEXT, - book_id TEXT, - ratings_count INTEGER, - work_id TEXT, - title TEXT, - title_without_series TEXT -);" - -sqlite3 $HOME/my.db "CREATE TABLE authors ( - average_rating REAL, - author_id INTEGER, - text_reviews_count INTEGER, - name TEXT, - ratings_count INTEGER -);" - -sqlite3 $HOME/my.db "CREATE TABLE series ( - numbered INTEGER, - note TEXT, - description TEXT, - title TEXT, - series_works_count INTEGER, - series_id INTEGER, - primary_work_count INTEGER -);" ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 3 - -A book can have many authors, and an author can have many books. This is an example of a many-to-many relationship. - -We already have a `books` table and an `authors` table. Create a _junction_ or _link_ table that effectively _normalizes_ the `authors` **field** in the `books` table. Call this new table `books_authors` (see point 4 above -- this is the naming convention we want). - -Make sure to include your `CREATE TABLE` statement in your notebook. - -[TIP] -==== -There should be 4 columns in the `authors_books` table. A primary key field, two foreign key fields, and a regular data field that is a part of the original `authors` field data in the `books` table. -==== - -[IMPORTANT] -==== -Make sure to properly apply the https://www.sqlitetutorial.net/sqlite-primary-key/[primary key] and https://www.sqlitetutorial.net/sqlite-foreign-key/[foreign key] keywords. -==== - -Write a SQL query to find every book by author with id 12345. It doesn't have to be perfect syntax, as long as the logic is correct. In addition, it won't be runnable, that is okay. - -[TIP] -==== -You will need to use _joins_ and our junction table to perform this query. -==== - -Copy, paste, and update your `bash` cell with the `CREATE TABLE` statements to implement these changes. In a markdown cell, write out your SQL query. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Assume that a series can have many books and a book can be a part of many series. Perform the same operations as the previous problem (except for the query). - -What columns does the `books_series` table have? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -The remaining two fields that need to be dealt with are `similar_books` and `popular_shelves`. Choose _at least_ one of the two and do your best to come up with a good solution for the way we store the data. We will give hints for both below. - -For this question, please copy, paste, and update the `bash` cell with the `CREATE TABLE` statements. In addition, please include a markdown cell with a detailed explanation of _why_ you chose your solution, and provide at least 1 example of a query that _should_ work for your solution (like before, we are looking for logic, not syntax). - -**similar_books:** - -[TIP] -==== -It is okay to have a link table that links rows from the same table! -==== - -[TIP] -==== -There are always many ways to do the same thing. In our examples, we used link tables with their own `id` (primary key) in addition to multiple foreign keys. This provides the flexibility of later being able to add more fields to the link table, where it may even become useful all by itself. - -There is, however, a _technically_ better solution for a table that is simply a link table and nothing more. This would be where you have 2 columns, both foreign keys, and you create a _composite_ primary key, or a primary key that is represented by the unique combination of both foreign keys. This ensures that links are only ever represented once. Feel free to experiment with this if you want! -==== - -**popular_shelves:** - -[TIP] -==== -You can create as many tables as you need. -==== - -[TIP] -==== -After a bit of thinking, this one may not be too different than what you've already accomplished. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project04.adoc deleted file mode 100644 index 36d5d6ad8..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project04.adoc +++ /dev/null @@ -1,195 +0,0 @@ -= TDM 40100: Project 4 -- 2022 - -**Motivation:** The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a `sqlite3` database, design a schema, insert data, create indexes, etc. This series of projects is focused around SQL, `sqlite3`, with the opportunity to use other skills you've built throughout the previous years. - -**Context:** In TDM 20100 (formerly STAT 29000), you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with `sqlite3` -- a powerful database engine. In this project (and following projects), we will branch into SQL and `sqlite3`-specific topics and techniques that you haven't yet had exposure to in The Data Mine. - -**Scope:** `sqlite3`, lmod, SQL - -.Learning Objectives -**** -- Create your own sqlite3 database file. -- Analyze a large dataset and formulate CREATE TABLE statements designed to store the data. -- Run one or more queries to test out the end result. -- Demonstrate the ability to normalize a series of database tables. -- Wrangle and insert data into database. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_book_series.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_books.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json` - -== Questions - -This project is going to be a bit more open. The goal of this project is to take our dataset sample, and write Python code to insert it into our `sqlite3` database. There are a variety of ways this could be accomplished, and we will accept anything that works, with a few constraints. - -In the next project, we will run some experiments that will time insertion, project the time it would take to insert all of the data, adjust database settings, and, ultimately, create a final product that we can feel good about. - -=== Question 1 - -As mentioned earlier -- the goal of this project is to insert the sample data into our database. Start by generating the sample data. 
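-
-After you run the generation cell below, it is worth a quick sanity check that the sample files ended up where you expect. One way to do that (a minimal sketch -- it just counts lines in the `$HOME/goodreads_samples` directory created below) is the following.
-
-[source,python]
-----
-from pathlib import Path
-
-# count the lines (records) in each sampled .json file
-for path in sorted((Path.home() / "goodreads_samples").glob("*.json")):
-    with open(path) as f:
-        print(path.name, sum(1 for _ in f))
-----
-
-The two `shuf`-generated samples below should come out to 27450 and 98375 lines, respectively.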
- -[source,ipython] ----- -%%bash - -rm -rf $HOME/goodreads_samples -mkdir $HOME/goodreads_samples -cp /anvil/projects/tdm/data/goodreads/goodreads_book_authors.json $HOME/goodreads_samples/ -cp /anvil/projects/tdm/data/goodreads/goodreads_book_series.json $HOME/goodreads_samples/ -shuf -n 27450 /anvil/projects/tdm/data/goodreads/goodreads_books.json > $HOME/goodreads_samples/goodreads_books.json -shuf -n 98375 /anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json > $HOME/goodreads_samples/goodreads_reviews_dedup.json ----- - -In addition, go ahead and copy our empty database that is ready for you to insert data into. - -[source,ipython] ----- -%%bash - -rm $HOME/my.db -cp /anvil/projects/tdm/data/goodreads/my.db $HOME ----- - -You can run this as many times as you need in order to get a fresh start. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Write Python code that inserts the data into your database. Here are the constraints. - -. You should be able to fully recover the `book_cover` image from the database. This means you'll need to handle scraping the `image_url` and converting the image to `bytes` before inserting into the database. -+ -[TIP] -==== -Want some help to write the scraping code? Check out https://the-examples-book.com/projects/fall2022/30100-2022-project04#question-2[this 30100 question] for more guidance. -==== -+ -. Your functions and code should ultimately operate on a single _row_ of the datasets. For instance: -+ -[NOTE] -==== -[source,python] ----- -import json - -with open("/anvil/projects/tdm/data/goodreads/goodreads_books.json") as f: - for line in f: - print(line) - parsed = json.loads(line) - print(f"{parsed['isbn']=}") - print(f"{parsed['num_pages']=}") - break ----- -==== -+ -Here, you can see that we can take a single row and do _something_ to it. Why do we want it to work this way? This makes it easy to break our dataset into chunks and perform operations in parallel, if we so choose (and we will, but not in this project). - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Demonstrate your database works by doing the following. - -. Fully recover a `book_cover` and display it in your notebook. -+ -[NOTE] -==== -[source,ipython] ----- -%%bash - -rm $HOME/test.db || true -sqlite3 $HOME/test.db "CREATE TABLE test ( - id INTEGER PRIMARY KEY AUTOINCREMENT, - my_blob BLOB -);" ----- - -[source,python] ----- -import shutil -import requests -import os -import uuid -import sqlite3 - -url = 'https://images.gr-assets.com/books/1310220028m/5333265.jpg' -my_bytes = scrape_image_from_url(url) - -# insert -conn = sqlite3.connect('/home/x-kamstut/test.db') -cursor = conn.cursor() -query = f"INSERT INTO test (my_blob) VALUES (?);" -dat = (my_bytes,) -cursor.execute(query, dat) -conn.commit() -cursor.close() - -# retrieve -conn = sqlite3.connect('/home/x-kamstut/test.db') -cursor = conn.cursor() - -query = f"SELECT * from test where id = ?;" -cursor.execute(query, (1,)) -record = cursor.fetchall() -img = record[0][1] -tmp_filename = str(uuid.uuid4()) -with open(f"{tmp_filename}.jpg", 'wb') as file: - file.write(img) - -from IPython import display -display.Image(f"{tmp_filename}.jpg") ----- -==== -+ -. Run a simple query to `SELECT` the first 5 rows of each table. 
-+ -[NOTE] -==== -[source,ipython] ----- -%sql sqlite:////home/my-username/my.db ----- - -[source,ipython] ----- -%%sql - -SELECT * FROM tablename LIMIT 5; ----- -==== -+ -[IMPORTANT] -==== -Make sure to replace "my-username" with your Anvil username, for example, x-kamstut is mine. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project05.adoc deleted file mode 100644 index a2517c1ec..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project05.adoc +++ /dev/null @@ -1,316 +0,0 @@ -= TDM 40100: Project 5 -- 2022 - -**Motivation:** Sometimes taking the time to simply do some experiments and benchmark things can be fun and beneficial. In this project, we are going to run some tests to see how various methods we try impact insertion performance with `sqlite`. - -**Context:** This is the next project in our "deepish" dive into `sqlite3`. Hint, its not really a deep dive, but its deeper than what we've covered before! https://fly.io has been doing a blog series on a truly deep dive: https://fly.io/blog/sqlite-internals-btree/, https://fly.io/blog/sqlite-internals-rollback-journal/, https://fly.io/blog/sqlite-internals-wal/, https://fly.io/blog/sqlite-virtual-machine/. - -**Scope:** `sqlite3`, lmod, SQL - -.Learning Objectives -**** -- Learn about some of the constraints `sqlite3` has. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_book_series.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_books.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json` - -== Questions - -=== Question 1 - -The data we want to include in our `sqlite3` database is in need of wrangling prior to insertion. It is a fairly sizeable dataset -- let's start by creating our sample dataset so we can use it to estimate the amount of time it will take to create the full database. - -[source,python] ----- -from pathlib import Path - -def split_json_to_n_parts(path_to_json: str, number_files: int, output_dir: str) -> None: - """ - Given a str representing the absolute path to a `.json` file, - `split_json` will split it into `number_files` `.json` files of equal size. - - Args: - path_to_json: The absolute path to the `.json` file. - number_files: The number of files to split the `.json` file into. - output_dir: The absolute path to the directory where the split `.json` - files are to be output. - - Returns: - Nothing. 
- - Examples: - - This is the second test - >>> test_json = '/anvil/projects/tdm/data/goodreads/test.json' - >>> output_dir = f'{os.getenv("SCRATCH")}/p5testoutput' - >>> os.mkdir(output_dir) - >>> number_files = 2 - >>> split_json_to_n_parts(test_json, number_files, output_dir) - >>> output_dir = Path(output_dir) - >>> number_output_files = sum(1 for _ in output_dir.glob("*.json")) - >>> shutil.rmtree(output_dir) - >>> number_output_files - 2 - """ - path_to_json = Path(path_to_json) - num_lines = sum(1 for _ in open(path_to_json)) - group_amount = num_lines//number_files + 1 - with open(path_to_json, 'r') as f: - part_number = 0 - writer = None - for idx, line in enumerate(f): - if idx % group_amount == 0: - if writer: - writer.close() - - writer = open(str(Path(output_dir) / f'{path_to_json.stem}_{part_number}.json'), 'w') - part_number += 1 - - writer.write(line) ----- - -[source,python] ----- -output_dir = f'{os.getenv("HOME")}/goodreads_samples' -shutil.rmtree(output_dir) -os.mkdir(output_dir) -number_files = 1 -split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_books.json', number_files, output_dir) -split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_book_authors.json', number_files, output_dir) -split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_book_series.json', number_files, output_dir) -split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_reviews_dedup.json', number_files, output_dir) ----- - -Create the empty database. - -[source,ipython] ----- -%%bash - -rm $HOME/my.db || true -sqlite3 $HOME/my.db "CREATE TABLE reviews ( - id TEXT PRIMARY KEY, - user_id TEXT, - book_id INTEGER, - rating INTEGER, - review_text TEXT, - date_added TEXT, - date_updated TEXT, - read_at TEXT, - started_at TEXT, - n_votes INTEGER, - n_comments INTEGER -);" - -sqlite3 $HOME/my.db "CREATE TABLE books ( - id INTEGER PRIMARY KEY, - isbn TEXT, - text_reviews_count INTEGER, - country_code TEXT, - language_code TEXT, - asin TEXT, - is_ebook INTEGER, - average_rating REAL, - kindle_asin TEXT, - description TEXT, - format TEXT, - link TEXT, - publisher TEXT, - num_pages INTEGER, - publication_day INTEGER, - isbn13 TEXT, - publication_month INTEGER, - edition_information TEXT, - publication_year INTEGER, - url TEXT, - image_url TEXT, - ratings_count INTEGER, - work_id TEXT, - title TEXT, - title_without_series TEXT -);" - -sqlite3 $HOME/my.db "CREATE TABLE authors_books ( - id INTEGER PRIMARY KEY, - author_id INTEGER, - book_id INTEGER, - role TEXT, - FOREIGN KEY (author_id) REFERENCES authors (id), - FOREIGN KEY (book_id) REFERENCES books (id) -);" - -sqlite3 $HOME/my.db "CREATE TABLE books_series ( - id INTEGER PRIMARY KEY, - book_id INTEGER, - series_id INTEGER, - FOREIGN KEY (book_id) REFERENCES books (id), - FOREIGN KEY (series_id) REFERENCES series (id) -);" - -sqlite3 $HOME/my.db "CREATE TABLE authors ( - id INTEGER PRIMARY KEY, - average_rating REAL, - text_reviews_count INTEGER, - name TEXT, - ratings_count INTEGER -);" - -sqlite3 $HOME/my.db "CREATE TABLE shelves ( - id INTEGER PRIMARY KEY, - name TEXT -);" - -sqlite3 $HOME/my.db "CREATE TABLE books_shelves ( - id INTEGER PRIMARY KEY, - shelf_id INTEGER, - book_id INTEGER, - count INTEGER, - FOREIGN KEY (shelf_id) REFERENCES shelves (id), - FOREIGN KEY (book_id) REFERENCES books (id) -);" - -sqlite3 $HOME/my.db "CREATE TABLE series ( - id INTEGER PRIMARY KEY, - numbered INTEGER, - note TEXT, - description 
TEXT, - title TEXT, - series_works_count INTEGER, - primary_work_count INTEGER -);" ----- - -Check out `/anvil/projects/tdm/data/goodreads/gr_insert.py`. Use the unix `time` function to execute the script and determine how long it took to run. Estimate the amount of time it will would take to insert the full dataset. To run the script in a bash cell, you would do something like. - -[source,ipython] ----- -%%bash - -python3 /anvil/projects/tdm/data/goodreads/gr_insert.py 0 ----- - -Where the single argument indicates which files to read in. In this first example, it will process all files ending in `_0`. When we further split the data into parts, this will help use point the script at certain subsets of the data. - -[IMPORTANT] -==== -To keep things simplified, we are going to skip a few things that take more time. Mainly, scraping the images, and the authors_books, books_shelves, and books_series tables. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Typically, one way to speed things up is to throw more processing power at it. Let's use 2 processes instead of 1 to insert our data. Start with a fresh (empty) database, and reinsert your data but use `joblib` to use 2 processors. What happened? - -[TIP] -==== -Copy `gr_insert.py` into the same directory as your notebook. Then, the following imports will work. - -[source,python] ----- -from gr_insert import insert_all -import joblib -from joblib import Parallel -from joblib import delayed ----- -==== - -[TIP] -==== -https://joblib.readthedocs.io/en/latest/parallel.html[This] example should help. -==== - -[TIP] -==== -To get started, split your data into parts as follows. - -[source,python] ----- -output_dir = f'{os.getenv("HOME")}/goodreads_samples' -shutil.rmtree(output_dir) -os.mkdir(output_dir) -number_files = 2 -split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_books.json', number_files, output_dir) -split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_book_authors.json', number_files, output_dir) -split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_book_series.json', number_files, output_dir) -split_json_to_n_parts(f'/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_reviews_dedup.json', number_files, output_dir) ----- -==== - -[TIP] -==== -You should get an error talking about something being "locked". -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -`sqlite3`, by default, can only have 1 writer at a time. So even though we have two processes trying to insert data, `sqlite3` can't keep going. In our case, one of the processes got a "database locked" issue. That's a huge bummer, but at least we can run queries while data is being inserted, right? Let's give it a try. - -Start with a fresh database. Run the following command in a bash cell. This will spawn two processes that will try to connect to the database at the same time. The first process will be inserting data (like before). The second process will try to continually make silly `SELECT` queries for 1 minute. - -[source,ipython] ----- -%%bash - -python3 gr_insert.py 0 & -python3 gr_insert.py 0 read & ----- - -What happens? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-====
-
-=== Question 4
-
-As you may have figured out, no, by default you cannot both read and write data to an `sqlite3` database concurrently. However, this is possible by activating the write ahead log (WAL), cool!
-
-Start with a fresh database again, figure out _how_ to activate the WAL, activate it, and repeat question 3. Does it work now?
-
-This is a pretty big deal, and makes `sqlite3` an excellent choice for any database that doesn't need to have fantastic, concurrent write performance. Things like blogs and other small data systems could easily be backed by `sqlite3`, no problem! It also means that if you have an application that is creating a lot of data very rapidly, it is possibly _not_ the best choice.
-
-The WAL is an actual file. Find the file -- what is it named?
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Read the 4 articles provided in the **context** at the top of the project. Write a short paragraph about what you learned. What was the thing you found most interesting? If you are interested, feel free to try and replicate some of the examples they demonstrate.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project06.adoc
deleted file mode 100644
index 944a5508c..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project06.adoc
+++ /dev/null
@@ -1,326 +0,0 @@
-= TDM 40100: Project 6 -- 2022
-
-== Looking sharp for fall break?
-
-**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this next series of projects.
-
-**Context:** We are about to dive straight into a series of projects that emphasize working with images (with other fun things mixed in). We will start out with a straightforward task, with testable results.
-
-**Scope:** Python, images
-
-.Learning Objectives
-****
-- Use `numpy` and `skimage` to process images.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/images/apple.jpg`
-
-== Questions
-
-=== Question 1
-
-We are going to use scikit-image to load up our image.
-
-[source,python]
-----
-from skimage import io, color
-from matplotlib.pyplot import imshow
-import numpy as np
-import hashlib
-from typing import Callable
-
-img = io.imread("/anvil/projects/tdm/data/images/apple.jpg")
-imshow(img)
-----
-
-If you take a look at `img`, you will find it is simply a `numpy.ndarray`, currently represented as a 100x100x3 array. The first matrix represents the pixels' red values, the second the pixels' green values, and the third the blue values.
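-
-If you want to confirm this yourself, a quick peek at the array is enough (a minimal sketch, assuming `img` was loaded with the snippet above).
-
-[source,python]
-----
-# img is just a numpy array: height x width x 3 (red, green, blue)
-print(type(img))  # <class 'numpy.ndarray'>
-print(img.shape)  # (100, 100, 3)
-print(img.dtype)  # typically uint8, with values between 0 and 255
-----
-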
For example, we could make our apple greener as follows. - -[source,python] ----- -test = img.copy() -test = test + [0,100,0] -imshow(test) ----- - -In this project, we are going to sharpen this image using a technique called unsharp masking. In order to do this, we will need to first change our image representation from RGB (red, gree, blue) to LAB. The L in LAB stands for perceptual lightness. The a and b are used to represent the 4 unique colors of human vision: red, green, blue, and yellow. If we don't convert to this representation, our sharpening can distort the colors of our image, which is _not_ what we want. By using LAB, we can apply our transformation to _just_ the lightness (the L), and our colors won't appear distorted or changed. - -The following is an example of converting to LAB. - -[source,python] ----- -img = color.rgb2lab(img) ----- - -To convert back, you can use the following. - -[source,python] ----- -img = color.lab2rgb(img)*255 ----- - -The reason for the 255 is because during the conversions the values are rescaled to between 0 and 1, and we will want that to be between 0 and 255 so we can export properly later on. - -The first step in creating our unsharp mask, and the task for question (1), is to create a _filter_ for our image. The _filter_ should be represented as a function, `my_filter`. `my_filter` should accept a single argument, `img`, which is a numpy ndarray, similar to `img` in our first provided snippet of code. `my_filter` should return a numpy ndarray that has been processed. - -[source,python] ----- -def my_filter(img: np.ndarray) -> np.ndarray: - """ - Given an ndarray representation of an image, - apply a median blur to the image. - """ - pass ----- - -Implement a _median_ filter, that takes a target pixel, gets all of the immediate neighboring pixels, and sets the target pixel to the median value of the neighbors, and itself (total of 9 pixels). Make sure that the pixels you are getting the median of are the _original_ pixels, not the already filtered pixels. To simplify things, completely ignore and copy over the border pixels so we don't have to worry about all of those edge cases. The sharpened image will have a 1 pixel border that is equivalent to the original. - -[TIP] -==== -The following is an example of finding a median of multiple pixels. - -[source,python] ----- -img = io.imread('/anvil/projects/tdm/data/images/apple.jpg') -np.median(np.stack((img[0, 0, :], img[0, 50, :])), axis=0) ----- -==== - -[IMPORTANT] -==== -. This is the most difficult question for this project, the rest should be quicker. -. It may take a minute or so to run for images larger than our `apple.jpg` -- we aren't using any special prebuilt functions or optimizations, so it is a lot of looping. -==== - -[TIP] -==== -To verify your filter is working properly, you can run the following code and make sure the hash is the same. 
- -[source,python] ----- -img = io.imread("/anvil/projects/tdm/data/images/apple.jpg") -img = color.rgb2lab(img) -filtered = my_filter(img) -filtered = color.lab2rgb(filtered) -filtered = (filtered*255).astype('uint8') -io.imsave("filtered.jpg", filtered) -with open("filtered.jpg", "rb") as f: - my_bytes = f.read() - -m = hashlib.sha256() -m.update(my_bytes) -m.hexdigest() ----- - -.output ----- -9a5d9f62d52bcb96ea68a86dc1e3a6ae3a9715ff86476c4ccec3b11e4e7dde8e ----- - -To see the blur: - -[source,python] ----- -imshow(filtered) ----- - -To see the blurred image normal scaled: - -[source,python] ----- -from IPython import display -display.Image("filtered.jpg") ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -The next step in the process is to create a _mask_. To create the mask, write a function called `create_mask` that accepts the image, a filter function (`my_filter` from question (1)), and `strength`. `create_mask` should return the image (as an ndarray). - -[source,python] ----- -def create_mask(img: np.ndarray, filt: Callable, strength: float = 0.8) -> np.ndarray: - """ - Given the original image, a filter function, - and a strength value. Return a mask. - """ - pass ----- - -The _mask_ is simple. Take the given image, apply the filter to the image, and subtract the resulting image from the original. Take that result, and multiple by `strength`. `strength` is a value typically between .2 and 2 that effects how strongly to sharpen the image. - -[TIP] -==== -Test to make sure your result is correct by running the following. - -[source,python] ----- -img = io.imread("/anvil/projects/tdm/data/images/apple.jpg") -img = color.rgb2lab(img) -mask = create_mask(img, my_filter, 2) -mask = color.lab2rgb(mask) -mask = (mask*255).astype('uint8') -io.imsave("mask.jpg", mask) -with open("mask.jpg", "rb") as f: - my_bytes = f.read() - -m = hashlib.sha256() -m.update(my_bytes) -m.hexdigest() ----- - -.output ----- -e6cd9badbcb779615834e734d65730e42ded4db2030e0377d5c85ea6399d191a ----- - -Take a look at the mask itself! This will help you understand _what_ the mask actually is. - -[source,python] ----- -imshow(mask) ----- - -To see the properly scaled mask: - -[source,python] ----- -from IPython import display -display.Image("mask.jpg") ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -The final step is to _apply_ your mask to the original image! Write a function called `unsharp` that accepts an image (as an ndarry, like usual), and a `strength` and applies the algorithm! - -[source,python] ----- -def unsharp(img: np.ndarray, strength: float = 0.8) -> np.ndarray: - """ - Given the original image, and a strength value, - return the sharpened image in numeric format. - """ - - def _create_mask(img: np.ndarray, filt: Callable, strength: float = 0.8) -> np.ndarray: - """ - Given the original image, a filter function, - and a strength value. Return a mask. - """ - return (img - filt(img))*strength - - - def _filter(img: np.ndarray) -> np.ndarray: - """ - Given an ndarray representation of an image, - apply a median blur to the image. - """ ----- - -How do you apply the full algorithm? - -. Create the mask using the `create_mask` function. -. Add the result to the numeric form of the original image. - -That is pretty straightforward! Of course, you'll need to convert back to RGB before exporting, like normal, but it really isn't that bad! 
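-
-In other words, the body of `unsharp` is little more than gluing the earlier pieces together. A minimal sketch (assuming `my_filter` and `create_mask` are your working functions from questions (1) and (2)) might look like the following.
-
-[source,python]
-----
-import numpy as np
-
-def unsharp(img: np.ndarray, strength: float = 0.8) -> np.ndarray:
-    # build the mask from the (LAB-converted) image, then add it back on
-    mask = create_mask(img, my_filter, strength)
-    return img + mask
-----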
- -[TIP] -==== -You can verify things are working as follows. - -[source,python] ----- -img = io.imread("/anvil/projects/tdm/data/images/apple.jpg") -sharpened = color.rgb2lab(img) -sharpened = unsharp(sharpened, 2) -sharpened = color.lab2rgb(sharpened) -sharpened = (sharpened*255).astype('uint8') -io.imsave("sharpened.jpg", sharpened) -with open("sharpened.jpg", "rb") as f: - my_bytes = f.read() - -m = hashlib.sha256() -m.update(my_bytes) -m.hexdigest() ----- - -.output ----- -e6cd9badbcb779615834e734d65730e42ded4db2030e0377d5c85ea6399d191a ----- - -You can test to see what the sharpened image looks like as follows. - -[source,python] ----- -imshow(sharpened) ----- - -Or the normally scaled image: - -[source,python] ----- -from IPython import display -display.Image("sharpened.jpg") ----- -==== - -[NOTE] -==== -There are quite a few ways you could change this algorithm to get better or slightly different results. -==== - -[NOTE] -==== -There is quite a bit of magic that happens during the `color.lab2rgb` conversion. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Find another image (it could be anything) and use your function to sharpen it. Mess with the strength parameter to see how it effects things. Show at least 1 before and after image. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 (optional) - -Instead of using the median blur effect, you could use a different filter, like a Gaussian blur. If you Google a bit, you will find that there are premade (and probably much faster) functions to perform a Gaussian blur. Use the Gaussian blur in place of the median blur, and perform the unsharp mask. Are the results better or worse in your opinion? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project07.adoc deleted file mode 100644 index f6b088b29..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project07.adoc +++ /dev/null @@ -1,206 +0,0 @@ -= TDM 40100: Project 7 -- 2022 -:page-mathjax: true - -**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this series of projects. - -**Context:** In the previous project, we learned to sharpen images using unsharp masking. In this project, we will perform edge detection using Sobel filtering. - -**Scope:** Python, images, JAX - -.Learning Objectives -**** -- Process images using `numpy`, `skimage`, and `JAX`. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/images/apple.jpg` -- `/anvil/projects/tdm/data/images/drward.jpg` - -== Questions - -=== Question 1 - -Let's, once again, work with our `apple.jpg` image. In the previous project, we sharpened the image using unsharp masking. In this project, we are going to try to detect edges using a Sobel filter. The first step in this process is to convert our image from color to greyscale. - -There are a few ways to do this, we will use the luminosity method. Create a function called `to_greyscale` that accepts the image in numeric numpy ndarray form, and returns the modified image in the numeric numpy ndarray form. - -[NOTE] -==== -The luminosity method of conversion takes into consideration that our eyes don't react to each color the same way. You can read about some of the other methods https://www.baeldung.com/cs/convert-rgb-to-grayscale[here]. -==== - -$gray = \frac{(0.2989*R + 0.5870*G + 0.1140*B)}{255}$ - -Confirm your function works. - -[source,python] ----- -img = io.imread("/anvil/projects/tdm/data/images/apple.jpg") -img = to_greyscale(img) -io.imsave("grey.jpg", img) -with open("grey.jpg", "rb") as f: - my_bytes = f.read() - -m = hashlib.sha256() -m.update(my_bytes) -m.hexdigest() ----- - -.output ----- -d3aac435526a98d5d8665c558a96b834b63e5f17531b6e197b14d3b527406970 ----- - -To display the greyscale image using `imshow`, you must include the `cmap="gray"` option. - -[source,python] ----- -imshow(img, cmap="gray") ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -The big picture with edge detection is detecting sudden changes in pixel intensities. A natural way to find changes is by using gradients/derivatives! That could be time consuming, hence the genius of the Sobel filter. - -Write a function called `estimate_gradients` that uses the https://en.wikipedia.org/wiki/Sobel_operator[Sobel filter] to estimate the gradients, `gx` and `gy`. `gx` is the gradient in the x direction and `gy` is the gradient in the y direction. `estimate_gradients` should accept the image and return both `gx` and `gy`. - -To calculate the estimated gradients, you must take a pixel and its eight neighbors, multiply them by a 3x3 "kernel", and sum the results. In a lot of ways, this is very similar to what you did manually in the previous project. However, this operation is much more popular -- so popular, it has a name -- https://en.wikipedia.org/wiki/Kernel_(image_processing)#Convolution[_convolution_]. - -Read https://en.wikipedia.org/wiki/Sobel_operator[the Sobel operator] wikipedia page, and look at the provided kernels used to calculate the gradient estimates. Use https://jax.readthedocs.io/en/latest/_autosummary/jax.scipy.signal.convolve.html#jax-scipy-signal-convolve[this] function to calculate and return both `gx` and `gy`. - -[TIP] -==== -You will want your resulting image to be the same dimesion as _before_ the convolve function. -==== - -[TIP] -==== -You can verify your output. 
- -[source,python] ----- -img = io.imread("/anvil/projects/tdm/data/images/apple.jpg") -img = to_greyscale(img) -gx, gy = estimate_gradients(img) -io.imsave("gx.jpg", gx) -with open("gx.jpg", "rb") as f: - my_bytes = f.read() - -m = hashlib.sha256() -m.update(my_bytes) -print(m.hexdigest()) - -io.imsave("gy.jpg", gy) -with open("gy.jpg", "rb") as f: - my_bytes = f.read() - -m = hashlib.sha256() -m.update(my_bytes) -print(m.hexdigest()) ----- - -.output ----- -966c8530c02913ccc44b922ce9b42e6b85679a743b5e44757dc88ec2adfd21af -e06ff1ed6edb589887a52d7fe154b84a12495d0ab487045e26cb0b34fc0b5402 ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -What we _really_ want is a single gradient that combines both `gx` and `gy`. We can obtain it using the following formula. - -$G = \sqrt{gx^2 + gy^2}$ - -Alternatively, the following would work as well. - -$G = |gx| + |gy|$ - -Decide which formula to implement. Bring everything you've written so far together into a single function called `get_edges`. `get_edges` should accept the image (as a numeric `np.ndarray`, and return the final result, a greyscale image with edges clearly defined. - -You can verify your solution with the following. Note that depending on which method you chose, the resulting hash will be different. We've included both possibilities. - -Which method did you choose and why? - -[source,python] ----- -img = io.imread("/anvil/projects/tdm/data/images/apple.jpg") -img = get_edges(img) -io.imsave("edge.jpg", img) -with open("edge.jpg", "rb") as f: - my_bytes = f.read() - -m = hashlib.sha256() -m.update(my_bytes) -m.hexdigest() ----- - -.output options ----- -6386859f42d9d7664b79d75f2b375058c1d0a61defb9a055caaaa69ad95504ad -3ac023a3900013e000e40812b96f7c120edd921cc483cec2f3d0d547a6e2675b ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -The Sobel filter is very effective, but like most things, has flaws. One such flaw is the sensitivity to noise. There are some ways around that. - -- You could threshold the output. If G is less than a certain value, you can force the value to be 0. -- You can apply another filter to blur the image _prior_ to calculating the gradient estimates (just like we did with the median filter in the previous project!). - -Create two new functions: `get_edgesv1`, and `get_edgesv2`. Version 1 should use the cutoff method and version 2 should use the blur method. - -This question will be graded by looking at the outputted images, since there are many variations of possible result. Play around with the cutoff value in version 1. For version 2, please feel free to use our new `convolve` function to use a _mean_ instead of median blur. - -[TIP] -==== -The `convolve` function makes it _super_ easy to apply a mean blur. Think about what `convolve` does and you should be able to figure out how to create a mean blur really quickly. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Apply your favorite edge detection function that you've built to a new image. How did it work? Why did you like the edge detection function you chose best? Write 1-2 sentences about your choice, and make sure to show the results of your image. - -Feel free to use `/anvil/projects/tdm/data/images/coke.jpg` -- the results are pretty neat! - -.Items to submit -==== -- Code used to solve this problem. 
-- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project08.adoc deleted file mode 100644 index b75a70f58..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project08.adoc +++ /dev/null @@ -1,228 +0,0 @@ -= TDM 40100: Project 8 -- 2022 - -**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this series of projects. - -**Context:** In the previous project, we worked with images and implemented edge detection, with some pretty cool results! In these next couple of projects, we will continue to work with images as we learn how to compress images from scratch. This is the first in a series of 2 projects where we will implement a variation of jpeg image compression! - -**Scope:** Python, images, JAX - -.Learning Objectives -**** -- Process images using `numpy`, `skimage`, and `JAX`. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/images/drward.jpg` -- `/anvil/projects/tdm/data/images/barn.jpg` - -== Questions - -=== Question 1 - -JPEG is a _lossy_ compression format and an example of transform based compression. Lossy compression means that you can't retrieve the information that was lost during the compression process. In a nutshell, these methods use statistics to identify and discard redundant data. - -For this project, we will start by reading in a larger image (than our previously used apple image): `/anvil/projects/tdm/data/images/drward.jpg`. - -[source,python] ----- -from skimage import io, color -from matplotlib.pyplot import imshow -import numpy as np -import jax -import jax.numpy as jnp -import hashlib -from IPython import display - -img = io.imread("/anvil/projects/tdm/data/images/drward.jpg") ----- - -By default, our image is read in as an RGB image, where each pixel is represented as a value between 0 and 255, where the first value represents "red", the second "green", and the third "blue". - -In order to implement our compression algorithm, we need to change the representation of the image to the https://en.wikipedia.org/wiki/YCbCr[YCbCr] color space. Use the https://scikit-image.org/docs/stable/api/skimage.color.html[scikit-image] library we've used in previous projects to convert to the new color space. What are the dimensions now? - -Check out the 3-image example https://en.wikipedia.org/wiki/YCbCr[here] (the barn). Replicate this image by splitting `/anvil/projects/tdm/data/images/barn.jpg` into its YCbCr components and display them. Do the same for our `drward.jpg`. - -[TIP] -==== -To display the YCbCr Y component, you will need to set the Cb and Cr components to 127. 
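-
-For instance, isolating the Y component might look something like the following (a minimal sketch -- the 127 fill value comes from the sentence above, the rest is just the conversion helpers we have already been using).
-
-[source,python]
-----
-from skimage import io, color
-from matplotlib.pyplot import imshow
-
-img = io.imread("/anvil/projects/tdm/data/images/barn.jpg")
-img = color.rgb2ycbcr(img)
-
-# keep Y (channel 0), flatten the two color channels to a neutral 127
-y_only = img.copy()
-y_only[:, :, 1] = 127
-y_only[:, :, 2] = 127
-
-# convert back to RGB to display it
-imshow(color.ycbcr2rgb(y_only))
-----
-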
To display the Cb component, you will need to set the Cr and Y components to 127, etc. You can confirm the results by looking at your `barn.jpg` components and seeing if they look the same as the wikipedia page we linked above. -==== - -.Items to submit -==== -- Display containing the 3 Y, Cb, and Cr components for both `drward.jpg` and `barn.jpg`, for a total of 6 images. -==== - -=== Question 2 - -Our eyes are more sensitive to luminance than to color. As you can tell from the previous question, the Y component captures the luminance, and contains the majority of the image detail that is so important to our vision. The other Cb and Cr components are essentially just color components, and our eyes aren't as sensitive to changes in those components. Since our eyes aren't as sensitive, we don't need to capture that data as accurately, and is an opportunity to reduce what we store! - -Let's perform an experiment that makes this explicitly clear, as well as takes us 1 more step in the direction of having a compressed image. - -Downsample the Cb and Cr components and display the resulting image. There are a variety of ways to do this, but the one we will use right now is essentially to just round the values to the nearest rounded value of a certain factor. For instance, maybe we only want to represent values between 150 and 160 as 150 _or_ 160. So 151.111 becomes 150. 155.555 becomes 160. This could be done as follows. - -[source,python] ----- -10*np.round(img/10) ----- - -Or, if you wanted more granularity, you could do. - -[source,python] ----- -2*np.round(img/2) ----- - -Ultimately, let's refer to this as our _factor_. - -[source,python] ----- -factor*np.round(img/factor) ----- - -Downsample the Cb and Cr components using a factor of 10 and display the resulting image. - -[TIP] -==== -Here is some maybe-useful skeleton code. - -[source,python] ----- -img = io.imread("/anvil/projects/tdm/data/images/barn.jpg") -img = color.rgb2ycbcr(img) -# create "dimg" that contains the downsampled Cb and Cr components -dimg = color.ycbcr2rgb(dimg) -io.imsave("dcolor.jpg", dimg, quality=100) -with open("dcolor.jpg", "rb") as f: - my_bytes = f.read() - -m = hashlib.sha256() -m.update(my_bytes) -print(m.hexdigest()) -display.Image("dcolor.jpg") ----- - -"dcolor" is just a name we chose to mean downsampled color, as in, we've downsampled the color components. - -The hash should be the following. - ----- -7bf01998d636ac71553f6d82da61a784ce50d2ab9f27c67fd16243bf1634583b ----- -==== - -Fantastic! Can you tell a difference by just looking at the original image and the color-downsampled image? - -Okay, let's perform the _same_ operation, but this time, instead of downsampling the Cr and Cb components, let's downsample the Y component (and _only_ the Y component). Downsample using a factor of 10. Display the new image. Can you tell a difference by just looking at the original image and the luminance-downsampled image? - -[TIP] -==== -The hash for the luminance downsampled image should be the following. - ----- -dff9e0688d4367d30aa46615e10701f593f1d283314c039daff95c0324a4424d ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -The previous question was pretty cool (_at least in my opinion_)! It really demonstrates how our brains are much better at perceiving changes in luminance vs. color. - -Downsampling is an important step in the process. 
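-
-If the factor arithmetic still feels abstract, it can help to try it on a few made-up values first (a tiny sketch, nothing more).
-
-[source,python]
-----
-import numpy as np
-
-vals = np.array([151.111, 155.555, 158.2])
-
-# factor of 10: every value snaps to the nearest multiple of 10
-print(10*np.round(vals/10))  # [150. 160. 160.]
-
-# factor of 2: much finer granularity
-print(2*np.round(vals/2))    # [152. 156. 158.]
-----
-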
In the previous question, we essentially learned that we can remove color detail by a factor of 10 and not see a difference! - -The next step in our compression process is to convert our image data into numeric frequency data using a discrete cosine transform. This data representation allows us to quantify what data from the image is important, and what is less important. Lower frequency components are more important, and higher are less important can essentially be considered "noise". - -Create a new function called `dct2` that uses https://docs.scipy.org/doc/scipy/reference/generated/scipy.fftpack.dct.html[scipys dct] function, but performs the same operation over axis 0, and then over axis 1. Use `norm="ortho"`. - -[TIP] -==== -Test it out to verify things are working well. - -[source,python] ----- -test = np.array([[1,2,3],[3,4,5],[5,6,7]]) -dct2(test) ----- - -.output ----- -array([[ 1.20000000e+01, -2.44948974e+00, 4.44089210e-16], - [-4.89897949e+00, 0.00000000e+00, 0.00000000e+00], - [ 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]]) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -For each 8x8 block of pixels in each channel (Y, Cb, Cr), apply the transformation, creating an all new array of frequency data. - -[TIP] -==== -To loop through 8x8 blocks using numpy, check out the results of the following loop. - -[source,python] ----- -img = io.imread("/anvil/projects/tdm/data/images/barn.jpg") -img = color.rgb2ycbcr(img) -s = img.shape -for i in np.r_[:s[0]:8]: - print(np.r_[i:(i+8)]) ----- -==== - -[TIP] -==== -To verify your results, you can try the following. Note that `freq` is the result of applying the `dct2` function to each 8x8 block in the image. - -[source,python] ----- -dimg = color.ycbcr2rgb(freq) -io.imsave("dctimg.jpg", dimg, quality=100) -with open("dctimg.jpg", "rb") as f: - my_bytes = f.read() - -m = hashlib.sha256() -m.update(my_bytes) -print(m.hexdigest()) -display.Image("dctimg.jpg") ----- - -.output ----- -e45dc2a1a832f97bbb3f230ffaf6688d7f50307d6e43020df262314e9dd577e5 ----- -==== - -[TIP] -==== -Another fun (?) way to test is to apply the `dct2` function to every 8x8 block of every channel twice. The resulting image should _kind of_ look like the original. This is because the inverse function is pretty close to the function itself. We will see this in the next project. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project09.adoc deleted file mode 100644 index e7e159685..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project09.adoc +++ /dev/null @@ -1,370 +0,0 @@ -= TDM 40100: Project 9 -- 2022 - -**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this series of projects. 
-
-**Context:** In the previous project, we began implementing a variation of jpeg image compression -- we converted our image to the YCbCr color space and transformed it with the discrete cosine transform. In this project, we will continue to work with images as we finish compressing images from scratch. This is the second of 2 projects where we implement a variation of jpeg image compression!
-
-**Scope:** Python, images, JAX
-
-.Learning Objectives
-****
-- Process images using `numpy`, `skimage`, and `JAX`.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/images/drward.jpg`
-- `/anvil/projects/tdm/data/images/barn.jpg`
-
-== Questions
-
-[NOTE]
-====
-Some links that we found really useful.
-
-- https://en.wikipedia.org/wiki/JPEG
-- https://en.wikipedia.org/wiki/Quantization_(image_processing)
-- https://home.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL17.pdf (if you are interested in Huffman coding)
-====
-
-=== Question 1
-
-In the previous project, we were able to isolate and display the various YCbCr components of our `barn.jpg` image. In addition, we were able to use the discrete cosine transformation to convert each of the channels of our image (Y, Cb, and Cr) to signal data.
-
-Per https://www.mathworks.com/help/images/discrete-cosine-transform.html[mathworks], the discrete cosine transform has the property that visually significant information about an image is concentrated in just a few coefficients of the resulting signal data. Meaning, if we are able to capture the majority of the visually-important data from just a few coefficients, there is a lot of opportunity to _reduce_ the amount of data we need to keep!
-
-Start from the end of the previous project. Load up some libraries.
-
-[source,python]
-----
-from skimage import io, color, viewer
-from matplotlib.pyplot import imshow
-import numpy as np
-import jax
-import jax.numpy as jnp
-import hashlib
-from IPython import display
-import scipy.fftpack
-
-img = io.imread("/anvil/projects/tdm/data/images/barn.jpg")
-----
-
-In addition, load up your `dct2` function, and create a numpy ndarray called `freq` that holds the image data (for `barn.jpg`) converted using the discrete cosine transform.
-
-Let's take a step back, and clarify a couple of things.
-
-. We will not _actually_ be compressing our image, but we will be demonstrating how we can store the image's data with less space, and very little loss of image detail.
-. We will still use a simple method to estimate _about_ how much space we would save if we did compress our image.
-. We will display a "compressed" version of the image. What this means is that you will be able to view the jpeg _after_ it has lost the data it would normally lose during the compression process.
-
-Okay, begin by taking the original RGB `img` and displaying the first 8x8 block of data for each of the R, G, and B channels. Next, display the first 8x8 block of data for the Y, Cb, and Cr channels. Finally, use `dct2` to create the `freq` ndarray (like you did in the previous project). Display the first 8x8 block of the Y, Cb, and Cr channels after the transformation.
-
-[WARNING]
-====
-When we say "display 8x8 blocks" we do not mean show an image -- we mean show the numeric data in the form of a numpy array. That is, an 8x8 numpy array printed out using `np.array_str` (see the next "important" box).
-==== - -[IMPORTANT] -==== -By default, numpy arrays don't print very nicely. Use `np.array_str` to "pretty" print your arrays. - -[source,python] ----- -np.array_str(myarray, precision=2, suppress_small=True) ----- -==== - -[TIP] -==== -To get you started, this would be how you print the R, G, and B channels first 8x8 block. - -[source,python] ----- -img = io.imread("/anvil/projects/tdm/data/images/barn.jpg") -print(np.array_str(img[:8, :8, 0], precision=2, suppress_small=True)) -print(np.array_str(img[:8, :8, 1], precision=2, suppress_small=True)) -print(np.array_str(img[:8, :8, 2], precision=2, suppress_small=True)) ----- -==== - -[TIP] -==== -The following are `dct2` and `idct2`. - -[source,python] ----- -def dct2(x): - out = scipy.fftpack.dct(x, axis=0, norm="ortho") - out = scipy.fftpack.dct(out, axis=1, norm="ortho") - return out ----- - -[source,python] ----- -def idct2(x): - out = scipy.fftpack.idct(x, axis=1, norm="ortho") - out = scipy.fftpack.idct(out, axis=0, norm="ortho") - return out ----- -==== - -[TIP] -==== -If you did not complete the previous project, no worries, please check out question (5). This will provide you with code that lets you efficiently loop through 8x8 blocks for each channel. This is important for creating the `freq` array containing the signal data. - -[source,python] ----- -img = io.imread("/anvil/projects/tdm/data/images/barn.jpg") - -# convert to YCbCr -img = color.rgb2ycbcr(img) -img = img.astype(np.int16) - -s = img.shape -freq = np.zeros(s) - -for channel in range(3): - for i in np.r_[:s[0]:8]: - for j in np.r_[:s[1]:8]: - - # apply dct here ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The output should be 3 sets of 3 8x8 numeric numpy matrices. -- The first matrix should be the R, G, and B channels. -- The second matrix should be the Y, Cb, and Cr channels. -- The third matrix should be the Y, Cb, and Cr channels after being converted to frequency data using `dct2`. -==== - -=== Question 2 - -Take a close look at the final set of 8x8 blocks in the previous question -- the blocks _after_ the DCT was applied. You'll notice the top left corner value is much different than the rest. This is the _DC coefficiant_. The rest are called _AC coefficients_. - -We forgot an important step. _Before_ we apply the `dct2`, we need to shift the our data to be centered around 0 instead of 127. We can do this by subtracting 127 from every value _before_ applying DCT. - -Re-print the first 8x8 block of `freq` after centering -- do the results look much different? According to https://en.wikipedia.org/wiki/JPEG[wikipedia], this step reduces the dynamic range requirements in the DCT processing stage. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The output should be 1 set of 3 8x8 numeric numpy matrices. -- The output should be very close to the third matrix from question (1), but we center the data _before_ applying dct. -==== - -=== Question 3 - -Okay, great! The next step in this process is to quantize our `freq` signal data. You can read more about quantization https://en.wikipedia.org/wiki/Quantization_(image_processing)[here]. Apparently, the human brain is not very good at distinguishing changes in high frequency parts of our data, but good at distinguishing low frequency changes. - -We can use a quantization matrix to filter out the higher frequency data and maintain the lower frequency data. 
One of the more common quantization matrices is the following. - -[source,python] ----- -q1 = np.array([[16,11,10,16,24,40,51,61], - [12,12,14,19,26,28,60,55], - [14,13,16,24,40,57,69,56], - [14,17,22,29,51,87,80,62], - [18,22,37,56,68,109,103,77], - [24,35,55,64,81,104,113,92], - [49,64,78,87,103,121,120,101], - [72,92,95,98,112,100,103,99]]) -print(np.array_str(q1, precision=2, suppress_small=True)) ----- - -[quote, , wikipedia] -____ -The quantization matrix is designed to provide more resolution to more perceivable frequency components over less perceivable components (usually lower frequencies over high frequencies) in addition to transforming as many components to 0, which can be encoded with greatest efficiency. -____ - -Take the `freq` signal data and divide the first 8x8 block by the quantization matrix. Use `np.round` to immediately round the values to the nearest integer. Use `np.array_str` to once again, display the resulting, quantized 8x8 block, for each of the 3 channels. - -Wow! The results are interesting, and _this_ is where the _majority_ of the actual data loss (and compression) takes place. Let's take a minute to explain what would happen next. - -. The data would be encoded by first using https://en.wikipedia.org/wiki/Run-length_encoding[run-length encoding] -. Then, the data would be encoded by using https://en.wikipedia.org/wiki/Huffman_coding[Huffman coding]. -+ -[NOTE] -==== -The details are beyond this course, however, it is not _too_ inaccurate to say that the zeros essentially don't need to be stored anymore. So for our first 8x8 block, we went from needing to store about 64 values to only 1, for each channel for a total of 192 to 3. -==== -+ -. The encoded data, and all of the information (huffman tables, quantization tables, etc.) needed to _reverse_ the process and _restore_ the image would be structure carefully and stored as a jpeg file. - -Then, when some goes to _open_ the image, the jpeg file contains all of the information needed to _reverse_ the process and the image is displayed! - -You may be wondering -- wait, you are saying we can take those super sparse matrices we just printed and get back to our original RGB values? Nope! But we can recover the "important stuff" that creates an image that looks visually identical to our original image! This would be, in effect, the same image we would see if we implemented the entire algorithm and displayed the resulting image! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Output should be 1 set of 3 8x8 matrices that apply the quantization matrix and rounding after dct. -==== - -=== Question 4 - -Use the following `idct2` function (the inverse of `dct2`) and print out the first 8x8 for each channel _after_ the process has been inversed. Starting with the quantized `freq` data from the previous question, the inverse process would be the following. - -. Multiply by the quantization table. -. Use the `idct2` function to reverse the dct. -. Add 127 to the final result to undo the shift highlighted in question (2). - -Use `np.array_str` to print the first 8x8 block for each channel. Do the results look fairly close to the original YCbCr channel values? Impressive! - -[TIP] -==== -[source,python] ----- -def idct2(x): - out = scipy.fftpack.idct(x, axis=1, norm="ortho") - out = scipy.fftpack.idct(out, axis=0, norm="ortho") - return out ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-- Output should be 1 set of 3 8x8 matrices that apply the quantization matrix, round, de-apply quantization matrix, perform inverse dct, and de-shift the values. These matrices should be _nearly_ the same as the original YCbCr values from question (1). -==== - -=== Question 5 - -Let's put it all together! While we aren't fully implementing the compression algorithm, we _do_ implement the parts that cause loss (hence jpeg is a _lossy_ algorithm). Since we implement those parts, we should also be able to view the lossy version of the image to see if we can perceive a difference! In addition, we could also count the number of non-zero values in our image data _before_ we process anything, and re-count immediately after the quantization and rounding, where many zeros appear in our matrices. This will _quite roughly_ tell us the savings if we were to implement the entire algorithm! - -[TIP] -==== -You can use https://numpy.org/doc/stable/reference/generated/numpy.count_nonzero.html#numpy.count_nonzero[np.count_nonzero] to count the non-zero values of an array. -==== - -For our `barn.jpg` image, walk through the entire algorithm (excluding the encoding parts). Reverse the process after quantization and rounding, all the way back to saving and displaying the lossy image. Since this has been a bit of a roller coaster project, we will provide some skeleton code for you to complete. - -[source,python] ----- -img = io.imread("/anvil/projects/tdm/data/images/barn.jpg") - -# TODO: count the nonzero values before anything -original_nonzero = - -q1 = np.array([[16,11,10,16,24,40,51,61], - [12,12,14,19,26,28,60,55], - [14,13,16,24,40,57,69,56], - [14,17,22,29,51,87,80,62], - [18,22,37,56,68,109,103,77], - [24,35,55,64,81,104,113,92], - [49,64,78,87,103,121,120,101], - [72,92,95,98,112,100,103,99]]).astype(np.int16) - -# convert to YCbCr -img = color.rgb2ycbcr(img) -img = img.astype(np.int16) - -# TODO: shift values to center around 0, for each channel - -s = img.shape -freq = np.zeros(s) - -# downsample <- from previous project -img[:,:,1] = 2*np.round(img[:,:,1]/2) -img[:,:,2] = 2*np.round(img[:,:,2]/2) - -# variable to store number of non-zero values -nonzero = 0 - -for channel in range(3): - for i in np.r_[:s[0]:8]: - for j in np.r_[:s[1]:8]: - - # Example: printing a 8x8 block - # Note: this can (and should) be deleted - print(freq[i:(i+8), j:(j+8), channel]) - - # TODO: apply dct to current 8x8 block - - - # TODO: apply quantization to current 8x8 block - - - # TODO: round values of the current 8x8 block - - - # TODO: increment our count of non-zero values - - - # TODO: de-quantize the current 8x8 block - - - # TODO: apply inverse dct to current 8x8 block - - - -# TODO: de-shift the values that were previous shifted, for each channel - -# convert back to RGB -img = color.ycbcr2rgb(freq) - -# print the number of nonzero values immediately post-quantization -print(f"Non-zero values: {nonzero}") - -# print the _very_ approximate reduction of space for this image -print(f"Reduction: {nonzero/original_nonzero}") - -# multiply image by 255 to rescale values to be between 0 and 255 instead of 0 and 1 -img = img*255 - -# TODO: clip values greater than 255 and set those values equal to 255 - -# TODO: clip values less than 0 and set those values equal to 0 - -# save the "compressed" image so we can display it -# NOTE: The file won't _actually_ be compressed, but it will be visually identical to a compressed image -# since the lossy parts of the algorithm (the parts of the algorithm where we lose "unimportant" 
pieces of data)
-# have already taken place.
-io.imsave("compressed.jpg", img, quality=100)
-with open("compressed.jpg", "rb") as f:
-    my_bytes = f.read()
-
-m = hashlib.sha256()
-m.update(my_bytes)
-print(m.hexdigest())
-display.Image("compressed.jpg")
-----
-
-[source,python]
-----
-# display the original image, for comparison
-display.Image("/anvil/projects/tdm/data/images/barn.jpg")
-----
-
-[TIP]
-====
-The hash I got was the following.
-
-.hash
-----
-bc004579948c5b699b0df52eb69ce168147481a2430d828939cfa791f59783e7
-----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project10.adoc
deleted file mode 100644
index c4ec25ed3..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project10.adoc
+++ /dev/null
@@ -1,199 +0,0 @@
-= TDM 40100: Project 10 -- 2022
-
-**Motivation:** In general, scraping data from websites has always been a popular topic in The Data Mine. In addition, it was one of the requested topics. For the remaining projects, we will be doing some scraping of housing data, and potentially: `sqlite3`, containerization, and analysis work as well.
-
-**Context:** This is the first in a series of 4 projects with a focus on web scraping that incorporates a variety of skills we've touched on in previous data mine courses. For this first project, we will start slowly with a `selenium` review and a small scraping challenge.
-
-**Scope:** selenium, Python, web scraping
-
-.Learning Objectives
-****
-- Use selenium to interact with a web page prior to scraping.
-- Use selenium and xpath expressions to efficiently scrape targeted data.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Questions
-
-=== Question 1
-
-The following code provides you with both a template for configuring a Firefox browser selenium driver that will work on Anvil, as well as a straightforward example that demonstrates how to search web pages and elements using xpath expressions, and how to emulate a keyboard. Take a moment, run the code, and try to jog your memory.
- -[source,python] ----- -import time -from selenium import webdriver -from selenium.webdriver.firefox.options import Options -from selenium.webdriver.common.desired_capabilities import DesiredCapabilities -from selenium.webdriver.common.keys import Keys ----- - -[source,python] ----- -firefox_options = Options() -firefox_options.add_argument("--window-size=810,1080") -# Headless mode means no GUI -firefox_options.add_argument("--headless") -firefox_options.add_argument("--disable-extensions") -firefox_options.add_argument("--no-sandbox") -firefox_options.add_argument("--disable-dev-shm-usage") - -driver = webdriver.Firefox(options=firefox_options) ----- - -[source,python] ----- -# navigate to the webpage -driver.get("https://purdue.edu/directory") - -# full page source -print(driver.page_source) - -# get html element -e = driver.find_element("xpath", "//html") - -# print html element -print(e.get_attribute("outerHTML")) - -# isolate the search bar "input" element -# important note: the following actually searches the entire DOM, not just the element e -inp = e.find_element("xpath", "//input") - -# to start with the element e and _not_ search the entire DOM, you'd do the following -inp = e.find_element("xpath", ".//input") -print(inp.get_attribute("outerHTML")) - -# use "send_keys" to type in the search bar -inp.send_keys("mdw") - -# just like when you use a browser, you either need to push "enter" or click on the search button. This time, we will press enter. -inp.send_keys(Keys.RETURN) - -# We can delay the program to allow the page to load -time.sleep(5) - -# get the table -table = driver.find_element("xpath", "//table[@class='more']") - -# print the table content -print(table.get_attribute("outerHTML")) ----- - -Use `selenium` to isolate and print out Dr. Ward's: alias, email, campus, department, and title. - -[TIP] -==== -The `following-sibling` axis may be useful here -- see: https://stackoverflow.com/questions/11657223/xpath-get-following-sibling. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Use `selenium` and its `click` method to first click the "VIEW MORE" link and then scrape and print: other phone, building, office, qualified name, and url. - -Take a look at the page source -- do you think clicking "VIEW MORE" was needed in order to scrape that data? Why or why not? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Okay, finally, we are building some tools to help us analyze the housing market. https://zillow.com has extremely rich data on homes for sale, for rent, and lots of land. - -Click around and explore the website a little bit. Note the following. - -. Homes are typically list on the right hand side of the web page in a 21x2 set of "cards", for a total of 40 homes. -+ -[NOTE] -==== -At least in my experimentation -- the last row only held 1 card and there was 1 advertisement card, which I consider spam. -==== -. If you want to search for homes for sale, you can use the following link: `https://www.zillow.com/homes/for_sale/{search_term}_rb/`, where `search_term` could be any hyphen separated set of phrases. For example, to search Lafayette, IN, you could use: https://www.zillow.com/homes/for_sale/lafayette-in_rb. -. If you want to search for homes for rent, you can use the following link: `https://www.zillow.com/homes/for_rent/{search_term}_rb/`, where `search_term` could be any hyphen separated set of phrases. 
For example, to search Lafayette, IN, you could use: https://www.zillow.com/for_rent/lafayette-in_rb. -. If you load, for example, https://www.zillow.com/homes/for_rent/lafayette-in_rb and rapidly scroll down the right side of the screen where the "cards" are shown, it will take a fraction of a second for some of the cards to load. In fact, unless you scroll, those cards will not load and if you were to parse the page contents, you would not find all 40 cards are loaded. This general strategy of loading content as the user scrolls is called lazy loading. - -Write a function called `get_links` that, given a `search_term`, will return a list of property links for the given `search_term`. The function should both get all of the cards on a page, but cycle through all of the pages of homes for the query. - -[TIP] -==== -The following was a good query that had only 2 pages of results. - -[source,python] ----- -my_links = get_links("47933") ----- -==== - -[TIP] -==== -You _may_ want to include an internal helper function called `_load_cards` that accepts the driver and scrolls through the page slowly in order to load all of the cards. - -https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python[This] link will help! Conceptually, here is what we did. - -. Get initial set of cards using xpath expressions. -. Use `driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1])` to scroll to the last card that was found in the DOM. -. Find cards again (now that more may have loaded after scrolling). -. If no more cards were loaded, exit. -. Update the number of cards we've loaded and repeat. -==== - -[TIP] -==== -Sleep 2 seconds using `time.sleep(2)` between every scroll or link click. -==== - -[TIP] -==== -After getting the links for each page, use `driver.delete_all_cookies()` to clear off cookies and help avoid captcha. -==== - -[TIP] -==== -If you using the link from the "next page" button to get the next page, instead, use `next_page.click()` to click on the link. Otherwise, you may get a captcha. -==== - -[TIP] -==== -Use something like: - -[source,python] ----- -with driver as d: - d.get(blah) ----- - -This way, after exiting the `with` scope, the driver will be properly closed and quit which will decrease the liklihood of you getting captchas. -==== - -[TIP] -==== -For our solution, we had a `while True:` loop in the `_load_cards` function and in the `get_links` function and used the `break` command in an if statement to exit. -==== - -[TIP] -==== -Need more help? Post in Piazza and I will help get you unstuck and give more hints. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project11.adoc deleted file mode 100644 index 815ceef4e..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project11.adoc +++ /dev/null @@ -1,452 +0,0 @@ -= TDM 40100: Project 11 -- 2022 - -**Motivation:** In general, scraping data from websites has always been a popular topic in The Data Mine. In addition, it was one of the requested topics. For the remaining projects, we will be doing some scraping of housing data, and potentially: `sqlite3`, containerization, and analysis work as well. - -**Context:** This is the second in a series of 4 projects with a focus on web scraping that incorporates of variety of skills we've touched on in previous data mine courses. For this second project, we continue to build our suite of tools designed to scrape public housing data. - -**Scope:** selenium, Python, web scraping - -.Learning Objectives -**** -- Use selenium to interact with a web page prior to scraping. -- Use selenium and xpath expressions to efficiently scrape targeted data. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -If you did not complete the previous project, we will provide you with the code for the `get_links` function on Monday, November 14th, below. - -[source,python] ----- -def get_links(search_term: str) -> list[str]: - """ - Given a search term, return a list of web links for all of the resulting properties. - """ - def _load_cards(driver): - """ - Given the driver, scroll through the cards - so that they all load. - """ - cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]") - while True: - try: - num_cards = len(cards) - driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1]) - time.sleep(2) - cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]") - if num_cards == len(cards): - break - num_cards = len(cards) - except StaleElementReferenceException: - # every once in a while we will get a StaleElementReferenceException - # because we are trying to access or scroll to an element that has changed. - # this probably means we can skip it because the data has already loaded. - continue - - links = [] - url = f"https://www.zillow.com/homes/for_sale/{'-'.join(search_term.split(' '))}_rb/" - - firefox_options = Options() - # Headless mode means no GUI - firefox_options.add_argument("--headless") - firefox_options.add_argument("--disable-extensions") - firefox_options.add_argument("--no-sandbox") - firefox_options.add_argument("--disable-dev-shm-usage") - driver = webdriver.Firefox(options=firefox_options) - - with driver as d: - d.get(url) - d.delete_all_cookies() - while True: - time.sleep(2) - _load_cards(d) - links.extend([e.get_attribute("href") for e in d.find_elements("xpath", "//a[@data-test='property-card-link' and @class='property-card-link']")]) - next_link = d.find_element("xpath", "//a[@rel='next']") - if next_link.get_attribute("disabled") == "true": - break - url = next_link.get_attribute('href') - d.delete_all_cookies() - next_link.click() - - return links ----- - -There is a _lot_ of rich data on a home's page. 
If you want to gauge the housing market in an area or for a `search_term`, there are two pieces of data that could be particularly useful: the "Price history" and "Public tax history" components of the page. - -Check out https://zillow.com links for a couple different houses. - -Let's say you want to track the `date`, `event`, and `price` in a `price_history` table, and the `year`, `property_tax`, and `tax_assessment` in a `tax_history` table. - -Write 2 `CREATE TABLE` statements to create the `price_history` and `tax_history` tables. In addition, create a `houses` table where the `NUMBER_zpid` is the primary key, and `html`, which will store an HTML file. You can find the id in a house's link. For example, https://www.zillow.com/homedetails/2180-N-Brentwood-Cir-Lecanto-FL-34461/43641432_zpid/ has the id `43641432_zpid`. - -Use `sqlite3` to create the tables in a database called `$HOME/houses.db`. You can do all of this from within Jupyter Lab. - -[source,ipython] ----- -%sql sqlite:///$HOME/houses.db ----- - -[source,ipython] ----- -%%sql - -CREATE TABLE ... ----- - -Run the following queries to confirm and show your table schemas. - -[source, sql] ----- -PRAGMA table_info(houses); ----- - -[source, sql] ----- -PRAGMA table_info(price_history); ----- - -[source, sql] ----- -PRAGMA table_info(tax_history); ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Write a function called `link_to_blob` that takes a `link` and returns a `blob` of the HTML file. - -. Navigate to page. -. Sleep 2 seconds. -. Scroll so elements load up. (think "Price and tax history" and clicking "See complete tax history", and clicking "See complete price history", etc.) -. Create a `.html` file and `write` the driver's `page_source` to the file. -. Open the file in `rb` mode and use the `read` method to read the file into binary format. Return the binary format object. -. Delete the `.html` file from step (1). -. Quit the driver by calling `driver.quit()`. - -In addition, write a function called `blob_to_html` that accepts a blob (like what is returned from `link_to_blob`) and returns the string containing the HTML content. - -Demonstrate the functions by using `link_to_blob` to get the blob for a link, and then using `blob_to_html` to get the HTML content back from the returned value of `link_to_blob`. - -[IMPORTANT] -==== -Just print the first 500 characters of the results of `blob_to_html` to avoid cluttering your output. -==== - -[NOTE] -==== -If you are unsure how to do any of this -- please feel free to post in Piazza! -==== - -[TIP] -==== -Here is some skeleton code. The structure provided here works well for the problem. - -[source,python] ----- -import uuid -import os - -def link_to_blob(link: str) -> bytes: - def _load_tables(driver): - """ - Given the driver, scroll through the cards - so that they all load. - """ - # find price and tax history element using xpath - table = driver.find_element(...) - - # scroll the table into view - driver.execute_script(...) - - # sleep 2 seconds - time.sleep(2) - - try: - # find the "See complete tax history" button (if it exists) - see_more = driver.find_element(...) - - # click the button to reveal the rest of the history (if it exists) - see_more.click() - - except NoSuchElementException: - pass - - try: - # find the "See complete price history" button (if it exists) - see_more = driver.find_element(...) 
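            # hint: one way to locate this button (it is what the full solution later in this
            # project series uses) is to match on the button text with an xpath such as
            # "//span[contains(text(), 'See complete tax history')]"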
- - # click the button to reveal the rest of the history (if it exists) - see_more.click() - - except NoSuchElementException: - pass - - # create a .html file with a random name using the uuid package so there aren't collisions - filename = f"{uuid.uuid4()}.html" - - # open the file - with open(filename, 'w') as f: - firefox_options = Options() - # Headless mode means no GUI - firefox_options.add_argument("--headless") - firefox_options.add_argument("--disable-extensions") - firefox_options.add_argument("--no-sandbox") - firefox_options.add_argument("--disable-dev-shm-usage") - - driver = webdriver.Firefox(options=firefox_options) - driver.get(link) - time.sleep(2) - _load_tables(driver) - - # write the page source to the file - f.write(...) - driver.quit() - - # open the file in read binary mode - with open(filename, 'rb') as f: - # read the binary contents that are ready to be inserted into a sqlite BLOB - blob = f.read() - - # remove the file from the filesystem -- we don't need it anymore - os.remove(filename) - - return blob ----- -==== - -[TIP] -==== -Use this trick: https://the-examples-book.com/starter-guides/data-formats/xml#write-an-xpath-expression-to-get-every-div-element-where-the-string-abc123-is-in-the-class-attributes-value-as-a-substring for finding and clicking the “see more” buttons for the two tables. If you dig into the HTML youll see there is some text you can use to jump right to the two tables. - -To add to this, if instead of `@class, 'abc'` you use `text(), 'abc'` it will try to match the values between elements to "abc". For example, `//div[contains(text(), 'abc')]` will match `
<div>abc</div>
`. -==== - -[TIP] -==== -Remember the goal of this problem is to click the "see more" buttons (if they exist on a given page), and then just save the whole HTML page and convert it to binary for storage. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Write functions that accept html content (as a string) and uses the `lxml.html` package to parse the HTML content and extract the various components for our `price_history` and `tax_history` tables. - -[TIP] -==== -My functions returned list of lists since the `sqlite3` python package will accept that format in an `executemany` statement. -==== - -[TIP] -==== -[source,python] ----- -import lxml.html - -tree = lxml.html.fromstring(blob_to_html(my_blob)) -tree.xpath("blah") ----- -==== - -[TIP] -==== -Here is some example output from my functions -- you do not need to match this if you have a better way to do it. - -[source,python] ----- -my_blob = link_to_blob("https://www.zillow.com/homedetails/2180-N-Brentwood-Cir-Lecanto-FL-34461/43641432_zpid/") -get_price_history(blob_to_html(my_blob)) ----- - -Where - -[source,python] ----- -def blob_to_html(blob: bytes) -> str: - return blob.decode("utf-8") ----- - -.output ----- -[['11/9/2022', 'Price change', 275000], - ['11/2/2022', 'Listed for sale', 289900], - ['1/13/2000', 'Sold', 19000]] ----- - -[source,python] ----- -my_blob = link_to_blob("https://www.zillow.com/homedetails/2180-N-Brentwood-Cir-Lecanto-FL-34461/43641432_zpid/") -get_tax_history(blob_to_html(my_blob)) ----- - -.output ----- -[[2021, 1344, 124511], - [2020, 1310, 122792], - [2019, 1290, 120031], - [2018, 1260, 117793], - [2017, 1260, 115370], - [2016, 1252, 112997], - [2015, 1262, 112212], - [2014, 1277, 113120], - [2013, 1295, 112920], - [2012, 1389, 124535], - [2011, 1557, 134234], - [2010, 1495, 132251], - [2009, 1499, 128776], - [2008, 1483, 128647], - [2007, 1594, 124900], - [2006, 1608, 121900], - [2005, 1704, 118400], - [2004, 1716, 115000], - [2003, 1624, 112900], - [2002, 1577, 110300], - [2000, 288, 15700]] ----- -==== - -[TIP] -==== -Some skeleton hints if you want extra help. See discussion: https://piazza.com/class/l6usy14kpkk66n/post/lalzk6hi8ark - -[source,python] ----- -def get_price_history(html: str): - tree = lxml.html.fromstring(html) - # xpath to find the price and tax history table - # then, you can use the xpath `following-sibling::div` to find the `div` that directly follows the - # price and tax history div (hint, look for "Price-and-tax-history" in the id attribute of a div element - # after the "following-sibling::div" part, look for elements with an id attribute - trs = tree.xpath(...) - values = [] - for tr in trs: - # xpath on the "tr" to find td with an inner span. Use string methods to remove the $ and remove the ",", and to remove trailing whitespace - price = tr.xpath(...)[2].text.replace(...).replace(...).strip() - - # if price is empty, make it None - if price == '': - price = None - - # append the values - values.append([tr.xpath(...)[0].text, tr.xpath(...)[1].text, price]) - - return values ----- -==== - -[TIP] -==== -More skeleton code help, if wanted. 
See discussion: https://piazza.com/class/l6usy14kpkk66n/post/lalzk6hi8ark - -[source,python] ----- -def get_tax_history(html: str): - tree = lxml.html.fromstring(html) - try: - # find the 'Price-and-tax-history' div, then, the following-sibling::div, then a table element, then a tbody element - tbody = tree.xpath("//div[@id='Price-and-tax-history']/following-sibling::div//table//tbody")[1] - except IndexError: - return None - values = [] - # get the trs in the tbody - for tr in tbody.xpath(".//tr"): - # replace the $, ",", and "-", strip whitespace - prop_tax = tr.xpath(...)[1].text.replace(...).replace(...).replace(...).strip() - # if prop_tax is empty set to None - if prop_tax == '': - prop_tax = None - # add the data, for the last item in the list, remove $ and "," - values.append([int(tr.xpath(...)[0].text), prop_tax, int(tr.xpath(...)[2].text.replace(...).replace(...))]) - - return values ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Write code that uses the `get_links` function to get a list of links for a `search_term`. Process each link in the list and insert the retrieved data into your `houses.db` database. - -Once complete, run a couple queries that demonstrate that the data was successfully inserted into the database. - -[TIP] -==== -Here is some skeleton code to assist. - -[source,python] ----- -import sqlite3 -from urllib.parse import urlsplit -from tqdm.notebook import tqdm - -links = get_links("47933") - -# connect to database -con = sqlite3.connect(...) -for link in tqdm(links): # this shows a progress bar for assistance - - # use link_to_blob to get the blob - - # use urlsplit to extract the zpid from the link - - # add values to a tuple for insertion into the database - to_insert = (linkid, blob) - - # get a cursor - cur = con.cursor() - - # insert the data into the houses table using the cursor - - # get price history data to insert - to_insert = get_price_history(blob_to_html(blob)) - - # insert id into price history data - for val in to_insert: - val.insert(0, linkid) - - # insert the data into the price_history table using the cursor - - # prep the tax history data in the exact same way as price history - - # if there is tax history data, insert the ids just like before - - # insert the data into the tax_history table using the cursor - - # commit the changes - con.commit() - -# close the connection -con.close() ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project12.adoc deleted file mode 100644 index c1f1ffbc6..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project12.adoc +++ /dev/null @@ -1,452 +0,0 @@ -= TDM 40100: Project 12 -- 2022 - -**Motivation:** In general, scraping data from websites has always been a popular topic in The Data Mine. 
In addition, it was one of the requested topics. For the remaining projects, we will be doing some scraping of housing data, and potentially: `sqlite3`, containerization, and analysis work as well. - -**Context:** This is the third in a series of 4 projects with a focus on web scraping that incorporates of variety of skills we've touched on in previous data mine courses. For this second project, we continue to build our suite of tools designed to scrape public housing data. - -**Scope:** playwright, Python, web scraping - -.Learning Objectives -**** -- Use playwright to interact with a web page prior to scraping. -- Use playwright and xpath expressions to efficiently scrape targeted data. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -This has been (maybe) a bit intense for a project series. This project is going to give you a little break and not give you _anything_ new to do, except changing the package we are using. - -`playwright` is a modern web scraping tool backed by Microsoft, that, like `selenium`, allows you to interact with a web page before scraping. `playwright` is not necessarily better (yet), however, it _is_ different, and actively maintained. - -Implement the `get_links`, and `link_to_blob` functions using `playwright` instead of `selenium`. You can find the documentation for `playwright` xref:https://playwright.dev/python/docs/intro[here]. - -Before you get started, you will need to run the following in a `bash` cell. - -[source,ipython] ----- -%%bash - -python3 -m playwright install ----- - -Finally, we aren't going to force you to fight with the `playwright` documentation to get started, so the following is an example of code that will run in a Jupyter notebook, _and_ perform many of the basic/same operations you are acustomed to with `selenium`. - -[source,python] ----- -import time -import asyncio -from playwright.async_api import async_playwright - -# so we can run this from within Jupyter, which is already async -import nest_asyncio -nest_asyncio.apply() - -async def main(): - async with async_playwright() as p: - browser = await p.firefox.launch(headless=True) - context = await browser.new_context() - page = await context.new_page() - await page.goto("https://purdue.edu/directory") - - # print the page source - print(await page.content()) - - # get html element - e = page.locator("xpath=//html") - - # print the inner html of the element - print(await e.inner_html()) - - # isolate the search bar "input" element - inp = e.locator("xpath=.//input") - - # print the outer html, or the element and contents - print(await inp.evaluate("el => el.outerHTML")) - - # fill the input with "mdw" - await inp.fill("mdw") - print(await inp.evaluate("el => el.outerHTML")) - - # find the search button and click it - await page.locator("xpath=//a[@id='glass']").click() - - # We can delay the program to allow the page to load - time.sleep(5) - - # find the table in the page with dr. wards content - table = page.locator("xpath=//table[@class='more']") - - # print the table and contents - print(await table.evaluate("el => el.outerHTML")) - - # find the alias, if a selector starts with // or .. 
it is assumed to be xpath - print(await page.locator("//th[@class='icon-key']").evaluate("el => el.outerHTML")) - - # you can print an attribute - print(await page.locator("//th[@class='icon-key']").get_attribute("scope")) - - # similarly, you can print an elements content - print(await page.locator("//th[@class='icon-key']").inner_text()) - - # you could use the regular xpath stuff, no problem - print(await page.locator("//th[@class='icon-key']/following-sibling::td").inner_text()) - - await browser.close() - -asyncio.run(main()) ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Implement the `get_links` function using `playwright`. Test it out so the exmaple below is the same (or close, listed houses may change). - -[TIP] -==== -Here is the `selenium` version. - -[source,python] ----- -def get_links(search_term: str) -> list[str]: - """ - Given a search term, return a list of web links for all of the resulting properties. - """ - def _load_cards(driver): - """ - Given the driver, scroll through the cards - so that they all load. - """ - cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]") - while True: - try: - num_cards = len(cards) - driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1]) - time.sleep(2) - cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]") - if num_cards == len(cards): - break - num_cards = len(cards) - except StaleElementReferenceException: - # every once in a while we will get a StaleElementReferenceException - # because we are trying to access or scroll to an element that has changed. - # this probably means we can skip it because the data has already loaded. - continue - - links = [] - url = f"https://www.zillow.com/homes/for_sale/{'-'.join(search_term.split(' '))}_rb/" - - firefox_options = Options() - # Headless mode means no GUI - firefox_options.add_argument("--headless") - firefox_options.add_argument("--disable-extensions") - firefox_options.add_argument("--no-sandbox") - firefox_options.add_argument("--disable-dev-shm-usage") - driver = webdriver.Firefox(options=firefox_options) - - with driver as d: - d.get(url) - d.delete_all_cookies() - while True: - time.sleep(2) - _load_cards(d) - links.extend([e.get_attribute("href") for e in d.find_elements("xpath", "//a[@data-test='property-card-link' and @class='property-card-link']")]) - next_link = d.find_element("xpath", "//a[@rel='next']") - if next_link.get_attribute("disabled") == "true": - break - url = next_link.get_attribute('href') - d.delete_all_cookies() - next_link.click() - - return links ----- -==== - -[TIP] -==== -Use the `set_viewport_size` function to change the browser's width to 960 and height to 1080. -==== - -[TIP] -==== -Don't forget to `await` the async functions -- this is going to be the most likely source of errors. -==== - -[TIP] -==== -Unlike in `selenium`, in `playwright`, you won't be able to do something like this: - -[source,python] ----- -# wrong -cards = page.locator("xpath=//article[starts-with(@id, 'zpid')]") -len(cards) # get the number of cards found ----- - -Instead, you'll have to use the useful https://playwright.dev/docs/api/class-locator#locator-count[`count`] function to get the nth element in the list of cards. 
- -[source,python] ----- -cards = page.locator("xpath=//article[starts-with(@id, 'zpid')]") -num_cards = await cards.count() ----- -==== - -[TIP] -==== -Unlike in `selenium`, in `playwright`, you won't be able to do something like this: - -[source,python] ----- -# wrong -cards = page.locator("xpath=//article[starts-with(@id, 'zpid')]") -await cards[num_cards-1].scroll_into_view_if_needed() ----- - -Instead, you'll have to use the useful https://playwright.dev/docs/api/class-locator#locator-nth[`nth`] function to get the nth element in the list of cards. - -[source,python] ----- -cards = page.locator("xpath=//article[starts-with(@id, 'zpid')]") -await cards.nth(num_cards-1).scroll_into_view_if_needed() ----- -==== - -[TIP] -==== -To clear cookies, search for "cookie" in the playwright documentation. Hint: you can clear cookies using the context object. -==== - -[TIP] -==== -This following provides a working skeleton to run the asynchronous code in Jupyter. - -[source,python] ----- -import time -import asyncio -from playwright.async_api import async_playwright, expect - -import nest_asyncio -nest_asyncio.apply() - -async def get_links(search_term: str) -> list[str]: - """ - Given a search term, return a list of web links for all of the resulting properties. - """ - async def _load_cards(page): - """ - Given the driver, scroll through the cards - so that they all load. - """ - pass - - links = [] - url = f"https://www.zillow.com/homes/for_sale/{'-'.join(search_term.split(' '))}_rb/" - async with async_playwright() as p: - browser = await p.firefox.launch(headless=True) - context = await browser.new_context() - page = await context.new_page() - - # code - - time.sleep(10) # useful if using headful mode (not headless) - await browser.close() - - return links - -my_links = asyncio.run(get_links("47933")) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Implement the `link_to_blob` function using `playwright`. Test it out so the example below functions. - -[TIP] -==== -The `selenium` version will be posted below on Monday, November 21. - -[source,python] ----- -import uuid -import os - -def link_to_blob(link: str) -> bytes: - def _load_tables(driver): - """ - Given the driver, scroll through the cards - so that they all load. 
- """ - table = driver.find_element("xpath", "//div[@id='Price-and-tax-history']") - driver.execute_script('arguments[0].scrollIntoView();', table) - time.sleep(2) - try: - see_more = driver.find_element("xpath", "//span[contains(text(), 'See complete tax history')]") - see_more.click() - except NoSuchElementException: - pass - try: - see_more = driver.find_element("xpath", "//span[contains(text(), 'See complete price history')]") - see_more.click() - except NoSuchElementException: - pass - - filename = f"{uuid.uuid4()}.html" - with open(filename, 'w') as f: - firefox_options = Options() - # Headless mode means no GUI - firefox_options.add_argument("--headless") - firefox_options.add_argument("--disable-extensions") - firefox_options.add_argument("--no-sandbox") - firefox_options.add_argument("--disable-dev-shm-usage") - - driver = webdriver.Firefox(options=firefox_options) - driver.get(link) - time.sleep(2) - _load_tables(driver) - f.write(driver.page_source) - driver.quit() - - with open(filename, 'rb') as f: - blob = f.read() - - os.remove(filename) - - return blob ----- -==== - -[TIP] -==== -The `get_price_history` and `get_tax_history` solutions will be posted below on Monday, Novermber 21. - -[source,python] ----- -import lxml.html - -def get_price_history(html: str): - tree = lxml.html.fromstring(html) - trs = tree.xpath("//div[@id='Price-and-tax-history']/following-sibling::div//tr[@id]") - values = [] - for tr in trs: - price = tr.xpath(".//td/span")[2].text.replace("$", "").replace(",", "").strip() - if price == '': - price = None - values.append([tr.xpath(".//td/span")[0].text, tr.xpath(".//td/span")[1].text, price]) - - return values - - -def get_tax_history(html: str): - tree = lxml.html.fromstring(html) - try: - tbody = tree.xpath("//div[@id='Price-and-tax-history']/following-sibling::div//table//tbody")[1] - except IndexError: - return None - values = [] - for tr in tbody.xpath(".//tr"): - prop_tax = tr.xpath(".//td/span")[1].text.replace("$", "").replace(",", "").replace("-", "").strip() - if prop_tax == '': - prop_tax = None - values.append([int(tr.xpath(".//td/span")[0].text), prop_tax, int(tr.xpath(".//td/span")[2].text.replace("$", "").replace(",", ""))]) - - return values ----- -==== - -[TIP] -==== -To test your code run the following. 
- -[source,python] ----- -my_blob = asyncio.run(link_to_blob("https://www.zillow.com/homedetails/2180-N-Brentwood-Cir-Lecanto-FL-34461/43641432_zpid/")) - -def blob_to_html(blob: bytes) -> str: - return blob.decode("utf-8") - -get_price_history(blob_to_html(my_blob)) ----- - -.output ----- -[['11/9/2022', 'Price change', '275000'], - ['11/2/2022', 'Listed for sale', '289900'], - ['1/13/2000', 'Sold', '19000']] ----- - -[source,python] ----- -my_blob = asyncio.run(link_to_blob("https://www.zillow.com/homedetails/2180-N-Brentwood-Cir-Lecanto-FL-34461/43641432_zpid/")) - -def blob_to_html(blob: bytes) -> str: - return blob.decode("utf-8") - -get_tax_history(blob_to_html(my_blob)) ----- - -.output ----- -[[2021, '1344', 124511], - [2020, '1310', 122792], - [2019, '1290', 120031], - [2018, '1260', 117793], - [2017, '1260', 115370], - [2016, '1252', 112997], - [2015, '1262', 112212], - [2014, '1277', 113120], - [2013, '1295', 112920], - [2012, '1389', 124535], - [2011, '1557', 134234], - [2010, '1495', 132251], - [2009, '1499', 128776], - [2008, '1483', 128647], - [2007, '1594', 124900], - [2006, '1608', 121900], - [2005, '1704', 118400], - [2004, '1716', 115000], - [2003, '1624', 112900], - [2002, '1577', 110300], - [2000, '288', 15700]] ----- - -Please note that exact numbers may change slightly, that is okay! Prices and things change. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Test out a `playwright` feature from the https://playwright.dev/python/docs/intro[documentation] that is new to you. This could be anything. One suggestion that could be interesting would be screenshots. As long as you demonstrate _something_ new, you will receive credit for this question. Have fun, and happy thanksgiving! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project13.adoc deleted file mode 100644 index 2219f8119..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-project13.adoc +++ /dev/null @@ -1,58 +0,0 @@ -= TDM 40100: Project 13 -- 2022 - -**Motivation:** It has been a long semester! In this project, we want to give you some flexibility to explore and utilize some of the skills you've previously learned in the course. You will be given 4 options to choose from. Please note that we do not expect perfect submissions, but rather a strong effort in line with a typical project submission. - -**Context:** This is the final project for TDM 40100, where you will choose from 4 options that each exercise some skills from the semester and more. - -**Scope:** Python, sqlite3, playwright, selenium, pandas, matplotlib, and more. - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
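[TIP]
====
If you pick one of the analysis options below, a quick way to get moving is to pull the tables you scraped in projects 11 and 12 out of `houses.db` and into `pandas`. The following is a minimal sketch, assuming the database lives at `$HOME/houses.db` and that your `price_history` table uses column names like `zpid`, `date`, `event`, and `price` -- adjust these to match the `CREATE TABLE` statements you actually wrote.

[source,python]
----
import os
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

# pull the scraped price history out of the database built in project 11
# (column names here are assumptions -- match them to your own schema)
con = sqlite3.connect(os.path.expanduser("~/houses.db"))
prices = pd.read_sql_query("SELECT zpid, date, event, price FROM price_history;", con)
con.close()

# first sanity check: the distribution of "Sold" prices
sold = prices.loc[prices["event"] == "Sold"].dropna(subset=["price"]).copy()
sold["price"] = sold["price"].astype(float)
sold["price"].plot(kind="hist", bins=30, title="Sold prices")
plt.xlabel("price (USD)")
plt.show()
----
====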
- -== Questions - -Choose from one of the following options: - -=== Option 1 - -Use the provided functions and your sqlite skills to scrape and store 1000+ homes in an area of your choice. Use the data you stored in your database to perform an analysis of your choice. Examples of potentially interesting questions you could ask: - - What percentage of homes have "fishy" histories? For example, a home for sale on the market for too long is viewed as "bad". You may notice homes being marked as "sold" and immediately put back on the market. This refreshes Zillow's data and makes it look like the home is new to the market, when in fact it is not. - - For your area, what is the average time on the market before the home is sold? What is the average price drop, and after how many days does the price drop occur? - -=== Option 2 - -Use the provided functions and libraries like `argparse` or https://typer.tiangolo.com/[`typer`] to build a CLI to make zillow queries and display data. Please incorporate at least 1 of the following "extra" features: - - Color your output using `rich`. - - Or containerize your application using https://docs.sylabs.io/guides/3.5/user-guide/build_a_container.html#building-containers-from-singularity-definition-files[singularity]. - - Or use `sqlite3` to cache the HTML blobs -- if the blob for a home or query is not older than 1 day, then use the cached version instead of making a new request. - -=== Option 3 - -Abandon the housing altogether and instead have some FIFA fun. Scrape data from https://fbref.com/en/ and choose from two very similar projects. - - Write `playwright` or `selenium` code to scrape data from https://fbref.com. Scrape 1000+ structured pieces of information and store it in a database to perform an analysis of your choice. Examples could be: - - Can you find any patterns that may indicate promising players under the age of 21 by looking at currently successful players data when they were young? - - What country produces the most talent (by some metric you describe)? - - - Build a CLI to make queries and display data. Please incorporate at least 1 of the following "extra" features: - - Color your output using `rich`. - - Or containerize your application using https://docs.sylabs.io/guides/3.5/user-guide/build_a_container.html#building-containers-from-singularity-definition-files[singularity]. - - Or use `sqlite3` to cache the HTML blobs -- if the blob for a home or query is not older than 1 day, then use the cached version instead of making a new request. - -=== Option 4 - -Have another idea that utilizes the same skillsets? Please post it in Piazza to get approval from Kevin or Dr. Ward. - -.Items to submit -==== -- A markdown cell describing the option(s) you chose to complete for this project and why you chose it/them. -- If you chose to scrape 1000+ bits of data, 2 SQL cells: 1 that demonstrates a sample of your data (for instance 5 rows printed out), and 1 that shows that you've scraped 1000+ records. -- If you chose to scrape 1000+ bits of data, an analysis complete with your problem statement, how you chose to solve the problem, and your code and analysis, with at least 2 included visualizations, and a conclusion. -- If you chose to build a CLI, a markdown cell describing the CLI and how to use it, and the options it has. -- Screenshots demonstrating the capabilities of your CLI and the extra feature(s) you chose to implement. 
-==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-projects.adoc deleted file mode 100644 index fdb4ca147..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/40100/40100-2022-projects.adoc +++ /dev/null @@ -1,41 +0,0 @@ -= TDM 40100 - -== Project links - -[NOTE] -==== -Only the best 10 of 13 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$40100-2022-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:current-projects:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -=== Sign up - -https://piazza.com/purdue/fall2022/tdm40100[https://piazza.com/purdue/fall2022/tdm40100] - -=== Link - -https://piazza.com/purdue/fall2022/tdm40100/home[https://piazza.com/purdue/fall2022/tdm40100/home] - -== Syllabus - -See xref:fall2022/logistics/syllabus.adoc[here]. diff --git a/projects-appendix/modules/ROOT/pages/fall2022/logistics/office_hours.adoc b/projects-appendix/modules/ROOT/pages/fall2022/logistics/office_hours.adoc deleted file mode 100644 index 9fb1c4f44..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/logistics/office_hours.adoc +++ /dev/null @@ -1,34 +0,0 @@ -= Office Hours Fall 2022 - -[IMPORTANT] -==== -Check here to find the most up to date office hour schedule. -==== - -[NOTE] -==== -**Office hours _during_ seminar:** Hillenbrand C141 -- the atrium inside the dining court + -**Office hours _outside_ of seminar, before 5:00 PM EST:** Hillenbrand Lobby C100 -- the lobby between the 2 sets of front entrances + -**Office hours _after_ 5:00 PM EST:** Online in Webex + -**Office hours on the _weekend_:** Online in Webex -==== - -Navigate between tabs to view office hour schedules for each course and find Webex links to online office hours. 
-++++ - -++++ - - -== About the Office Hours in The Data Mine - -During Fall 2022, office hours will be in person in Hillenbrand Hall during popular on-campus hours, and online via Webex during later hours (starting at 5:00PM). Each TA holding an online office hour will have their own WebEx meeting setup, so students will need to click on the appropriate WebEx link to join office hours. In the meeting room, the student and the TA can share screens with each other and have vocal conversations, as well as typed chat conversations. You will need to use the computer audio feature, rather than calling in to the meeting. There is a WebEx app available for your phone, too, but it does not have as many features as the computer version. - -The priority is to have a well-staffed set of office hours that meets student traffic needs. **We aim to have office hours when students need them most.** - -Each online TA meeting will have a maximum of 7 other people able to join at one time. Students should enter the meeting room to ask their question, and when their question is answered, the student should leave the meeting room so that others can have a turn. Students are welcome to re-enter the meeting room when they have another question. If a TA meeting room is full, please wait a few minutes to try again, or try a different TA who has office hours at the same time. - -Students can also use Piazza to ask questions. The TAs will be monitoring Piazza during their office hours. TAs should try and help all students, regardless of course. If a TA is unable to help a student resolve an issue, the TA might help the student to identify an office hour with a TA that can help, or encourage the student to post in Piazza. - -The weekly projects are due on Friday evenings at 11:55 PM through Gradescope in Brightspace. All the seminar times are on Mondays. New projects are released on Thursdays, so students have 8 days to work on each project. - -All times listed are Purdue time (Eastern). diff --git a/projects-appendix/modules/ROOT/pages/fall2022/logistics/schedule.adoc b/projects-appendix/modules/ROOT/pages/fall2022/logistics/schedule.adoc deleted file mode 100644 index 290329188..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/logistics/schedule.adoc +++ /dev/null @@ -1,118 +0,0 @@ -= Fall 2022 Course Schedule - Seminar - -Assignment due dates are listed in *BOLD*. Other dates are important notes. - -*Remember, only your top 10 out of 13 project scores are factored into your final grade. 
- -[cols="^.^1,^.^3,<.^12"] -|=== - -|*Week* |*Date* ^.|*Activity* - -|1 -|8/22 - 8/26 -|Monday, 8/22: First day of fall 2022 classes - - - -|2 -|8/29 - 9/9 -| -*Project #1 due on Gradescope by 11:59 PM ET on Friday, 9/9* - -*Syllabus Quiz due on Gradescope by 11:59 PM ET on Friday, 9/9* - -*Academic Integrity Quiz due on Gradescope by 11:59 PM ET on Friday, 9/9* - -*Project #2 due on Gradescope by 11:59 PM ET on Friday, 9/9* - - -|3 -|9/5 - 9/9 -|Monday, 9/5: Labor Day, no classes - - - -|4 -|9/12 - 9/16 -| -*Project #3 due on Gradescope by 11:59 PM ET on Friday, 9/16* - - - -|5 -|9/19 - 9/23 -| -*Project #4 due on Gradescope by 11:59 PM ET on Friday, 9/23* -*Outside Event #1 due on Gradescope by 11:59 PM ET on Friday, 9/23* - - -|6 -|9/26 - 9/30 -| *Project #5 due on Gradescope by 11:59 PM ET on Friday, 9/30* - - -|7 -|10/3 - 10/7 -|*Project #6 due on Gradescope by 11:59 PM ET on Friday, 10/7* - - -|8 -|10/10 - 10/14 -|Monday & Tuesday, 10/10 - 10/11 October Break - -|9 -|10/17 - 10/21 -| -*Project #7 due on Gradescope by 11:59 PM ET on Friday, 10/21* - -*Outside Event #2 due on Gradescope by 11:59 PM ET on Friday, 10/21* - -|10 -|10/24 - 10/28 -| -*Project #8 due on Gradescope by 11:59 PM ET on Friday, 10/28* - -|11 -|10/31 - 11/4 -| -*Project #9 due on Gradescope by 11:59 PM ET on Friday, 11/4* - -|12 -|11/7 - 11/11 -| -*Project #10 due on Gradescope by 11:59 PM ET on Friday, 11/11* - - -|13 -|11/14 - 11/18 -| -*Project #11 due on Gradescope by 11:59 PM ET on Friday, 11/18* -*Outside Event #3 due on Gradescope by 11:59 PM ET on Friday, 11/18* - -|14 -|11/21 - 11/25 -|Wednesday - Friday, 11/23 - 11/25: Thanksgiving Break - - -|15 -|11/28 - 12/2 -| -*Project #12 due on Gradescope by 11:59 PM ET on Friday, 12/2* - -|16 -|12/5 - 12/9 -| -*Project #13 due on Gradescope by 11:59 PM ET on Friday, 12/9* - -| -|12/12 - 12/16 -|Final Exam Week - There are no final exams in The Data Mine. - - -| -|12/20 -|Tuesday, 12/20: Fall 2022 grades are submitted to Registrar's Office by 5 PM Eastern - - -|=== diff --git a/projects-appendix/modules/ROOT/pages/fall2022/logistics/syllabus.adoc b/projects-appendix/modules/ROOT/pages/fall2022/logistics/syllabus.adoc deleted file mode 100644 index 08648bc66..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/logistics/syllabus.adoc +++ /dev/null @@ -1,269 +0,0 @@ -= Fall 2022 Syllabus - The Data Mine Seminar - -== Course Information - - -[%header,format=csv,stripes=even] -|=== -Course Number and Title, CRN -TDM 10100 - The Data Mine I, possible CRNs are 12067 or 12072 or 12073 or 12071 -TDM 20100 - The Data Mine III, possible CRNs are 12117 or 12106 or 12113 or 12118 -TDM 30100 - The Data Mine V, possible CRNs are 12104 or 12112 or 12115 or 12120 -TDM 40100 - The Data Mine VII, possible CRNs are 12103 or 12111 or 12114 or 12119 -TDM 50100 - The Data Mine Seminar, CRN 15644 -|=== - -*Course credit hours:* 1 credit hour, so you should expect to spend about 3 hours per week doing work -for the class - -*Prerequisites:* -None for TDM 10100. All students, regardless of background are welcome. Typically, students new to The Data Mine sign up for TDM 10100, students in their second, third, or fourth years of The Data Mine sign up for TDM 20100, TDM 30100, and TDM 40100, respectively. TDM 50100 is geared toward graduate students. 
However, during the first week of the semester (only), if a student new to The Data Mine has several years of data science experience and would prefer to switch from TDM 10100 to TDM 20100, we can make adjustments on an individual basis. - -=== Course Web Pages - -- link:https://the-examples-book.com/[*The Examples Book*] - All information will be posted within -- link:https://www.gradescope.com/[*Gradescope*] - All projects and outside events will be submitted on Gradescope -- link:https://purdue.brightspace.com/[*Brightspace*] - Grades will be posted in Brightspace. -- link:https://datamine.purdue.edu[*The Data Mine's website*] - helpful resource -- link:https://ondemand.anvil.rcac.purdue.edu/[*Jupyter Lab via the On Demand Gateway on Anvil*] - -=== Meeting Times -There are officially 4 Monday class times: 8:30 am, 9:30 am, 10:30 am (all in the Hillenbrand Dining Court atrium--no meal swipe required), and 4:30 pm (link:https://purdue-edu.zoom.us/my/mdward[synchronous online], recorded and posted later). Attendance is not required. - -All the information you need to work on the projects each week will be provided online on the Thursday of the previous week, and we encourage you to get a head start on the projects before class time. Dr. Ward does not lecture during the class meetings, but this is a good time to ask questions and get help from Dr. Ward, the T.A.s, and your classmates. The T.A.s will have many daytime and evening office hours throughout the week. - -=== Course Description - -The Data Mine is a supportive environment for students in any major from any background who want to learn some data science skills. Students will have hands-on experience with computational tools for representing, extracting, manipulating, interpreting, transforming, and visualizing data, especially big data sets, and in effectively communicating insights about data. Topics include: the R environment, Python, visualizing data, UNIX, bash, regular expressions, SQL, XML and scraping data from the internet, as well as selected advanced topics, as time permits. - -=== Learning Outcomes - -By the end of the course, you will be able to: - -1. Discover data science and professional development opportunities in order to prepare for a career. -2. Explain the difference between research computing and basic personal computing data science capabilities in order to know which system is appropriate for a data science project. -3. Design efficient search strategies in order to acquire new data science skills. -4. Devise the most appropriate data science strategy in order to answer a research question. -5. Apply data science techniques in order to answer a research question about a big data set. - -=== Required Materials - -* A laptop so that you can easily work with others. Having audio/video capabilities is useful. -* Brightspace and Gradescope course pages. -* Access to Jupyter Lab at the On Demand Gateway on Anvil: -https://ondemand.anvil.rcac.purdue.edu/ -* "The Examples Book": https://the-examples-book.com -* Good internet connection. - -=== Attendance Policy - -Attendance is not required. - -When conflicts or absences can be anticipated, such as for many University-sponsored activities and religious observations, the student should inform the instructor of the situation as far in advance as possible. - -For unanticipated or emergency absences when advance notification to the instructor is not possible, the student should contact the instructor or TA as soon as possible by email or phone. 
When the student is unable to make direct contact with the instructor and is unable to leave word with the instructor’s department because of circumstances beyond the student’s control, and in cases falling under excused absence regulations, the student or the student’s representative should contact or go to the Office of the Dean of Students website to complete appropriate forms for instructor notification. Under academic regulations, excused absences may be granted for cases of grief/bereavement, military service, jury duty, parenting leave, and medical excuse. For details, see the link:https://catalog.purdue.edu/content.php?catoid=13&navoid=15965#a-attendance[Academic Regulations & Student Conduct section] of the University Catalog website. - -== How to succeed in this course - -If you would like to be a successful Data Mine student: - -* Start on the weekly projects on or before Mondays so that you have plenty of time to get help from your classmates, TAs, and Data Mine staff. Don't wait until the due date to start! -* Be excited to challenge yourself and learn impressive new skills. Don't get discouraged if something is difficult--you're here because you want to learn, not because you already know everything! -* Remember that Data Mine staff and TAs are excited to work with you! Take advantage of us as resources. -* Network! Get to know your classmates, even if you don't see them in an actual classroom. You are all part of The Data Mine because you share interests and goals. You have over 800 potential new friends! -* Use "The Examples Book" with lots of explanations and examples to get you started. Google, Stack Overflow, etc. are all great, but "The Examples Book" has been carefully put together to be the most useful to you. https://the-examples-book.com -* Expect to spend approximately 3 hours per week on the projects. Some might take less time, and occasionally some might take more. -* Don't forget about the syllabus quiz, academic integrity quiz, and outside event reflections. They all contribute to your grade and are part of the course for a reason. -* If you get behind or feel overwhelmed about this course or anything else, please talk to us! -* Stay on top of deadlines. Announcements will also be sent out every Monday morning, but you -should keep a copy of the course schedule where you see it easily. -* Read your emails! - -== Information about the Instructors - -=== The Data Mine Staff - -[%header,format=csv] -|=== -Name, Title -Shared email we all read, datamine-help@purdue.edu -Kevin Amstutz, Senior Data Scientist and Instruction Specialist -Maggie Betz, Managing Director of Corporate Partnerships -Shuennhau Chang, Corporate Partners Senior Manager -David Glass, Managing Director of Data Science -Kali Lacy, Associate Research Engineer -Naomi Mersinger, ASL Interpreter / Strategic Initiatives Coordinator -Kim Rechkemmer, Senior Program Administration Specialist -Nick Rosenorn, Corporate Partners Technical Specialist -Katie Sanders, Operations Manager -Rebecca Sharples, Managing Director of Academic Programs & Outreach -Dr. Mark Daniel Ward, Director - -|=== - -The Data Mine Team uses a shared email which functions as a ticketing system. Using a shared email helps the team manage the influx of questions, better distribute questions across the team, and send out faster responses. 
- -*For the purposes of getting help with this 1-credit seminar class, your most important people are:* - -* *T.A.s*: Visit their xref:fall2022/logistics/office_hours.adoc[office hours] and use the link:https://piazza.com/[Piazza site] -* *Mr. Kevin Amstutz*, Senior Data Scientist and Instruction Specialist - Piazza is preferred method of questions -* *Dr. Mark Daniel Ward*, Director: Dr. Ward responds to questions on Piazza faster than by email - - -=== Communication Guidance - -* *For questions about how to do the homework, use Piazza or visit office hours*. You will receive the fastest response by using Piazza versus emailing us. -* For general Data Mine questions, email datamine-help@purdue.edu -* For regrade requests, use Gradescope's regrade feature within Brightspace. Regrades should be -requested within 1 week of the grade being posted. - - -=== Office Hours - -The xref:fall2022/logistics/office_hours.adoc[office hours schedule is posted here.] - -Office hours are held in person in Hillenbrand lobby and on Zoom. Check the schedule to see the available schedule. - -Piazza is an online discussion board where students can post questions at any time, and Data Mine staff or T.A.s will respond. Piazza is available through Brightspace. There are private and public postings. Last year we had over 11,000 interactions on Piazza, and the typical response time was around 5-10 minutes! - - -== Assignments and Grades - - -=== Course Schedule & Due Dates - -xref:fall2022/logistics/schedule.adoc[Click here to view the Fall 2022 Course Schedule] - -See the schedule and later parts of the syllabus for more details, but here is an overview of how the course works: - -In the first week of the beginning of the semester, you will have some "housekeeping" tasks to do, which include taking the Syllabus quiz and Academic Integrity quiz. - -Generally, every week from the very beginning of the semester, you will have your new projects released on a Thursday, and they are due 8 days later on the Friday at 11:55 pm Purdue West Lafayette (Eastern) time. You will need to do 3 Outside Event reflections. - -We will have 13 weekly projects available, but we only count your best 10. This means you could miss up to 3 projects due to illness or other reasons, and it won't hurt your grade. We suggest trying to do as many projects as possible so that you can keep up with the material. The projects are much less stressful if they aren't done at the last minute, and it is possible that our systems will be stressed if you wait until Friday night causing unexpected behavior and long wait times. Try to start your projects on or before Monday each week to leave yourself time to ask questions. - -The Data Mine does not conduct or collect an assessment during the final exam period. Therefore, TDM Courses are not required to follow the Quiet Period in the link:https://catalog.purdue.edu/content.php?catoid=15&navoid=18634#academic-calendar[Academic Calendar]. - -[cols="4,1"] -|=== - -|Projects (best 10 out of Projects #1-13) |86% -|Outside event reflections (3 total) |12% -|Academic Integrity Quiz |1% -|Syllabus Quiz |1% -|*Total* |*100%* - -|=== - - -=== Grading Scale -In this class grades reflect your achievement throughout the semester in the various course components listed above. Your grades will be maintained in Brightspace. This course will follow the 90-80-70-60 grading scale for A, B, C, D cut-offs. If you earn a 90.000 in the class, for example, that is a solid A. 
+/- grades will be given at the instructor's discretion below these cut-offs. If you earn an 89.11 in the class, for example, this may be an A- or a B+. - -* A: 100.000% - 90.000% -* B: 89.999% - 80.000% -* C: 79.999% - 70.000% -* D: 69.999% - 60.000% -* F: 59.999% - 0.000% - - -=== Late Policy - -We generally do NOT accept late work. For the projects, we count only your best 10 out of 13, so that gives you a lot of flexibility. We need to be able to post answer keys for the rest of the class in a timely manner, and we can't do this if we are waiting for other students to turn their work in. - - -=== Projects - -* The projects will help you achieve Learning Outcomes #2-5. -* Each weekly programming project is worth 10 points. -* There will be 13 projects available over the semester, and your best 10 will count. -* The 3 project grades that are dropped could be from illnesses, absences, travel, family -emergencies, or simply low scores. No excuses necessary. -* No late work will be accepted, even if you are having technical difficulties, so do not work at the -last minute. -* There are many opportunities to get help throughout the week, either through Piazza or office -hours. We're waiting for you! Ask questions! -* Follow the instructions for how to submit your projects properly through Gradescope in -Brightspace. -* It is ok to get help from others or online, although it is important to document this help in the -comment sections of your project submission. You need to say who helped you and how they -helped you. -* Each week, the project will be posted on the Thursday before the seminar, the project will be -the topic of the seminar and any office hours that week, and then the project will be due by -11:55 pm Eastern time on the following Friday. See the schedule for specific dates. -* If you need to request a regrade on any part of your project, use the regrade request feature -inside Gradescope. The regrade request needs to be submitted within one week of the grade being posted (we send an announcement about this). - - -=== Outside Event Reflections - -* The Outside Event reflections will help you achieve Learning Outcome #1. They are an opportunity for you to learn more about data science applications, career development, and diversity. -* You are required to attend 3 of these over the semester, with 1 due each month. See the schedule for specific due dates. Feel free to complete them early. -** Outside Event Reflections *must* be submitted within 1 week of attending the event or watching the recording. -** At least one of these events should by on the topic of Professional Development (designated by "PD" on the schedule) -* Find outside events posted on The Data Mine's website (https://datamine.purdue.edu/events/) and updated frequently. Let us know about any good events you hear about. -* Format of Outside Events: -** Often in person so you can interact with the presenter! -** Occasionally online and possibly recorded -* Follow the instructions in Gradescope for writing and submitting these reflections. -*** Name of the event and speaker -*** The time and date of the event -*** What was discussed at the event -*** What you learned from it -*** What new ideas you would like to explore as a result of what you learned at the event -*** AND what question(s) you would like to ask the presenter if you met them at an after-presentation reception. -* We read every single reflection! We care about what you write! 
We have used these connections to provide new opportunities for you, to thank our speakers, and to learn more about what interests you. - -=== Academic Integrity - -Academic integrity is one of the highest values that Purdue University holds. Individuals are encouraged to alert university officials to potential breaches of this value by either link:mailto:integrity@purdue.edu[emailing] or by calling 765-494-8778. While information may be submitted anonymously, the more information that is submitted provides the greatest opportunity for the university to investigate the concern. - -In TDM 10100/20100/30100/40100/50100, we encourage students to work together. However, there is a difference between good collaboration and academic misconduct. We expect you to read over this list, and you will be held responsible for violating these rules. We are serious about protecting the hard-working students in this course. We want a grade for The Data Mine seminar to have value for everyone and to represent what you truly know. We may punish both the student who cheats and the student who allows or enables another student to cheat. Punishment could include receiving a 0 on a project, receiving an F for the course, and incidents of academic misconduct reported to the Office of The Dean of Students. - -*Good Collaboration:* - -* First try the project yourself, on your own. -* After trying the project yourself, then get together with a small group of other students who -have also tried the project themselves to discuss ideas for how to do the more difficult problems. Document in the comments section any suggestions you took from your classmates or your TA. -* Finish the project on your own so that what you turn in truly represents your own understanding of the material. -* Look up potential solutions for how to do part of the project online, but document in the comments section where you found the information. -* If the assignment involves writing a long, worded explanation, you may proofread somebody's completed written work and allow them to proofread your work. Do this only after you have both completed your own assignments, though. - -*Academic Misconduct:* - -* Divide up the problems among a group. (You do #1, I'll do #2, and he'll do #3: then we'll share our work to get the assignment done more quickly.) -* Attend a group work session without having first worked all of the problems yourself. -* Allowing your partners to do all of the work while you copy answers down, or allowing an -unprepared partner to copy your answers. -* Letting another student copy your work or doing the work for them. -* Sharing files or typing on somebody else's computer or in their computing account. -* Getting help from a classmate or a TA without documenting that help in the comments section. -* Looking up a potential solution online without documenting that help in the comments section. -* Reading someone else's answers before you have completed your work. -* Have a tutor or TA work though all (or some) of your problems for you. -* Uploading, downloading, or using old course materials from Course Hero, Chegg, or similar sites. -* Using the same outside event reflection (or parts of it) more than once. Using an outside event reflection from a previous semester. -* Using somebody else's outside event reflection rather than attending the event yourself. 
- -The link:https://www.purdue.edu/odos/osrr/honor-pledge/about.html[Purdue Honor Pledge] "As a boilermaker pursuing academic excellence, I pledge to be honest and true in all that I do. Accountable together - we are Purdue" - -Please refer to the link:https://www.purdue.edu/odos/osrr/academic-integrity/index.html[student guide for academic integrity] for more details. - -=== Disclaimer -This syllabus is subject to change. Changes will be made by an announcement in Brightspace and the corresponding course content will be updated. - -== xref:fall2022/logistics/syllabus_purdue_policies.adoc[Purdue Policies & Resources] -Includes: - -* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Academic Guidance in the Event a Student is Quarantined/Isolated[Academic Guidance in the Event a Student is Quarantined/Isolated] -* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Class Behavior[Class Behavior] -* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Nondiscrimination Statement[Nondiscrimination Statement] -* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Students with Disabilities[Students with Disabilities] -* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Mental Health Resources[Mental Health Resources] -* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Violent Behavior Policy[Violent Behavior Policy] -* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Diversity and Inclusion Statement[Diversity and Inclusion Statement] -* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Basic Needs Security Resources[Basic Needs Security Resources] -* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Course Evaluation[Course Evaluation] -* xref:fall2022/logistics/syllabus_purdue_policies.adoc#General Classroom Guidance Regarding Protect Purdue[General Classroom Guidance Regarding Protect Purdue] -* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Campus Emergencies[Campus Emergencies] -* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Illness and other student emergencies[Illness and other student emergencies] -* xref:fall2022/logistics/syllabus_purdue_policies.adoc#Disclaimer[Disclaimer] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2022/logistics/syllabus_purdue_policies.adoc b/projects-appendix/modules/ROOT/pages/fall2022/logistics/syllabus_purdue_policies.adoc deleted file mode 100644 index ed37cd3d7..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/logistics/syllabus_purdue_policies.adoc +++ /dev/null @@ -1,75 +0,0 @@ -== Purdue Policies & Resources - -=== Academic Guidance in the Event a Student is Quarantined/Isolated -While everything we are doing in The Data Mine this semester can be done online, rather than in person, and no part of your seminar grade comes from attendance, we want to remind you of general campus attendance policies during COVID-19. Students should stay home and contact the Protect Purdue Health Center (496-INFO) if they feel ill, have any symptoms associated with COVID-19, or suspect they have been exposed to the virus. In the current context of COVID-19, in-person attendance will not be a factor in the final grades, but the student still needs to inform the instructor of any conflict that can be anticipated and will affect the submission of an assignment. Only the instructor can excuse a student from a course requirement or responsibility. 
When conflicts can be anticipated, such as for many University-sponsored activities and religious observations, the student should inform the instructor of the situation as far in advance as possible. For unanticipated or emergency conflict, when advance notification to an instructor is not possible, the student should contact the instructor as soon as possible by email or by phone. When the student is unable to make direct contact with the instructor and is unable to leave word with the instructor's department because of circumstances beyond the student's control, and in cases of bereavement, quarantine, or isolation, the student or the student's representative should contact the Office of the Dean of Students via email or phone at 765-494-1747. Below are links on Attendance and Grief Absence policies under the University Policies menu. - -If you must miss class at any point in time during the semester, please reach out to me via email so that we can communicate about how you can maintain your academic progress. If you find yourself too sick to progress in the course, notify your adviser and notify me via email or Brightspace. We will make arrangements based on your particular situation. Please note the link:https://protect.purdue.edu/updates/video-update-protect-purdue-fall-expectations/[Protect Purdue fall 2022 expectations] announced on the Protect Purdue website. - -=== Class Behavior - -You are expected to behave in a way that promotes a welcoming, inclusive, productive learning environment. You need to be prepared for your individual and group work each week, and you need to include everybody in your group in any discussions. Respond promptly to all communications and show up for any appointments that are scheduled. If your group is having trouble working well together, try hard to talk through the difficulties--this is an important skill to have for future professional experiences. If you are still having difficulties, ask The Data Mine staff to meet with your group. - - -*Purdue's Copyrighted Materials Policy:* - -Among the materials that may be protected by copyright law are the lectures, notes, and other material presented in class or as part of the course. Always assume the materials presented by an instructor are protected by copyright unless the instructor has stated otherwise. Students enrolled in, and authorized visitors to, Purdue University courses are permitted to take notes, which they may use for individual/group study or for other non-commercial purposes reasonably arising from enrollment in the course or the University generally. -Notes taken in class are, however, generally considered to be "derivative works" of the instructor's presentations and materials, and they are thus subject to the instructor's copyright in such presentations and materials. No individual is permitted to sell or otherwise barter notes, either to other students or to any commercial concern, for a course without the express written permission of the course instructor. To obtain permission to sell or barter notes, the individual wishing to sell or barter the notes must be registered in the course or must be an approved visitor to the class. Course instructors may choose to grant or not grant such permission at their own discretion, and may require a review of the notes prior to their being sold or bartered. If they do grant such permission, they may revoke it at any time, if they so choose. 
- -=== Nondiscrimination Statement -Purdue University is committed to maintaining a community which recognizes and values the inherent worth and dignity of every person; fosters tolerance, sensitivity, understanding, and mutual respect among its members; and encourages each individual to strive to reach his or her own potential. In pursuit of its goal of academic excellence, the University seeks to develop and nurture diversity. The University believes that diversity among its many members strengthens the institution, stimulates creativity, promotes the exchange of ideas, and enriches campus life. link:https://www.purdue.edu/purdue/ea_eou_statement.php[Link to Purdue's nondiscrimination policy statement.] - -=== Students with Disabilities -Purdue University strives to make learning experiences as accessible as possible. If you anticipate or experience physical or academic barriers based on disability, you are welcome to let me know so that we can discuss options. You are also encouraged to contact the Disability Resource Center at: link:mailto:drc@purdue.edu[drc@purdue.edu] or by phone: 765-494-1247. - -If you have been certified by the Office of the Dean of Students as someone needing a course adaptation or accommodation because of a disability OR if you need special arrangements in case the building must be evacuated, please contact The Data Mine staff during the first week of classes. We are happy to help you. - -=== Mental Health Resources - -* *If you find yourself beginning to feel some stress, anxiety and/or feeling slightly overwhelmed,* try link:https://purdue.welltrack.com/[WellTrack]. Sign in and find information and tools at your fingertips, available to you at any time. -* *If you need support and information about options and resources*, please contact or see the link:https://www.purdue.edu/odos/[Office of the Dean of Students]. Call 765-494-1747. Hours of operation are M-F, 8 am- 5 pm. -* *If you find yourself struggling to find a healthy balance between academics, social life, stress*, etc. sign up for free one-on-one virtual or in-person sessions with a link:https://www.purdue.edu/recwell/fitness-wellness/wellness/one-on-one-coaching/wellness-coaching.php[Purdue Wellness Coach at RecWell]. Student coaches can help you navigate through barriers and challenges toward your goals throughout the semester. Sign up is completely free and can be done on BoilerConnect. If you have any questions, please contact Purdue Wellness at evans240@purdue.edu. -* *If you're struggling and need mental health services:* Purdue University is committed to advancing the mental health and well-being of its students. If you or someone you know is feeling overwhelmed, depressed, and/or in need of mental health support, services are available. For help, such individuals should contact link:https://www.purdue.edu/caps/[Counseling and Psychological Services (CAPS)] at 765-494-6995 during and after hours, on weekends and holidays, or by going to the CAPS office of the second floor of the Purdue University Student Health Center (PUSH) during business hours. - -=== Violent Behavior Policy - -Purdue University is committed to providing a safe and secure campus environment for members of the university community. Purdue strives to create an educational environment for students and a work environment for employees that promote educational and career goals. Violent Behavior impedes such goals. 
Therefore, Violent Behavior is prohibited in or on any University Facility or while participating in any university activity. See the link:https://www.purdue.edu/policies/facilities-safety/iva3.html[University's full violent behavior policy] for more detail. - -=== Diversity and Inclusion Statement - -In our discussions, structured and unstructured, we will explore a variety of challenging issues, which can help us enhance our understanding of different experiences and perspectives. This can be challenging, but in overcoming these challenges we find the greatest rewards. While we will design guidelines as a group, everyone should remember the following points: - -* We are all in the process of learning about others and their experiences. Please speak with me, anonymously if needed, if something has made you uncomfortable. -* Intention and impact are not always aligned, and we should respect the impact something may have on someone even if it was not the speaker's intention. -* We all come to the class with a variety of experiences and a range of expertise, we should respect these in others while critically examining them in ourselves. - -=== Basic Needs Security Resources - -Any student who faces challenges securing their food or housing and believes this may affect their performance in the course is urged to contact the Dean of Students for support. There is no appointment needed and Student Support Services is available to serve students from 8:00 - 5:00, Monday through Friday. The link:https://www.purdue.edu/vpsl/leadership/About/ACE_Campus_Pantry.html[ACE Campus Food Pantry] is open to the entire Purdue community). - -Considering the significant disruptions caused by the current global crisis as it related to COVID-19, students may submit requests for emergency assistance from the link:https://www.purdue.edu/odos/resources/critical-need-fund.html[Critical Needs Fund]. - -=== Course Evaluation - -During the last two weeks of the semester, you will be provided with an opportunity to give anonymous feedback on this course and your instructor. Purdue uses an online course evaluation system. You will receive an official email from evaluation administrators with a link to the online evaluation site. You will have up to 10 days to complete this evaluation. Your participation is an integral part of this course, and your feedback is vital to improving education at Purdue University. I strongly urge you to participate in the evaluation system. - -You may email feedback to us anytime at link:mailto:datamine-help@purdue.edu[datamine-help@purdue.edu]. We take feedback from our students seriously, as we want to create the best learning experience for you! - -=== General Classroom Guidance Regarding Protect Purdue - -Any student who has substantial reason to believe that another person is threatening the safety of others by not complying with Protect Purdue protocols is encouraged to report the behavior to and discuss the next steps with their instructor. Students also have the option of reporting the behavior to the link:https://purdue.edu/odos/osrr/[Office of the Student Rights and Responsibilities]. See also link:https://catalog.purdue.edu/content.php?catoid=7&navoid=2852#purdue-university-bill-of-student-rights[Purdue University Bill of Student Rights] and the Violent Behavior Policy under University Resources in Brightspace. 
- -=== Campus Emergencies - -In the event of a major campus emergency, course requirements, deadlines and grading percentages are subject to changes that may be necessitated by a revised semester calendar or other circumstances. Here are ways to get information about changes in this course: - -* Brightspace or by e-mail from Data Mine staff. -* General information about a campus emergency can be found on the Purdue website: xref:www.purdue.edu[]. - - -=== Illness and other student emergencies - -Students with *extended* illnesses should contact their instructor as soon as possible so that arrangements can be made for keeping up with the course. Extended absences/illnesses/emergencies should also go through the Office of the Dean of Students. - -=== Disclaimer -This syllabus is subject to change. Changes will be made by an announcement in Brightspace and the corresponding course content will be updated. - diff --git a/projects-appendix/modules/ROOT/pages/fall2022/logistics/ta_schedule.adoc b/projects-appendix/modules/ROOT/pages/fall2022/logistics/ta_schedule.adoc deleted file mode 100644 index db47547ed..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2022/logistics/ta_schedule.adoc +++ /dev/null @@ -1,6 +0,0 @@ -= Seminar TA Fall 2022 Schedule - -++++ - -++++ \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project01.adoc deleted file mode 100644 index e7e5bf012..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project01.adoc +++ /dev/null @@ -1,409 +0,0 @@ -= TDM 10100: Project 1 -- 2023 - -**Motivation:** In this project we are going to jump head first into The Data Mine. We will load datasets into the R environment, and introduce some core programming concepts like variables, vectors, types, etc. As we will be "living" primarily in an IDE called Jupyter Lab, we will take some time to learn how to connect to it, configure it, and run code. - -[NOTE] -==== -IDE stands for Integrated Developer Environment: software that helps us program cleanly and efficiently. -==== - -**Context:** This is our first project as a part of The Data Mine. We will get situated, configure the environment we will be using throughout our time with The Data Mine, and jump straight into working with data! - -**Scope:** R, Jupyter Lab, Anvil - -.Learning Objectives -**** -- Read about and understand computational resources available to you. -- Learn how to run R code in Jupyter Lab on Anvil. -- Read and write basic (.csv) data using R. -**** - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/1991.csv` -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` -- `/anvil/projects/tdm/data/disney/flight_of_passage.csv` - -== Setting Up to Work - -++++ - -++++ - - -This year we will be using Jupyter Lab on the Anvil cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster. - -Navigate and login to https://ondemand.anvil.rcac.purdue.edu using your ACCESS credentials (including 2-factor authentication using Duo Mobile). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward. 
- -[TIP] -==== -If you did not (yet) set up your 2-factor authentication credentials with Duo, you can set up the credentials here: https://the-examples-book.com/starter-guides/anvil/access-setup -==== - -Towards the middle of the top menu, click on the item labeled btn:[My Interactive Sessions]. (Depending on the size of your browser window, there might only be an icon; it is immediately to the right of the menu item for The Data Mine.) On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Bioinformatics, Interactive Apps, and The Data Mine. Under "The Data Mine" section, near the bottom of your screen, click on btn:[Jupyter Notebook]. (Make sure that you choose the Jupyter Notebook from "The Data Mine" section.) - -If everything was successful, you should see a screen similar to the following. - -image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"] - -Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 1 CPU core and 1918 MB of memory. - -[NOTE] -==== -As you can see in the screenshot above, each core is associated with 1918 MB of memory. If you know how much memory your project will need, you can use this value to choose how many cores you want. In this and most of the other projects in this class, 1-2 cores is generally enough. -==== - -[NOTE] -==== -Please use 4 cores for this project. This is _almost always_ excessive, but for this project in question 3 you will be reading in a rather large dataset that will very likely crash your kernel without at least 3-4 cores. -==== - -We use the Anvil cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine. - -After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on this button to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following. - -image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"] - -There are 2 primary options that you will need to know about. - -seminar:: -The `seminar` kernel runs Python code but also has the ability to run R code or SQL queries in the same environment. - -[TIP] -==== -To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page]. -==== - -seminar-r:: -The `seminar-r` kernel is intended for projects that **only** use R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell. - -For now, let's focus on the `seminar` kernel. Click on btn:[seminar], and a fresh notebook will be created for you. - - -The first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`). - -Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`. 
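If you prefer to make the copy with code rather than with the Jupyter file browser, one possible way to do it is with base R's `file.copy()` (a small sketch only; the destination filename below is just a placeholder for your own name):

[source,r]
----
%%R

# copy the shared template from Anvil into your home directory under a new name
file.copy("/anvil/projects/tdm/etc/project_template.ipynb",
          "~/firstname-lastname-project01.ipynb")
----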
- -There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain comments about your work). - -Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A. - -[TIP] -==== -Make sure to read about and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -==== - - -== Questions - -=== Question 1 (1 pt) -[upperalpha] -.. How many cores and how much memory (in GB) does Anvil's sub-cluster A have? (0.5 pts) -.. How many cores and how much memory (in GB) does your personal computer have? (0.5 pts) - -++++ - -++++ - - -For this course, projects will be solved using the https://www.rcac.purdue.edu/compute/anvil[Anvil computing cluster]. - -Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (RAM). Use the information on the provided webpages to manually calculate how many cores and how much memory is available for Anvil's "sub-cluster A". - -Take a minute and figure out how many cores and how much memory is available on your own computer. If you do not have a computer of your own, work with a friend to see how many cores there are, and how much memory is available, on their computer. - -[TIP] -==== -Information about the core and memory capacity of Anvil "sub-clusters" can be found https://www.rcac.purdue.edu/compute/anvil[here]. - -Information about the core and memory capacity of your computer is typically found in the "About this PC" section of your computer's settings. -==== - -.Items to submit -==== -- A sentence (in a markdown cell) explaining how many cores and how much memory is available to Anvil sub-cluster A. -- A sentence (in a markdown cell) explaining how many cores and how much memory is available, in total, for your own computer. -==== - -=== Question 2 (2 pts) -[upperalpha] -.. Using Python, what is the name of the node on Anvil you are running on? -.. Using Bash, what is the name of the node on Anvil you are running on? -.. Using R, what is the name of the node on Anvil you are running on? - -++++ - -++++ - -Our next step will be to test out our connection to the Anvil Computing Cluster! Run the following code snippets in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on (in three different languages!). What is the name of the node on Anvil that you are running on? - -[source,python] ----- -import socket -print(socket.gethostname()) ----- - -[source,r] ----- -%%R - -system("hostname", intern=TRUE) ----- - -[source,bash] ----- -%%bash - -hostname ----- - -[TIP] -==== -To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu. -==== - -Check the results of each code snippet to ensure they all return the same hostname. Do they match? You may notice that `R` prints some extra "junk" output, while `bash` and `Python` do not. This is nothing to be concerned about. (Different languages have different types of output.) - -.Items to submit -==== -- Code used to solve this problem, along with the output of running that code. -==== - -=== Question 3 (2 pts) -[upperalpha] -.. 
Run each of the example code snippets below, and include them and their output in your submission to get credit for this question. - -++++ - -++++ - - -[TIP] -==== -Remember, in the upper right-hand corner of your notebook you will see the current kernel for the notebook, `seminar`. If you click on this name you will have the option to swap kernels out -- no need to do this now, but it is good to know! -==== - -In this course, we will be using Jupyter Lab with multiple different languages. Often, we will center a project around a specific language and choose the kernel for that language appropriately, but occasionally we may need to run a language in a kernel other than the one it is primarily built for. The solution to this is using line magic! - -Line magic tells our code interpreter that we are using a language other than the default for our kernel (i.e., the `seminar` kernel we are currently using is expecting Python code, but we can tell it to expect R code instead). - -Line magic works by having the very first line in a code cell formatted like so: - -`%%language` - -Where `language` is the language we want to use. For example, if we wanted to run R code in our `seminar` kernel, we would use the following line magic: - -`%%R` - -Practice running the following examples, which include line magic where needed. - -python:: -[source,python] ----- -import pandas as pd -df = pd.read_csv('/anvil/projects/tdm/data/flights/subset/1991.csv') ----- - -[source,python] ----- -df[df["Month"]==12].head() # see information about a few of the flights from December 1991 ----- - -SQL:: -[source, ipython] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -[source, sql] ----- -%%sql - --- see information about a few TV episodes called "Finale" -SELECT * -FROM episodes AS e -INNER JOIN titles AS t -ON t.title_id = e.episode_title_id -WHERE t.primary_title = 'Finale' -LIMIT 5; ----- - -bash:: -[source,bash] ----- -%%bash - -names="John Doe;Bill Withers;Arthur Morgan;Mary Jane;Rick Ross;John Marston" -echo $names | cut -d ';' -f 3 -echo $names | cut -d ';' -f 6 ----- - -[NOTE] -==== -In the above examples you will see lines such as `%%R` or `%%sql`. These are called "Line Magic". They allow you to run non-Python code in the `seminar` kernel. In order for line magic to work, it MUST be on the first line of the code cell it is being used in (before any comments or any code in that cell). - -In the future, you will likely stick to using the kernel that matches the project language, but we wanted you to have a demonstration about "line magic" in Project 1. Line magic is a handy trick to know! - -To learn more about how to run various types of code using the `seminar` kernel, see https://the-examples-book.com/projects/templates[our template page]. -==== - -.Items to submit -==== -- Code from the examples above, and the outputs produced by running that code. -==== - -=== Question 4 (1 pt) -[upperalpha] -.. Using Python, calculate how much memory (in bytes) the A sub-cluster of Anvil has. Calculate how much memory (in TB) the A sub-cluster of Anvil has. (0.5 pts) -.. Using R, calculate how much memory (in bytes) the A sub-cluster of Anvil has. Calculate how much memory (in TB) the A sub-cluster of Anvil has. (0.5 pts) - - -++++ - -++++ - - -[NOTE] -==== -"Comments" are text in code cells that are not "run" as code. They serve as helpful notes on how your code works.
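For instance, here is a tiny, made-up illustration (not an answer to this question), reusing the 1918 MB of memory mentioned earlier:

[source,r]
----
%%R

# convert this job's 1918 MB of memory into GB; everything after the '#' is ignored by R
1918 / 1000
----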
Always comment your code well enough that you can come back to it after a long amount of time and understand what you wrote. In R and Python, single-line comments can be made by putting `#` at the beginning of the line you want commented out. -==== - -[NOTE] -==== -Spacing in code is sometimes important, sometimes not. The two things you can do to find out what applies in your case are looking at documentation online and experimenting on your own, but we will also try to stress what spacing is mandatory and what is a style decision in our videos. -==== - -In question 1 we answered questions about cores and memory for the Anvil clusters. This time, we want you to convert your GB memory amount from question 1 into bytes and terabytes. Instead of using a calculator (or paper, or mental math for you good-at-mental-math folks), write these calculations using R _and_ Python, in separate code cells. - -[TIP] -==== -A Gigabyte is 1,000,000,000 bytes. -A Terabyte is 1,000 Gigabytes. -==== - -[TIP] -==== -https://www.datamentor.io/r-programming/operator[This link] will point you to resources about how to use basic operators in R, and https://www.tutorialspoint.com/python/python_basic_operators.htm[this one] will teach you about basic operators in Python. -==== - -.Items to submit -==== -- Python code to calculate the amount of memory in Anvil sub-cluster A in bytes and TB, along with the output from running that code. -- R code to calculate the amount of memory in Anvil sub-cluster A in bytes and TB, along with the output from running that code. -==== - -=== Question 5 (2 pts) -[upperalpha] -.. Load the "flight_of_passage.csv" data into an R dataframe called "dat". (0.5 pts) -.. Take the head of "dat" to ensure your data loaded in correctly. (0.5 pts) -.. Change the name of "dat" to "flight_of_passage", remove the reference to "dat", and then take the head of "flight of passage" in order to ensure that your actions were successful. (1 pt) - - -++++ - -++++ - - -In the previous question, we ran our first R and Python code (aside from _provided_ code). In the fall semester, we will focus on learning R. In the spring semester, we will learn some Python. Throughout the year, we will always be focused on working with data, so we must learn how to load data into memory. Load your first dataset into R by running the following code. - -[source,ipython] ----- -%%R - -dat <- read.csv("/anvil/projects/tdm/data/disney/flight_of_passage.csv") ----- - -Confirm that the dataset has been read in by passing the dataset, `dat`, to the `head()` function. The `head` function will return the first 5 rows of the dataset. - -[source,r] ----- -%%R - -head(dat) ----- - -[IMPORTANT] -==== -Remember -- if you are in a _new_ code cell on the , you'll need to add `%%R` to the top of the code cell, otherwise, Jupyter will try to run your R code using the _Python_ interpreter -- that would be no good! -==== - -`dat` is a variable that contains our data! We can name this variable anything we want. We do _not_ have to name it `dat`; we can name it `my_data` or `my_data_set`. - -Run our code to read in our dataset, this time, instead of naming our resulting dataset `dat`, name it `flight_of_passage`. Place all of your code into a new cell. Be sure there is a level 2 header titled "Question 5", above your code cell. - -[TIP] -==== -In markdown, a level 2 header is any line starting with 2 hashtags. For example, `Question X` with two hashtags beforehand is a level 2 header. When rendered, this text will appear much larger. 
You can read more about markdown https://guides.github.com/features/mastering-markdown/[here]. -==== - -[NOTE] -==== -We didn't need to re-read in our data in this question to make our dataset be named `flight_of_passage`. We could have re-named `dat` to be `flight_of_passage` like this. - -[source,r] ----- -flight_of_passage <- dat ----- - -Some of you may think that this isn't exactly what we want, because we are copying over our dataset. You are right, this is certainly _not_ what we want! What if it was a 5GB dataset, that would be a lot of wasted space! Well, R does copy on modify. What this means is that until you modify either `dat` or `flight_of_passage` the dataset isn't copied over. You can therefore run the following code to remove the other reference to our dataset. - -[source,r] ----- -rm(dat) ----- -==== - -.Items to submit -==== -- Code to load the data into a dataframe called `dat` and take the head of that data, and the output of that code. -- Code to change the name of `dat` to `flight_of_passage` and remove the variable `dat`, and to take the head of `flight_of_passage` to ensure the name-change worked. -==== - -=== Submitting your Work - - -++++ - -++++ - - -Congratulations, you just finished your first assignment for this class! Now that we've written some code and added some markdown cells to explain what we did, we are ready to submit our assignment. For this course, we will turn in a variety of files, depending on the project. - -We will always require a Jupyter Notebook file. Jupyter Notebook files end in `.ipynb`. This is our "source of truth" and what the graders will turn to first when grading. - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this. -==== - -A `.ipynb` file is generated by first running every cell in the notebook (which can be done quickly by pressing the "double play" button along the top of the page), and then clicking the "Download" button from menu:File[Download]. - -In addition to the `.ipynb` file, an additional file should be included for each programming language in the project containing all of the code from that langauge that is in the project. A full list of files required for the submission will be listed at the bottom of the project page. - -Let's practice. Take the R code from this project and copy and paste it into a text file with the `.R` extension. Call it `firstname-lastname-project01.R`. Do the same for each programming language, and ensure that all files in the submission requirements below are included. Once complete, submit all files as named and listed below to Gradescope. - -.Items to submit -==== -- `firstname-lastname-project01.ipynb`. -- `firstname-lastname-project01.R`. -- `firstname-lastname-project01.py`. -- `firstname-lastname-project01.sql`. -- `firstname-lastname-project01.sh`. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - -Here is the Zoom recording of the 4:30 PM discussion with students from 21 August 2023: - -++++ - -++++ diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project02.adoc deleted file mode 100644 index edaa5f3d4..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project02.adoc +++ /dev/null @@ -1,301 +0,0 @@ -= TDM 10100: Project 2 -- 2023 -Introduction to R part I - -In this project we will dive in head-first and learn some of the basics while solving data-driven problems. - - -[NOTE] -==== -**5 Basic Types of Data** - - * Values like 1.5 are called numeric values, real numbers, decimal numbers, etc. - * Values like 7 are called integers or whole numbers. - * Values TRUE or FALSE are called logical values or Boolean values. - * Texts consist of sequences of words (also called strings), and words consist of sequences of characters. - * Values such as 3 + 2ifootnote:[https://stat.ethz.ch/R-manual/R-devel/library/base/html/complex.html] are called complex numbers. We usually do not encounter these in The Data Mine. -==== - - - -[NOTE] -==== -R and Python both have their advantages and disadvantages. A key part of learning data science methods is to understand the situations in which R is a more helpful tool to use, or Python is a more helpful tool to use. Both of them are good for their own purposes. In a similar way, hammers and screwdrivers and drills and many other tools are useful for construction, but they all have their own individual purposes. - -In addition, there are many other languages and tools, e.g., https://julialang.org/[Julia] and https://www.rust-lang.org/[Rust] and https://go.dev/[Go] and many other languages are emerging as relatively newer languages that each have their own advantages. -==== - -**Context:** In the last project we set the stage for the rest of the semester. We got some familiarity with our project templates, and modified and ran some examples. - -In this project, we will continue to use R within Jupyter Lab to solve problems. Soon, you will see how powerful R is and why it is often more effective than using spreadsheets as a tool for data analysis. - -**Scope:** xref:programming-languages:R:index.adoc[R], xref:programming-languages:R:lists-and-vectors.adoc[vectors, lists], https://rspatial.org/intr/4-indexing.html[indexing] - -.Learning Objectives -**** -- Be aware of the different concepts and when to apply them; such as lists, vectors, factors, and data.frames -- Be able to explain and demonstrate: positional, named, and logical indexing. -- Read and write basic (csv) data using R. -- Identify good and bad aspects of simple plots. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset: - -- `/anvil/projects/tdm/data/flights/subset/1995.csv` - -== Questions - -=== Question 1 (1 pt) -[upperalpha] -.. How many columns does this data frame have? (0.25 pts) -.. How many rows does this data frame have? (0.25 pts) -.. 
What type(s) of data are in this data frame (example: numerical values, and/or text strings, etc.) (0.5 pts) - -[TIP] -==== -"Kernel died" is a common error you could encounter during the semester. If you get a pop-up that says your "Kernel Died," it typically means that either 1) Anvil is down (be sure to check your email and Piazza for updates), or 2) you need more cores for your project. Try starting a new session with an additional core. If you are using more than 4 cores, the problem is NOT this. -==== -[TIP] -==== -For this project, you will probably need to reserve 2-3 cores. Also, remember to use the `seminar-r` kernel going forward in this class. -==== - - -++++ - -++++ - - -It is important to get a good understanding of the dataset(s) with which you are working. This is the best first step to help solve any data-driven problems. - -We are going to use the `read.csv()` function to load our datasets into a data frame. - -To read data in R from a CSV file (.csv), you use the following command: - -[source,r] - ----- -myDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1995.csv") ----- - -[TIP] -==== -R is a case-sensitive language, so if you try and take the head of `mydf` instead of `myDF`, it will not work. -==== - -[NOTE] -==== -Here `myDF` is a variable - a name that references our data frame. In practice, you should always use names that are specific, descriptive, and meaningful. -==== - -We want to use functions such as `head`, `tail`, `dim`, `summary`, `str`, `class`, to get a better understanding of our data frame. - -[TIP] -==== -- `head(myDF)` - Look at the head (or top) of the data frame - -- `tail(myDF)` - Look at the tail (or bottom) of the data frame - -- `class(myDF$Dest)` - Return the type of data in a column of the data frame, for instance, in a column that stores the destination of flights (Dest) - -- Try and figure out `dim`, `summary`, and `str` on your own, but we give some details about them in the video as well. -==== - -.Items to submit -==== -- Code used to solve sub-questions A, B, and C, and the output from running that code. -- The number of columns and rows in the data frame, in a markdown cell. -- A list of all of the types of data present in the data frame, in a markdown cell. -==== - -=== Question 2 (1 pt) -[upperalpha] -.. What type of data is in the vector `myairports`? (0.5 pts) -.. The vector `myairports` contains all of the airports where flights departed from in 1995. Print the first 250 of those airports. (Do not print all of the airports, because there are 5327435 such values!) How many of the first 250 flights departed from O'Hare? (0.5 pts) - - -++++ - -++++ - - -[NOTE] -==== -A vector is a simple way to store a sequence of data. The data can be numeric data, logical data, textual data, etc. -==== - -Let's create a new https://sudo-labs.github.io/r-data-science/vectors/[vector] called `myairports` containing all of the origin airports (i.e., the airports where the flights departed) from the column `myDF$Origin` of the data frame `myDF`. We can do this using the `$` operator. Documentation on the `$` operator can be found https://statisticsglobe.com/meaning-of-dollar-operator-in-r[here], and an example of how to use it is given below. - -[source,r] ----- -newVector <- myDF$ColumnName - -# to generate our vector, this would look like -myairports <- myDF$Origin ----- - -[TIP] -==== -The `head()` function may help you with part B of this question.
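For instance, here is a minimal sketch of the kind of pattern you could adapt (the name `first250` is just an example, and this is not a complete answer):

[source,r]
----
first250 <- head(myairports, n=250)  # the first 250 origin airports
table(first250)                      # tally how many times each airport appears
----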
-==== - -.Items to submit -==== -- Code used to create `myairports` and to solve the above sub-questions, and the output from running that code. -- The type of data in your `myairports` vector in a markdown cell. -- The number of flights that are from O'Hare in the first 250 entries of your `myairports` vector, in a markdown cell. -==== - -=== Question 3 (2 pts) - -[upperalpha] -.. How many flights departed from Indianapolis (`IND`) in 1995? How many flights landed in Indianapolis (`IND`) in 1995? (1 pt) -.. Consider the flight data from row 894 the data frame. What airport did it depart from? Where did it arrive? (0.5 pts) -.. How many flights have a distance of less than 200 miles? (0.5 pts) - - -++++ - -++++ - - -There are many different ways to access data after we load it, and each has its own use case. One of the most common ways to access data is called _indexing_. Indexing is a way of selecting or excluding specific elements in our data. This is best shown through examples, some of which can be found https://rspatial.org/intr/4-indexing.html[here]. - -[NOTE] -==== -Accessing data can be done in many ways, one of those ways is called **_indexing_**. Typically we use brackets **[ ]** when indexing. By doing this we can select or even exclude specific elements. For example we can select a specific column and a certain range within the column. Some examples of symbols to help us select elements include: + - * < less than + - * > greater than + - * \<= less than or equal to + - * >= greater than or equal to + - * == is equal + - * != is not equal + -==== - -[NOTE] -==== -Many programming languages, such as https://www.python.org/[Python] and https://www.learn-c.org/[C], are called "zero-indexed". This means that they begin counting from '0' instead of '1'. Because R is not zero-indexed, we can count like humans normally do. In other words, R starts numbering with row '1'. -==== - -.Helpful Examples -==== -[source,r] ----- -# get all of the data between row "row_index_start" and "row_index_end" -myDF$Distance[row_index_start:row_index_end,] - -# get all of the data from row 3 of myDF -myDF[3,] - -# get all of the data from column 5 of myDF -myDF[,5] - -# get every row of data in the columns between -# myfirstcolumn and mylastcolumn -myDF[,myfirstcolumn:mylastcolumn] - - -# get the first 250 values from column 17 -head(myDF[,17], n=250) - -# retrieves all rows with Distances greater than 100 -myDF$Distance[myDF$Distance > 100] - -# retrieve all flights with Origin equal to "ORD" -myDF$Origin[myDF$Origin == "ORD"] ----- -==== - -.Items to submit -==== -- Code used to solve each sub-question above, and the output from running it. -- The number of flights that departed from Indianapolis in our data, in a markdown cell. -- The number of flights that landed in Indianapolis in our data, in a markdown cell. -- The origin and destination airport from row 894 of the data frame, in a markdown cell. -- The number of flights that have distances less than 200 miles, in a markdown cell. -==== - -=== Question 4 (2 pts) -[upperalpha] -.. Rank the airline companies (in the column `myDF$UniqueCarrier`) according to their popularity, (i.e. according to the number of flights on each airline). (1 pt) -.. Now find the ten airplanes that had the most flights in 1995. List them in order, from most popular to least popular. Do you notice anything unusual about the results? 
(1 pt) - - -++++ - -++++ - - -Oftentimes we will be dealing with enormous quantities of data, and it just isn't feasible to try and look at the data point-by-point in order to summarize the entire data frame. When we find ourselves in a situation like this, the `table()` function is here to save the day! - -Take a look at https://www.geeksforgeeks.org/create-table-from-dataframe-in-r/[this link] for some examples of how to use the `table()` function in R. Once you have a good understanding of how it works, try and answer the three sub-questions below using the `table()` function. You may need to use some other basic R functions as well. - -[NOTE] -==== -It is useful to use functions in R and see how they behave, and then to take a function of the result, and take a function of that result, etc. For instance, it is common to summarize a vector in a table, and then sort the results, and then take the first few largest or smallest values. This is known as "nesting" functions, and is common throughout programming. - -==== - -.Items to submit -==== -- Code used to solve the sub-questions above, and the output from running it. -- The airline company codes in order of popularity, in a markdown cell. -- The ten airplane tail codes with the most flights in our data, ordered from most flights to least flights, in a markdown cell. -==== - -=== Question 5 (2 pts) -[upperalpha] -.. Using the R built-in function `hist()`, create a histogram of flight distances. Make sure your plot has an appropriate title and labelled axes for full credit. (1 pt) -.. Write 2-3 sentences detailing any patterns you see in your plot and what those patterns tell you about the distance of flights in this dataset. (1 pt) - -++++ - -++++ - -Graphs are a very important tool in analyzing data. By visualizing our data in any of a number of ways, we can discover patterns that may not be as readily apparent by simply looking at tables. As such, they are a vital skill in all data scientists' skillset. - -In this question, we would like you to get comfortable with plotting in R. There are a number of built in tools for basic plotting in this language, but we will focus on histograms here. Using the `Distance` column of our data frame, create a histogram of the distribution of distances for our data. Then, write a few sentences describing your plot, any patterns you see, and what the distribution as a whole looks like. - -[TIP] -==== -https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/hist.html[Documentation on R histograms] may help you understand how to complete this question. -==== - -.Items to submit -==== -- Code used to generate your histogram. -- A histogram of the distances of flights in our data with a title and labelled axes. -- 2-3 sentences about the patterns in the data, and what those patterns tell you about the greater data, in a markdown cell. -==== - -=== Submitting your Work -Congratulations, you've finished Project 2! Make sure that all of the below files are included in your submission, and feel free to come to seminar, post on Piazza, or visit some office hours if you have any further questions. - -.Items to submit -==== -- `firstname-lastname-project02.ipynb`. -- `firstname-lastname-project02.R`. -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. 
**Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - -Here is the Zoom recording of the 4:30 PM discussion with students from 28 August 2023: - -++++ - -++++ diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project03.adoc deleted file mode 100644 index b47bb9812..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project03.adoc +++ /dev/null @@ -1,277 +0,0 @@ -= TDM 10100: Project 3 -- Fall 2023 -Inroduction to R part II - -**Motivation:** `data.frames` are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a `data.frame`. - -**Context:** In Project 2 we ran our first R code, learned about vectors and indexing, and explored some basic functions in R. In this project, we will continue to enforce what we've already learned and learn more about how dataframes, formally called `data.frame`, work in R. - -**Scope:** r, data.frames, factors - -.Learning Objectives -**** -- Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- List the differences between lists, vectors, factors, and data.frames, and when to use each. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/craigslist/vehicles.csv` - -== Setting Up -First, let's take a look at all of the data available to students. In order to do this, we are going to use a new function as listed below to list all of the files in the craigslist folder. - -Let's run the below command using the *seminar-r* kernel to view all the files in the folder. - -[source,r] ----- -list.files("/anvil/projects/tdm/data/craigslist") ----- - - -As you can see, we have two different files worth of information from Craigslist. -For this project, we are interested in looking at the `vehicles.csv` file - -++++ - -++++ - - -Before we read in the data, we should check the size of the file to get an idea of how big it is. This is important because if the file is too large, we may need more cores for our project or else our core will 'die'. - -We can check the size of our file (in bytes) using the following command. 
-[source,r] ----- -file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size ----- - -[TIP] -==== -You can also use `file.info` to see other information about the file. - -*size*- double: File size in bytes. + -isdir- logical: Is the file a directory? + -*mode*- integer of class "octmode". The file permissions, printed in octal, for example 644. + -*mtime, ctime, atime*- integer of class "POSIXct": file modification, ‘last status change’ and last access times. + -*uid*- integer: the user ID of the file's owner. + -*gid*- integer: the group ID of the file's group. + -*uname*- character: uid interpreted as a user name. + -*grname* - character: gid interpreted as a group name. + -(Unknown user and group names will be NA.) -==== - -Now that we have made sure our file isn't too big (1.44 GB), let's read it into a dataframe in the same way that we have done in the previous two projects. - -[TIP] -==== -We recommend using 2 cores for your Jupyter Lab session this week. -==== - -Now we can read in the data and get started with our analysis. -[source,r] ----- -myDF <- read.csv("/anvil/projects/tdm/data/craigslist/vehicles.csv") ----- - -== Questions - -=== Question 1 (1 pt) - -++++ - -++++ - -[upperalpha] -.. How many rows and columns does our dataframe have? -.. What type/s of data are in this dataframe (example: numerical values, and/or text strings, etc.) -.. 1-2 sentences giving an overall description of our data. - -As we stressed in Project 2, familiarizing yourself with the data you are going to work with is an important first step. For this question, we want to figure out how many rows and columns are in our data along with what the types of data are in our data frame. The hint below contains all of the functions that we need to solve this problem. (We also covered these functions in detail in Project 2, so feel free to reference the previous project if you want more information.) - -When answering sub-question C., consider talking about where the data appears to be taken from, what the data contains, and any important details that immediately stand out to you about the data. - -[TIP] -==== -The `head()`, `dim()`, and `str()` functions could be helpful in answering this question. -==== - -.Items to submit -==== -- The number of rows and columns in our dataframe, in a markdown cell. -- The types of data in our dataframe, in a markdown cell. -- 1-2 sentences summarizing our data. -==== - -=== Question 2 (1 pt) - -++++ - -++++ - -++++ - -++++ - -[upperalpha] -.. Print the number of NA values in the *'year'* column of `myDF`, and the percentage of the total number of rows in `myDF` that this represents. -.. Create a new data frame called `goodyearsDF` with only the rows of `myDF` that have a defined `year` (non `NA` values). Print the `head` of this new data frame. -.. Create a new data frame called `missingyearsDF` with only the rows of `myDF` that *are* missing data in the `year` column. Print the `head` of this new data frame. - -Now that we have a better understanding of the general structure and contents of our data, let's focus on some specific patterns in our data that may make analysis more challenging. - -Often, one of these patterns is missing data. This can come in many forms, such as NA, NaN, NULL, or simply a blank space in one of our dataframes cells. When performing data analysis, it is important to consider missing data and decide how to handle it appropriately. - -In this question, we will look at filtering out rows with missing data. 
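As a quick first check, it can be helpful to count how much data is actually missing before deciding what to do with it. One possible sketch is shown here with the `lat` (latitude) column that the examples below also use; the same pattern works for any column.

[source,r]
----
# how many values are missing (NA) in the lat column?
sum(is.na(myDF$lat))

# what percentage of all of the rows in myDF does that represent?
100 * sum(is.na(myDF$lat)) / nrow(myDF)
----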
The `R` function `is.na()` indicates `TRUE` or `FALSE` is the analogous data is missing or not missing (respectively). An exclamation mark changes `TRUE` to `FALSE` and changes `FALSE` to `TRUE`. For this reason, `!is.na()` indicates which data are not `NA` values, in other words, which data are not missing. As an example, if we wanted to create a new dataframe with all of the rows that are not missing the latitude values, we could do any of the following equivalent methods: - -[source,r] ----- -goodlatitudeDF <- subset(myDF, !is.na(myDF$lat)) -goodlatitudeDF <- subset(myDF, !is.na(lat)) -goodlatitudeDF <- myDF[!is.na(myDF$lat), ] ----- - -In the second method, the `subset` function knows that we are working with `myDF`, so we do not need to specify that `lat` is the latitude column in the `myDF` data frame, and instead, we can just refer to `lat` and the `subset` function knows that we are referring to a column. - -In the third method, when we write `myDF[ , ]` we put things before the comma that are conditions on the rows, and we put things after the comma that are conditions on the columns. So we are saying that we want rows of `myDF` for which the `lat` values are not `NA`, and we want all of the columns of `myDF`. - -If we compare the sizes of the original data frame and this new data frame, we can see that some rows were removed. - -[source,r] ----- -dim(myDF) ----- - -[source,r] ----- -dim(goodlatitudeDF) ----- - -To answer question 2, we want you to work (instead) with the `year` column, and try the same things that we demonstrated above from the `lat` column. We were simply giving you examples using the `lat` column, so that you have an example about how to deal with missing data in the `year` column. - - -.Items to submit -==== -- The number of NA values in the `year` column of `myDF` and the percentage of the total number of rows in `myDF` that this represents, in a markdown cell. -- A dataframe called `goodyearsDF` containing only the rows in myDF that have a defined `year` (non NA values), and print the `head` of that data frame. -- A dataframe called `missingyearsDF` containing only the rows in myDF that are missing the `year` data, and print the `head` of that data frame. -==== - -=== Question 3 (2 pts) - -++++ - -++++ - -++++ - -++++ - -[IMPORTANT] -==== -Use the `myDF` data.frame for this question. -==== - -[upperalpha] -.. Print the mean price of vehicles by `year` during the last 20 years. -.. Find which `year` of vehicle appears most frequently in our data, and how frequently it occurs. - - -[TIP] -==== -Using the `aggregate` function is one possible way to solve this problem. An example of finding the mean `price` for each `type` of car is shown here: - -[source,r] ----- -aggregate(price ~ type, data = myDF, FUN = mean) ----- -==== - -We want you to (instead) find the mean `price` for cars by `year`. - -[TIP] -==== -Finding the most frequent value in our data can be done using `table`, which we have talked about previously, in conjunction with the `which.max` function. An example of finding the most frequent type of car is shown here: - -[source,r] ----- -which.max(table(myDF$type)) ----- -==== - -Now we want you to (instead) find the year in which the most cars appear in the data set. - -.Items to submit -==== -- The mean price of each year of vehicle for the last 20 years, in a markdown cell. -- The most frequent year in our data, and how frequently it occured. -==== - -=== Question 4 (2 pts) - -++++ - -++++ - -[upperalpha] -.. 
Among the `region_url` values in the data set, which `region_url` is most popular? -.. What are the three most popular states, in terms of the number of craigslist listings that appear? - -Use the `table`, `sort`, and `tail` commands to find the most popular `region_url` and the most popular three states. - -(These two questions are not related to each other. In other words, when you look for the three states that appear most frequently, they have nothing at all to do with the region_url that you found.) - -.Items to submit -==== -- The most popular `region_url`. -- The three states that appear most frequently. -==== - - -=== Question 5 (2 pts) - -++++ - -++++ - -.. In question 3, we found the average price of vehicles by year. ("Average" and "mean" are two difference words for the very same concept.) Choose at least two different plot types in R, and create two plots that show the average vehicle price by year. -.. Write 3-5 sentences detailing any patterns present in the data along with your personal observations. (i.e. shape, outliers, etc.) - -[NOTE] -==== -Remember, all plots should have a title and appropriate axis labels. Axes should also be scaled appropriately. It is also necessary to explain your plot using a few sentences. -==== - -.Items to submit -==== -- 2 different plots of average price of vehicle by year. -- A 3-5 sentence explanation of any patterns present in the data along with your personal observations. -==== - -=== Submitting your Work -Nice work, you've finished Project 3! Make sure that all of the below files are included in your submission, and feel free to come to seminar, post on Piazza, or visit some office hours if you have any further questions. - -.Items to submit -==== -- `firstname-lastname-project01.ipynb`. -- `firstname-lastname-project01.R`. -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project04.adoc deleted file mode 100644 index 7e3294b70..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project04.adoc +++ /dev/null @@ -1,273 +0,0 @@ -= TDM 10100: Project 4 -- Fall 2023 -Introduction to R part III - - -Many data science tools including have powerful ways to index data. - -[NOTE] -==== -R typically has operations that are vectorized and there is little to no need to write loops. 
+ -R typically also uses indexing instead of using an if statement. - -* Sequential statements (one after another) i.e. + -1. print line 45 + -2. print line 15 + - -**if/else statements** - create an order of direction based on a logical condition. + - -if statement example: -[source,r] ----- -x <- 7 -if (x > 0){ -print ("Positive number") -} ----- -else statement example: -[source,r] ----- -x <- -10 -if(x >= 0){ -print("Non-negative number") -} else { -print("Negative number") -} ----- -In `R`, we can classify many numbers all at once: -[source,r] ----- -x <- c(-10,3,1,-6,19,-3,12,-1) -mysigns <- rep("Non-negative number", times=8) -mysigns[x < 0] <- "Negative number" -mysigns ----- - -==== - - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - - - -**Context:** As we continue to become more familiar with `R` this project will help reinforce the many ways of indexing data in `R`. - -**Scope:** R, data.frames, indexing. - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - - -Using the *seminar-r* kernel -Lets first see all of the files that are in the `craigslist` folder -[source,r] ----- -list.files("/anvil/projects/tdm/data/craigslist") ----- - -[NOTE] - -==== -Remember: + - -* If we want to see the file size (aka how large) of the CSV. -[source,r] ----- -file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size ----- - -* You can also use 'file.info' to see other information about the file. -==== - -After looking at several of the files we will go ahead and read in the data frame on the Vehicles -[source,r] ----- -myDF <- read.csv("/anvil/projects/tdm/data/craigslist/vehicles.csv", stringsAsFactors = TRUE) ----- - -It is important that, each time we look at data, we start by becoming familiar with the contents of the data. + -In past projects we have looked at the head/tail along with the structure and the dimensions of the data. We want to continue this practice. - -This dataset has 25 columns. We are unable to see it all without adjusting the width. We can do this by -[source,r] ----- -options(repr.matrix.max.cols=25, repr.matrix.max.rows=200) ----- -and we also remember (from the previous project) that we can set the output in `R` to look more natural this way: -[source,r] ----- -options(jupyter.rich_display = F) ----- - - -[TIP] -==== -- Use 'head' to look at the first 6 rows -[source,r] - head(myDF) -- Use 'tail' to look at the last 6 rows -[source, r] - tail(myDF) -- Use `str` to check structure -[source, r] - str(myDF) -- Use `dim` to check dimensions -[source, r] - dim(myDF) - -To sort and order a single vector you can use this code: -[source,r] ----- -head(myDF$year[order(myDF$year)]) ----- -You can also use the `sort` function. By default, it sorts in ascending order. If want the order to be descending, use `decreasing = TRUE` as an argument -[source,r] -head(sort(myDF$year, decreasing = TRUE)) -==== - -_**vectorization**_ - -Most of R's functions are vectorized, which means that the function will be applied to all elements of a vector, without needing to loop through the elements one at a time. The most common way to access individual elements is by using the `[]` symbol for indexing. 
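As a small illustration of vectorization (using a made-up vector rather than the craigslist data), notice how arithmetic and comparisons apply to every element at once, and how `[]` then selects just the elements you want:

[source,r]
----
# a small made-up vector
myvector <- c(1200, 450, 9800, 30, 15500)

# vectorized arithmetic: every element is doubled, with no loop
myvector * 2

# vectorized comparison: one TRUE/FALSE per element
myvector > 1000

# indexing with [] keeps only the elements where the condition is TRUE
myvector[myvector > 1000]
----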
- -[NOTE] -==== -[source,r] ----- -cut(myvector, breaks = c(-Inf,10,50,200,Inf) , labels = c("a","b","c","d")) - -breaks value specified the range of myvector divided into the following intervals: -- (-∞, 10) -- [10, 50) -- [50, 200) -- [200, ∞) - -labels values will be assigned -- Values less than 10: Will be labeled as "a". -- Values in the range [10, 50): Will be labeled as "b". -- Values in the range [50, 200): Will be labeled as "c". -- Values 200 and above: Will be labeled as "d". ----- -==== - -== Questions - -=== Question 1 (1.5 pts) - -++++ - -++++ - - -[upperalpha] -.. How many unique states are there in total? Which five of the states have the most occurrences? -.. How many cars have a price that is greater than or equal to $2000 ? -.. What is the average price of the vehicles in the dataset? - - -=== Question 2 (1.5 pts) - -++++ - -++++ - -[upperalpha] -.. Create a new column `mileage_category` in your data.frame that categorize the vehicle's mileage into different buckets by using the `cut` function on the `odometer` column. -... "Low": [0, 50000) -... "Moderate": [50000, 100000) -... "High": [100000, 150000) -... "Very High": [150000, Inf) - -.. Create a new column called `has_VIN` that flags whether or not the listing Vehicle has a VIN provided. - -.. Create a new column called `description_length` to categorize listings based on the length of their descriptions (in terms of the number of characters). -... "Very Short": [0, 50) -... "Short": [50, 100) -... "Medium": [100, 200) -... "Long": [200, 500) -... "Very Long": [500, Inf) - -[TIP] -==== -You may count number of characters using the `nchar` function -[source,r] -mynchar <- nchar(as.character(myDF$description)) -==== - -[NOTE] -==== -Remember to consider _empty_ values and or `NA` values - -==== - -=== Question 3 (1.5 pts) - -++++ - -++++ - -[upperalpha] -.. Using the `table` function, and the new column `mileage_category` that you created in Question 2, find the number of cars in each of the different mileage categories. -.. Using the `table` function, and the new column `has_VIN` that you created in Question 2, identify how many vehicles have a VIN and how many do not have a VIN. -.. Using the `table` function, and the new column `description_length` that you created in Question 2, identify how many vehicles are in each of the categories of description length. - - -=== Question 4 (1.5 pts) - -++++ - -++++ - -**Preparing for Mapping** -//[arabic] -[upperalpha] -.. Extract all of the data for Texas into a data.frame called `myTexasDF` -.. Identify the most popular state from myDF, and extract all of the data from that state into a data.frame called `popularStateDF` -.. Create a third data.frame called `myFavoriteDF` with the data from a state of your choice - - -=== Question 5 (2 pts) - -++++ - -++++ - -**Mapping** -[upperalpha] -.. Using the R package `leaflet`, make 3 maps of the USA, namely, one map for the data in each of the `data.frames` from question 4. - - -=== Submitting your Work -Well done, you've finished Project 4! Make sure that all of the below files are included in your submission, and feel free to come to seminar, post on Piazza, or visit some office hours if you have any further questions. - -Project 4 Assignment Checklist -==== -- Code used to solve quesitons 1 to 5 -- All of your code and comments, and Output from running the code in a Jupyter Lab file: - * `firstname-lastname-project04.ipynb`. -- All of your code and comments in an R File: - * `firstname-lastname-project04.R`. 
-- submit files through Gradescope -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project05.adoc deleted file mode 100644 index 0a5a6552b..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project05.adoc +++ /dev/null @@ -1,167 +0,0 @@ -= TDM 10100: Project 5 -- Fall 2023 - -**Motivation:** `R` differs from other programing languages in that `R` works great with vectorized functions and the _apply_ suite of functions (instead of using loops). - -[NOTE] -==== -The apply family of functions provide an alternative to loops. You can use *`apply()`* and its variants (i.e. mapply(), sapply(), lapply(), vapply(), rapply(), and tapply()...) to manipulate pieces of data from data.frames, lists, arrays, matrices in a repetitive way. -==== - -**Context:** We will focus in this project on efficient ways of processing data in `R`. - -**Scope:** tapply function - -.Learning Objectives -**** -- Demonstrate the ability to use the `tapply` function. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset in Anvil: - -`/anvil/projects/tdm/data/election/escaped2020sample.txt` - -[NOTE] -==== -A txt and csv file both store information in plain text. Data in *csv* files are almost always separated by commas. In *txt* files, the fields can be separated by commas, semicolons, pipe symbols, tabs, or other separators. - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - - -To read in a txt file in which the data is add sep="|" (see code below) -[source,r] ----- - myDF <- read.csv("/anvil/projects/tdm/data/election/escaped2020sample.txt", sep="|") ----- - -You might want to use 3 cores in this project when you setup your Jupyter Lab session. -==== -=== `Data Understanding` - -The file uses '|' (instead of commas) to separate the data fields. The reason is that one column of data contains full names, which sometimes include commas. - -[source,r] -head(myDF) - -When looking at the head of the data frame, notice that the entries in the `TRANSACTION_DT` column have the month, day, and year all crammed together without any slashes between them. 
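Before converting anything, it is worth confirming this for yourself. A short sketch for inspecting the raw column is given below; the exact values you see will depend on the rows in the sample file.

[source,r]
----
# peek at the raw transaction dates and check how R stored them
head(myDF$TRANSACTION_DT)
class(myDF$TRANSACTION_DT)
----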
- -=== `lubridate` - -The `lubridate` package can be used to put a column into a date format. In general, data that contains information about dates can sometimes be hard to put into a date format, but the `lubridate` package makes this easier. - -[source,r] ----- -library(lubridate, warn.conflicts = FALSE) -myDF$newdates <-mdy(myDF$TRANSACTION_DT) ----- -A new column `newdates` is created, with the same data as the `TRANSACTION_DT` column but now stored in `date` format. - -Feel free to check out https://raw.githubusercontent.com/rstudio/cheatsheets/master/lubridate.pdf[the official cheatsheet] to learn more about the `lubridate` package. - - -=== `tapply` - -*tapply()* helps us apply functions (for instance: mean, median, minimum, maximum, sum, etc...) to data, one group at a time. The *tapply()* function is most helpful when we need to break data into groups, applying a function to each of the groups of data. - -The `tapply` function takes three inputs: - -Some data to work on; a way to break the data into groups; and a function to apply to each group of data. - -[source, r] -tapply(myDF$TRANSACTION_AMT, myDF$newdates, sum) - -* The `tapply` function applies can `sum` the `myDF$TRANSACTION_AMT` data, grouped according to `myDF$newdates` -* Three inputs for tapply -** `myDF$TRANSACTION_AMT`: the data vector to work on -** `myDF$newdates`: the way to break the data into groups -** `sum`: the function to apply on each piece of data - -== Questions - - -=== Question 1 (1.5 pts) - -++++ - -++++ - -++++ - -++++ - -[upperalpha] -.. Use the `year` function (from the `lubridate` library) on the column `newdates`, to create a new column named `TRANSACTION_YR`. -.. Using `tapply`, add the values in the `TRANSACTION_AMT` column, according to the values in the `TRANSACTION_YR` column. -.. Plot the years on the x-axis and the total amount of the transactions by year on the y-axis. - -=== Question 2 (1.5 pts) - -++++ - -++++ - -[upperalpha] -.. From Question 1, you may notice that the majority of the data collected is found in the years 2019-2020. Please create a new dataframe that only contains data for the dates in the range 01/01/2020-12/31/2020. -.. Using `tapply`, get the sum of the money in the `TRANSACTION_AMT` column, grouped according to the months January through December (in 2020 only). -.. Plot the months on the x-axis and the total amount of the transactions (for each month) on the y-axis. - -=== Question 3 (1.5 pts) - -++++ - -++++ - -Let's go back to using the full set of data across all of the years (from Question 1). We can continue to experiment with the `tapply` function. - -[upperalpha] -.. Please find the donor who gave the most money (altogether) in the whole data set. -.. Find the total amount of money given (altogether) in each state. Then sort the states, according to the total amount of money given altogether. In which 5 states was the most money given? -.. What are the ten zipcodes in which the most money is donated (altogether)? - -=== Question 4 (2 pts) - -++++ - -++++ - -[upperalpha] -.. Using a `barplot` or `dotchart`, plot the total amount of money given in each of the top five states. -.. Using a `barplot` or `dotchart`, plot the total amount of money given in each of the top ten zipcodes. - -=== Question 5 (1.5 pts) - -++++ - -++++ - -[upperalpha] -.. Analyze something that you find interesting about the election data, make a plot to demonstrate your insight, and then explain your finding with a few sentences of explanation. 
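If you are not sure how to turn a `tapply` summary into a plot for Questions 4 and 5, one possible pattern is sketched below with made-up values; this is only an illustration of the sort-then-plot idea, not a specific answer to any question above.

[source,r]
----
# mytotals stands in for the output of a tapply call,
# i.e. something of the form tapply(amounts, groups, sum)
mytotals <- c(A = 1500, B = 92000, C = 480, D = 23750)   # toy values for illustration

# keep the largest few groups and plot them
topvalues <- head(sort(mytotals, decreasing = TRUE), 3)
barplot(topvalues, main = "Largest totals by group (toy data)", ylab = "Total amount")
----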
- -Project 05 Assignment Checklist -==== -* Jupyter Lab notebook with your code and comments for the assignment - ** `firstname-lastname-project05.ipynb`. -* R code and comments for the assignment - ** `firstname-lastname-project05.R`. - -* Submit files through Gradescope -==== -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project06.adoc deleted file mode 100644 index f3d7613a9..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project06.adoc +++ /dev/null @@ -1,143 +0,0 @@ -= TDM 10100: Project 6 -- Fall 2023 -Tapply, Tapply, Tapply - -**Motivation:** We want to have fun and get used to the function `tapply` - - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/olympics/athlete_events.csv` -- `/anvil/projects/tdm/data/death_records/DeathRecords.csv` - -== Questions - -=== Question 1 (1.5 pts) - -++++ - -++++ - -++++ - -++++ - -(We do not need the tapply function for Question 1) - -For this question, please read the dataset - -`/anvil/projects/tdm/data/olympics/athlete_events.csv` - -into a data frame called `myDF` as follows: - -[source, r] - -myDF <- read.csv("/anvil/projects/tdm/data/olympics/athlete_events.csv", stringsAsFactors=TRUE) - -[loweralpha] -.. Use the `table` function to list all Games with occurrences in this data frame -.. Use the `table` function to list all countries participating in the Olympics during the year 1980. (The output should exclude all countries that did not have any athletes in 1980.) -.. Use the `subset` function to create a new data frame containing data related to athletes that attended the Olympics more than one time. - -(Use the original data frame `myDF` as a starting point for each of these three questions. Problems 1a and 1b and 1c are independent of each other. For instance, when you solve question 1c, do not restrict yourself to the year 1980.) - -[TIP] -==== -For question 1c, use `duplicated` to identify duplicated elements, for example: - -[source, r] -vec <- c(3, 2, 6, 5, 1, 1, 1, 6, 5, 6, 4, 3) - -[source, r] -duplicated(vec) - -[source, r] -FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE - -==== - - - -=== Question 2 (1.5 pts) - -Use the `tapply` command to solve each of these questions: - -[loweralpha] -.. What is the average age of the participants from each country? -.. What is Maximum Height by Sport? For your output on this question, please sort the Maximum Heights in decreasing order, and display the first 5 values. 
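For the sorting asked about in Question 2, one general pattern is to wrap a `tapply` result in `sort` and take the first few entries with `head`. The sketch below uses short made-up vectors (not the Olympics data) purely to illustrate the pattern:

[source,r]
----
# made-up example data (not the Olympics dataset)
heights <- c(180, 175, 190, 165, 185, 170)
sports  <- c("Rowing", "Judo", "Rowing", "Diving", "Judo", "Diving")

# maximum value per group, sorted in decreasing order, first few entries shown
head(sort(tapply(heights, sports, max), decreasing = TRUE), 5)
----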
- - -=== Question 3 (1 pt) - -++++ - -++++ - -For this question, save the data from the data set - -`/anvil/projects/tdm/data/death_records/DeathRecords.csv` - -into a new data frame called `myDF` as follows: - -[source, r] -myDF <- read.csv("/anvil/projects/tdm/data/death_records/DeathRecords.csv", stringsAsFactors = TRUE) - -It might be helpful to get an overview of the structure of the data frame, by using the `str()` function: - -[source, r] -str(myDF) - -[loweralpha] -.. How many observations (i.e., rows) are given in this dataframe? -.. Change the column `MonthOfDeath` from numbers to months -.. How many people died (altogether) during each month? For instance, group together all of the deaths in January, all of the months in February, etc., so that you can display the total numbers from January to December in a total of 12 output values. - -[TIP] -==== -You may factorize the month names with a specified level order: -[source, r] -month_order <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December") -myDF$MonthOfDeath <- factor(myDF$MonthOfDeath) -levels(myDF$MonthOfDeath) <- month_order -==== - -=== Question 4 (2 pts) - -++++ - -++++ - -[loweralpha] -.. For each race, what is the average age at the time of death? Use the `race` column, which has integer values, and sort your outputs into descending order. -.. Now considering only data for females: for each race, what is the average age at the time of death? Now considering only data for males, we can ask the same question: for each race, what is the average age at the time of death? - -If you want to see the list of race values from the CDC for this data, you can look at page 15 of this pdf file: - -https://www.cdc.gov/nchs/data/dvs/Record_Layout_2014.pdf - -If you want to (this is optional!) you can use the method we used in question 3B to convert integer values into the string values that describe each race. This is not required but you are welcome to do this, if you want to. - -=== Question 5 (2 pts) - -[loweralpha] -.. Using the data set about the Olympic athletes, create a graph or plot that you find interesting. Write 1-2 sentences about something you found interesting about the data set; explain what you noticed in the dataset. -.. Using the data set about the death records, create a graph or plot that you find interesting. Write 1-2 sentences about something you found interesting about the data set; explain what you noticed in the dataset. - -Project 06 Assignment Checklist -==== -* Jupyter Lab notebook with your code and comments for the assignment - ** `firstname-lastname-project06.ipynb`. -* R code and comments for the assignment - ** `firstname-lastname-project06.R`. - -* Submit files through Gradescope -==== -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project07.adoc deleted file mode 100644 index 09274512e..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project07.adoc +++ /dev/null @@ -1,193 +0,0 @@ -= TDM 10100: Project 7 -- 2023 - -**Motivation:** A couple of bread-and-butter functions that are a part of the base R are: `subset`, and `merge`. `subset` provides a more natural way to filter and select data from a data.frame. `merge` brings the principals of combining data that SQL uses, to R. - -**Context:** We've been getting comfortable working with data in within the R environment. Now we are going to expand our tool set with these useful functions, all the while gaining experience and practice wrangling data! - -**Scope:** r, subset, merge, tapply - -.Learning Objectives -**** -- Gain proficiency using split, merge, and subset. -- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Demonstrate how to use tapply to solve data-driven problems. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/icecream/combined/products.csv` --` /anvil/projects/tdm/data/icecream/combined/reviews.csv` -- `/anvil/projects/tdm/data/movies_and_tv/titles.csv` -- `/anvil/projects/tdm/data/movies_and_tv/episodes.csv` -- `/anvil/projects/tdm/data/movies_and_tv/ratings.csv` - -== Questions - -[IMPORTANT] -==== -Please select 3 cores when launching JupyterLab for this project. -==== - -Data can come in a lot of different formats and from a lot of different locations. It is common to have several files that need to be combined together, before analysis is performed. The `merge` function is helpful for this purpose. The way that we merge files is different in each language and data science too. With R, there is a built-in `merge` function that makes things easy! (Of course students in TDM 10100 have not yet learned about SQL databases, but many of you will learn SQL databases someday too. The `merge` function is very similar to the ability to `merge` tables in SQL databases.) - -++++ - -++++ - -[NOTE] -==== -Read the data in using the following code. We used `read.csv` for this purpose in the past. The `fread` function is a _much faster_ and more efficient way to read in data. - -[source,r] ----- -library(data.table) - -products <- fread("/anvil/projects/tdm/data/icecream/combined/products.csv") -reviews <- fread("/anvil/projects/tdm/data/icecream/combined/reviews.csv") -titles <- fread("/anvil/projects/tdm/data/movies_and_tv/titles.csv") -episodes <- fread("/anvil/projects/tdm/data/movies_and_tv/episodes.csv") -ratings <- fread("/anvil/projects/tdm/data/movies_and_tv/ratings.csv") -==== - -[WARNING] -==== -Please remember to run the `library(data.table)` line, before you use the `fread` function. 
Otherwise, you will get an error in a pink box in JupyterLab like this: - -Error in fread: could not find function "fread" -==== - -=== Question 1 (1 pt) - -++++ - -++++ - -++++ - -++++ - - -We will use the `products` data.frame for this question. - -[loweralpha] -.. What are all the different ingredients in the first record of `products`? -.. Consider the `rating` column and the `ingredients` column. Consider only the products in which "GUAR GUM" is one of the ingredients. (You will need to use either `grep` or `grepl` or something similar, to find these products. Hint: You should have 85 such products.) List the ratings for these 85 products, in decreasing order. - -Please find out the distribution of ratings for the ice cream which ingredients include "GUAR GUM", display the result in descending order - - -=== Question 2 (1 pt) - -++++ - -++++ - -We will use the `products` and `reviews` data.frames for this question. - -[loweralpha] -.. Use the `brand` and `key` columns from both the `products` data.frame and `reviews` data.frame to `merge` the two data frames. This will give a new data.frame that contains the product details and their associated reviews. - -[TIP] -==== -If you do not specify the `brand` and `key` columns for the `merge`, then you will get an error, because the ingredients function contains characters in the `products` data frame but contains numeric values in the `reviews` data frame. -==== - - -[TIP] -==== -* The `merge` function in `R` allows two data frames to be combined by common columns. This function allows the user to combine data similar to the way `SQL` would using `JOIN`s. https://www.codeproject.com/articles/33052/visual-representation-of-sql-joins[Visual representation of SQL Joins] -* This is also a really great https://www.datasciencemadesimple.com/join-in-r-merge-in-r/[explanation of merge in `R`]. -==== - -=== Question 3 (3 pts) - -++++ - -++++ - -++++ - -++++ - - -We will use the `episodes`, `titles` and `ratings` data.frames for questions 3 through Question 5 - -[loweralpha] -.. Use `merge` (a few times) to create a new data.frame that contains at least the following four columns for **only** the episodes of the show called "Stranger Things". The show itself called "Stranger Things" has a `title_id` of tt4574334. You can find this on IMDB here: https://www.imdb.com/title/tt4574334/ Each episode of Stranger Things has its own `title_id` that contains the information for the specific episode as well. For your output: Show the top 5 rows of your final data.frame, containing the top 5 rated episodes of Stranger Things. - -- The `primary_title` of the **show itself** -- call it `show_title`. -- The `primary_title` of the **episode** -- call it `episode_title`. -- The `rating` of the **show itself** -- call it `show_rating`. -- The `rating` of the **episode** -- call it `episode_rating`. - -[TIP] -==== -Start by getting a subset of the `episodes` table that contains only information for the show Stranger Things. To do this, you will need to make a subset of the data frame that only has information for Stranger Things show. That way, we aren't working with as much data. -==== - -Make sure to show the top 5 rows of your final data.frame, containing the top 5 rated episodes of Stranger Things! - -[NOTE] -==== -In the videos, I did not rename the columns. You might want to rename them, because it might help you, but you do not need to rename them. It's up to you. I'm trying to be a little flexible and to provide guidance without being too strict either. 
-==== - -=== Question 4 (1 pt) - -++++ - -++++ - -For question 4, use the data frame that you built in Question 3. - -[loweralpha] -.. Use regular old indexing to find all episodes of "Stranger Things" with an `episode_rating` less than 8.5 and `season_number` of exactly 3. -.. Repeat the process, but this time use the `subset` function instead. - -Make sure that the dimensions of the data frames that you get in question 4a and 4b are the same sizes! - -=== Question 5 (2 pts) - -++++ - -++++ - -For question 5, use the data frame that you built in Question 3. - -The `subset` function allows you to index data.frame's in a less verbose manner. Read https://the-examples-book.com/programming-languages/R/subset[this]. - -While it maybe appears to be a clean way to subset data, I'd suggest avoiding it over explicit long-form indexing. Read http://adv-r.had.co.nz/Computing-on-the-language.html[this fantastic article by Dr. Hadley Wickham on non-standard evaluation]. Take for example, the following (a bit contrived) example using the dataframe we got in question (3). - -Note: You do not need to write much for your answer. It is OK if you try the example below, and you see that it fails (and it will fail for sure!), and then you say something like, "I will try hard to not use variable names that overlap with other variable names". Or something like that! We simply want to ensure that students are choosing to use good variable names. - -[source,r] ----- -season_number <- 3 -subset(StrangerThingsBigMergedDF, (season_number == season_number) & (rating.y < 8.5)) ----- -[loweralpha] -.. Read that provided article and do your best to explain _why_ `subset` gets a different result than our example that uses regular indexing. - - -Project 07 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project07.ipynb`. -* R code and comments for the assignment - ** `firstname-lastname-project07.R`. - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project08.adoc deleted file mode 100644 index c7104c9eb..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project08.adoc +++ /dev/null @@ -1,174 +0,0 @@ -= TDM 10100: Project 8 -- 2023 - -**Motivation:** Functions are an important part of writing efficient code. + -Functions allow us to repeat and reuse code. If you find you using a set of coding steps over and over, a function may be a good way to reduce your lines of code! - -**Context:** We've been learning about and using functions these last few weeks. + -To learn how to write your own functions we need to learn some of the terminology and components. - -**Scope:** r, functions - -.Learning Objectives -**** -- Gain proficiency using split, merge, and subset. 
-- Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc. -- Read and write basic (csv) data. -- Comprehend what a function is, and the components of a function in R. -**** - -== Dataset(s) - -We will use the same dataset(s) as last week: - -- `/anvil/projects/tdm/data/icecream/combined/products.csv` -- `/anvil/projects/tdm/data/icecream/combined/reviews.csv` - -[IMPORTANT] -==== -Please choose 3 cores when launching the JupyterLab for this project. -==== - -[NOTE] -==== -`fread`- is a fast and efficient way to read in data. - -[source,r] ----- -library(data.table) - -products <- fread("/anvil/projects/tdm/data/icecream/combined/products.csv") -reviews <- fread("/anvil/projects/tdm/data/icecream/combined/reviews.csv") ----- -==== -[WARNING] -==== -Please remember to run the `library(data.table)` line, before you use the `fread` function. Otherwise, you will get an error in a pink box in JupyterLab like this: - -Error in fread: could not find function "fread" -==== - -We will see how to write our own function, so that we can make a repetitive operation easier, by turning it into a single command. + - -We need to take care to name the function something concise but meaningful, so that other users can understand what the function does. + - -Function parameters can also be called formal arguments. - -[NOTE] -==== -A function contains multiple interrelated statements. We can "call" the function, which means that we run all of the statements from the function. + - -Functions can be built-in or can be created by the user (user-defined). + - -.Some examples of built in functions are: - -* `min()`, `max()`, `mean()`, `median()` -* `print()` -* `head()` - - -Syntax of a function -[source, R] ----- -what_you_name_the_function <- function (parameters) { - statement(s) that are executed when the function runs - the last line of the function is the returned value -} ----- -==== - -== Questions - -=== Question 1 (2 pts) - -++++ - -++++ - -++++ - -++++ - - -To gain better insights into our data, let's make two simple plots. The following are two examples. You can create your own plots. - -[loweralpha] -.. In project 07, you found the different ingredients for the first record in the `products` data frame. We may get all of the ingredients from the `products` data frame, and find the top 10 most frequently used ingredients. Then we can create a bar chart for the distribution of the number of times that each ingredient appears. -.. A line plot to visualize the distribution of the reviews of the products. -.. What information are you gaining from these graphs? -[TIP] -==== -The `table` function can be useful to get the distribution of the number of times that each ingredient appears. - -This is a good website for bar plot examples: https://www.statmethods.net/graphs/bar.html - -This is a good website for line plot examples: http://www.sthda.com/english/wiki/line-plots-r-base-graphs -==== - -Making a `dotchart` for Question 1 is helpful and insightful, as demonstrated in the video. BUT we also want you to see how to make a bar plot and a line plot. Do not worry about the names of the ingredients too much. If only a few names of ingredients appear on the x-axis for Question 1, that is OK wiht us. We just want to show the distribution (in other words, the numbers) of times that items appear. We are less concerned about the item names themselves. 
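If you have not built a bar chart from counts before, the following minimal sketch uses a made-up character vector (not the real ingredient lists) to show one way that the `table`, `sort`, and `barplot` functions can fit together:

[source,r]
----
# toy character vector standing in for a long list of ingredient names
ingredients <- c("SALT", "CREAM", "SALT", "SUGAR", "CREAM", "SALT", "MILK")

# count how often each value appears, sort the counts, and plot the most common ones
mycounts <- sort(table(ingredients), decreasing = TRUE)
barplot(head(mycounts, 3),
        main = "Most common values (toy data)",
        xlab = "Ingredient", ylab = "Number of appearances")
----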
- - -=== Question 2 (1 pt) - -For practice, now that you have a basic understanding of how to make a function, we will use that knowledge, applied to our dataset. - -Here are pieces of a function we will use on this dataset; products, reviews and products' rating put them in the correct order + -[source,r] -* merge_results <- merge(products_df, reviews_df, by="key") -* } -* function(products_df, reviews_df, myrating) -* return(products_reviews_results) -* { -* products_reviews_results <- merge_results[merge_results$rating >= myrating, ] -* products_reviews_by_rating <- - - -=== Question 3 (1 pt) - - -Take the above function and add comments explaining what the function does at each step. - -=== Question 4 (2 pts) - -[source,r] ----- -my_selection <- products_reviews_by_rating(products, reviews, 4.5) ----- - -Use the code above, to answer the following question. We want you to use the data frame `my_selection` when solving Question 4. (Do not use the full `products` data frame for Question 4.) - -[loweralpha] -.. How many products are there (altogether) that have rating at least 4.5? (This is supposed to be simple: You can just find the number of rows of the data frame `my_selection`.) - - -[TIP] -==== -The function merged two data sets products and reviews. Both of them have an `ingredients` column, so we need to use the `ingredients` column from `products` by referring to`ingredients.x`. -==== - -=== Question 5 (2 pts) - -For Question 5, go back to the full `products` data frame. (In other words, do not limit yourself to `my_selection` any more.) When you are constructing your function in part a, it should be helpful to review the videos from Question 1. - -[loweralpha] -.. Now create a function that takes 1 ingredient as the input, and finds the number of products that contain that ingredient. -.. Use your function to determine how many products contain SALT as an ingredient. - -(Note: If you test the function with "GUAR GUM", for instance, you will see that there are 85 products with "GUAR GUM" as an ingredient, as we learned in the previous project.) - - -Project 08 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project08.ipynb` -* R code and comments for the assignment - ** `firstname-lastname-project08.R`. - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project09.adoc deleted file mode 100644 index 7df6484dc..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project09.adoc +++ /dev/null @@ -1,169 +0,0 @@ -= TDM 10100: Project 9 -- 2023 -:page-mathjax: true - -Benford's Law - -**Motivation:** -https://en.wikipedia.org/wiki/Benford%27s_law[Benford's law] has many applications, including its well known use in fraud detection. It also helps detect anomalies in naturally occurring datasets. 
-[NOTE] -==== -* You may get more information about Benford's law from the following link -https://www.kdnuggets.com/2019/08/benfords-law-data-science.html["What is Benford's Law and Why is it Important for Data Science"] -==== - -**Scope:** 'R' and functions - - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -* `/anvil/projects/tdm/data/restaurant/orders.csv` - -[NOTE] -==== -A txt and csv file both store information in plain text. csv files are always separated by commas. In txt files the fields can be separated with commas, semicolons, or tabs. - -[source,r] ----- -myDF <- read.csv("/anvil/projects/tdm/data/restaurant/orders.csv") ----- -==== - -== Questions - -https://www.statisticshowto.com/benfords-law/[Benford's law] (also known as the first digit law) states that the leading digits in a collection of datasets will most likely be small. + -It is basically a https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/probability-distribution/[probability distribution] that gives the likelihood of the first digit occurring, in a set of numbers. - -Another way to understand Benford's law is to know that it helps us assess the relative frequency distribution for the leading digits of numbers in a dataset. It states that leading digits with smaller values occur more frequently. - -[NOTE] -==== -A probability distribution helps define what the probability of an event happening is. It can be simple events like a coin toss, or it can be applied to complex events such as the outcome of drug treatments etc. + - -* Basic probability distributions which can be shown on a probability distribution table. -* Binomial distributions, which have “Successes” and “Failures.” -* Normal distributions, sometimes called a Bell Curve. - -Remember that the sum of all the probabilities in a distribution is always 100% or 1 as a decimal. - -This law only works for numbers that are *significand S(x)* which means any number that is set into a standard format. + - -To do this you must - -* Find the first non-zero digit -* Move the decimal point to the right of that digit -* Ignore the sign - -An example would be 9087 and -.9087 both have the *S(x)* as 9.087 - -It can also work to find the second, third and succeeding numbers. It can also find the probability of certain combinations of numbers. + - -Typically this law does not apply to data sets that have a minimum and maximum (restricted). This law does not apply to datasets if the numbers are assigned (i.e. social security numbers, phone numbers etc.) and are not naturally occurring numbers. + - -Larger datasets and data that ranges over multiple orders of magnitudes from low to high work well using Bedford's law. -==== - -++++ - -++++ - -Benford's law is given by the equation below. 
- -$P(d) = \dfrac{\ln((d+1)/d)}{\ln(10)}$ - -$d$ is the leading digit of a number (and $d \in \{1, \cdots, 9\}$) - -An example the probability of the first digit being a 1 is - -$P(1) = \dfrac{\ln((1+1)/1)}{\ln(10)} = 0.301$ - -The following is a function implementing Benford's law -[source, r] -benfords_law <- function(d) log10(1+1/d) - -To show Benfords_law in a line plot -[source, r] -digits <-1:9 -bf_val<-benfords_law(digits) -plot(digits, bf_val, xlab = "digits", ylab="probabilities", main="Benfords Law Plot Line") - - -=== Question 1 (1 pt) - -++++ - -++++ - -[loweralpha] - -.. Create a plot (could be a bar plot, line plot, scatter plot, etc., any type of plot is OK) to show Benfords's Law for probabilities of digits from 1 to 9. - -=== Question 2 (1 pt) - -++++ - -++++ - -.. Create a function called `first_digit` that takes an argument `number`, and extracts the first non-zero digit from the number - -=== Question 3 (2 pts) - -++++ - -++++ - -.. Read in the restaurant orders data `/anvil/projects/tdm/data/restaurant/orders.csv` into a dataset named `myDF`. - -.. Create a vector `fd_grand_total` by using `sapply` with your function `first_digit` from question 2 on the `grand_total` column in your `myDF` dataframe - - -=== Question 4 (2 pts) - -++++ - -++++ - -++++ - -++++ - -.. Calculate the actual distribution of digits in `fd_grand_total` -.. Plot the output actual distribution (again, could be a bar plot, line plot, dot plot, etc., anything is OK). Does it look like it follows Benford's law? Explain briefly. - -[TIP] -==== -use `table` to get summary times of digits then divide by `length` of the vector fd_grand_total -==== - -=== Question 5 (2 pts) - -++++ - -++++ - -.. Create a function that will return a new data frame `orders_by_dates` from the `myDF` that looks at the `delivery_date` column to compare with two arguments `start_date` and `end_date`. If the `delivery_date` is in between, then add record to the new data frame. -.. Run the function for a certain period, and display some orders with the `head` function - -[TIP] -`as.Date` will be useful to do conversion in order to compare dates - - -Project 09 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project09.ipynb`. -* R code and comments for the assignment - ** `firstname-lastname-project09.R`. - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project10.adoc deleted file mode 100644 index e64de3579..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project10.adoc +++ /dev/null @@ -1,102 +0,0 @@ -= TDM 10100: Project 10 -- 2023 -Creating functions and using `tapply` and `sapply` - -**Motivation:** As we have learned, functions are foundational to more complex programs and behaviors. 
+ -There is an entire programming paradigm based on functions called https://en.wikipedia.org/wiki/Functional_programming[functional programming]. - -**Context:** -We will apply functions to entire vectors of data using `tapply` and `sapply`. We learned how to create functions, and now the next step we will take is to use it on a series of data. - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The project will use the following dataset(s): - -* `/anvil/projects/tdm/data/restaurant/orders.csv` -* `/anvil/projects/tdm/data/restaurant/vendors.csv` - -[NOTE] -==== -The read.csv() function automatically delineates by a comma`,` + -You can use other delimiters by using adding the `sep` argument + -i.e. `read.csv(...sep=';')` + - -You can also load the `data.table` library and use the `fread` function. -==== - - -== Questions - -=== Question 1 (2 pts) - -++++ - -++++ - - -Please load the datasets into data frames named `orders` and `vendors` - -There are many websites that explain how to use `grep` and `grepl` (the `l` stands for `logical`) to search for patterns. See, for example: https://statisticsglobe.com/grep-grepl-r-function-example - -.. Use the `grepl` function and the `subset` function to make a new data frame from `vendors`, containing only the rows with "Fries" in the column called `vendor_tag_name`. - -.. Now use the `grep` function and row indexing, to make a data frame from `vendors` that (as before) contains only the rows with "Fries" in the column called `vendor_tag_name`. - -.. Verify that your data frames in questions 1a and 1b are the same size. - -=== Question 2 (2 pts) - -++++ - -++++ - -.. In the data frame `vendors`, there are two types of `delivery_charge` values: 0 (which represented free delivery) and 0.7 (which represents non-free delivery). Make a table that shows how many of each type of value there are in the `delivery_charge` column. -.. Please use the `prop.table` function to convert these counts into percentages. - -=== Question 3 (2 pts) - -++++ - -++++ - -.. Consider only the vendors with `vendor_category_id == 2`. Among these vendors, find the percentages of the `delivery_charge` column that are 0 (free delivery) and 0.7 (non-free delivery). -.. Now consider only the vendors with `vendor_category_id == 3`, and again find the percentages of the `delivery_charge` column that are 0 (free delivery) and 0.7 (non-free delivery). - -=== Question 4 (1 pt) - -++++ - -++++ - -.. Solve questions 3a and 3b again, but this time, solve these two questions with one application of the `tapply` command, which provides the answers to both questions. (It is fine to give only the counts here, in question 4a, and convert the counts to percentages in question 4b.) - -.. Now (instead) use an user-defined function inside the `tapply` to convert your answer from counts into percentages. - -=== Question 5 (1 pt) - -++++ - -++++ - -.. Starting with your solution to question 4a, now use the `sapply` command to convert your answer from counts into percentages. Your solution should agree with the percentages that you found in question 4b. - - - -Project 10 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project10.ipynb` -* R code and comments for the assignment - ** `firstname-lastname-project10.R`. 
- -* Submit files through Gradescope -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project11.adoc deleted file mode 100644 index 50433c406..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project11.adoc +++ /dev/null @@ -1,92 +0,0 @@ -= TDM 10100: Project 11 -- 2023 - -**Motivation:** Selecting the right tools, understanding a problem and knowing what is available to support you takes practice. + -So far this semester we have learned multiple tools to use in `R` to help solve a problem. This project will be an opportunity for you to choose the tools and decide how to solve the problem presented. - -We will also be looking at `Time Series` data. This is a way to study the change of one or more variables through time. Data visualizations help greatly in looking at Time Series data. - - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The project will use the following dataset: - -* `/anvil/projects/tdm/data/restaurant/orders.csv` - -== Questions - -=== Question 1 (2 pts) - -++++ - -++++ - - -Read in the dataset `/anvil/projects/tdm/data/restaurant/orders.csv` into a data.frame named `orders` - -[loweralpha] -.. Convert the `created_at` column to month, date, year format -.. How many unique years are in the data.frame ? -.. Create a line plot that shows the average number of orders placed per day of the week ( e.g. Monday, Tuesday ...). -.. Write one to two sentences on what you notice in the graph - -=== Question 2 ( 2 pts) - -++++ - -++++ - - -[loweralpha] -.. Identify the top 5 vendors (vendor_id) with the highest number of orders over the years (based on `created_at` for time reference) -.. For these top 5 vendors, determine the average grand_total amount for the orders they received each year -.. Comment on any interesting patterns you observe, regarding the average total amount across these vendors, and how that changed over the years. - -[NOTE] -==== -You can use either `tapply` OR the `aggregate` function to group or summarize data -==== - -=== Question 3 (2 pts) - -++++ - -++++ - - - -.. Using the `created_at` field, try to find out how many orders are placed after 5 pm, and how many orders are placed before 5 pm? -.. Create a bar chart that compares the number of orders placed after 5 pm with the number of orders before 5 pm, for each day of the week - -[NOTE] -==== -You can use the library `ggplot2` for this question. - -You may get more information about ggplot2 from here: https://ggplot2.tidyverse.org -==== - -=== Question 4 (2 pts) - -Looking at the data, is there something that you find interesting? -Create 3 new graphs, and explain what you see, and why you chose each specific type of plot. 
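If you have not used the `ggplot2` package mentioned in the note under Question 3 before, the minimal sketch below shows the basic pattern: build a small data frame, map its columns to aesthetics with `aes`, and add a geom layer. The `toy` data frame here is made up purely for illustration; it is not the `orders` data and does not answer any of the questions above.

[source,r]
----
library(ggplot2)

# hypothetical toy data: counts of some event for each weekday
toy <- data.frame(
  day   = factor(c("Mon", "Tue", "Wed", "Thu", "Fri"),
                 levels = c("Mon", "Tue", "Wed", "Thu", "Fri")),
  count = c(12, 18, 9, 22, 30)
)

# a basic bar chart: geom_col draws one bar per row, using the y value as the bar height
ggplot(toy, aes(x = day, y = count)) +
  geom_col() +
  labs(title = "Example bar chart", x = "Day of week", y = "Number of events")
----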
- - -Project 11 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project11.ipynb` -* R code and comments for the assignment - ** `firstname-lastname-project11.R`. - -* Submit files through Gradescope -==== - - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project12.adoc deleted file mode 100644 index 16afd55e3..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project12.adoc +++ /dev/null @@ -1,86 +0,0 @@ -= TDM 10100: Project 12 -- 20223 - -**Motivation:** -In the previous project we manipulated dates, this project we are going to continue to work with dates. -Working with dates in `R` can require more attention than working with other object classes. These packages will help simplify some of the common tasks related to date data. + - -Dates and times can be complicated. For instance, not every year has 365 days. Dates are difficult because they have to accommodate for the Earth's rotation and orbit around the sun. We need to handle timezones, daylight savings, etc. -If suffices to say that, when focusing on dates and date-times in R, the simpler the better. - -.Learning Objectives -**** -- Read and write basic (csv) data. -- Explain and demonstrate: positional, named, and logical indexing. -- Utilize apply functions in order to solve a data-driven problem. -- Gain proficiency using split, merge, and subset. -- Demonstrate the ability to create basic graphs with default settings. -- Demonstrate the ability to modify axes labels and titles. -- Incorporate legends using legend(). -- Demonstrate the ability to customize a plot (color, shape/linetype). -- Work with dates in a variety of ways. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The project will use the following dataset: - -* `/anvil/projects/tdm/data/restaurant/orders.csv` - -== Questions - -Go ahead and use the `fread` function from the `data.table` library, to read in the dataset to a data frame called `orders`. - -=== Question 1 (2 pts) - -++++ - -++++ - - -[loweralpha] -. Use the `substr` function to get (only) the month-and-year of each date in the `created_at` column. How many times does each month-and-year pair occur? You may find more information about the `substr` function here: https://www.digitalocean.com/community/tutorials/substring-function-in-r#[R substring] -. Now (instead) use the `month` function and the `year` function on the `created_at` column, and make sure that your results agree with the results from 1a. -. Finally, use the `format` function to extract the month-and-year pairs from the `created_at` column, and make sure that your results (again!) agree with the results from 1a. - - -=== Question 2 (2 pts) - -++++ - -++++ - -[loweralpha] -. 
Which `customer_id` placed the largest number of orders altogether? (Each row of the data set represents exactly one order.) -. For the `customer_id` that you found in question 2a, either use the `subset` function or use indexing to find the month-and-year pair in which that customer placed the most orders. - -=== Question 3 (2 pts) - -[loweralpha] -. There are 5 types of payments in the `payment_mode` column. How many times are each of these 5 types of payments used in the data set? -. If we focus on the `customer_id` found in question 2a, which type of payment does that customer prefer? How many times did that customer use each of the 5 types of payments? - -=== Question 4 (2 pts) - -[loweralpha] -. Use the `subset` function to make a data frame called `ordersJan2020` that contains only the orders from January 2020. -. Create a plot using the `ordersJan2020` data that shows the sum of the `grand_total` values for each of the 7 days of the week. - - - -Project 12 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project12.ipynb` -* R code and comments for the assignment - ** `firstname-lastname-project12.R`. -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project13.adoc deleted file mode 100644 index ddca85d74..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project13.adoc +++ /dev/null @@ -1,84 +0,0 @@ -= TDM 10100: Project 13 -- 2023 - -**Motivation:** This semester we took a deep dive into `R` and its packages. Let's take a second to pat ourselves on the back for surviving a long semester and review what we have learned! - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The project will use the following dataset: - -* `/anvil/projects/tdm/data/icecream/combined/products.csv` - -== Questions - -=== Questions 1 (2 pts) - -For question 1, read the dataset into a data.frame called `orders` - -[loweralpha] -.. Create a plot that shows, for each `brand` of ice cream, the total number of `rating_count` (in other words, the `sum` of those `rating_count` values) for each `brand` of icecream. There are 4 brands, so your solution should have 4 values altogether. - -[TIP] -==== -- It might be worthwhile to make a dotchart. 
- -==== - -Before solving Question 2, please build a data frame called `bigDF` from these three files - -`/anvil/projects/tdm/data/icecream/bj/reviews.csv` - -`/anvil/projects/tdm/data/icecream/breyers/reviews.csv` - -`/anvil/projects/tdm/data/icecream/talenti/reviews.csv` - -using this code: - -[source,bash] ----- -mybrands <- c("bj", "breyers", "talenti") -myfiles <- paste0("/anvil/projects/tdm/data/icecream/", mybrands, "/reviews.csv") -bigDF <- do.call(rbind, lapply(myfiles, fread)) ----- - -Use this data frame `bigDF` to answer Questions 2, 3, and 4: - - -=== Question 2 (2 pts) - -[loweralpha] -.. In which month-and-year pair were the most reviews given? (There is one review per line of this data frame `bigDF`. -.. Make a plot that shows, for each year, the average number of stars in that year. - -=== Question 3 (2 pts) - -[loweralpha] -.. Which key has the lowest average number of stars? -.. There is one entry in which the text review has more than 2500 characters! Print the text of this review. - -=== Question 4 (2 pts) - -[loweralpha] -.. Consider all of the authors of the reviews. Which author wrote the most reviews altogether? (Note: there are many blank authors, and there are a lot of Anonymous authors, but please ignore blank authors and Anonymous authors in this question.) -.. Considering the 43 reviews written by the author that you found in question 4a, this author is usually happy and gives high ratings. BUT this author gave one review that only had 1 star. Print the text of that 1 star review from the author you found in question 4a. - - - - -Project 13 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project13.ipynb` -* R code and comments for the assignment - ** `firstname-lastname-project13.R`. - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project14.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project14.adoc deleted file mode 100644 index d5317a6f8..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-project14.adoc +++ /dev/null @@ -1,53 +0,0 @@ -= TDM 10100: Project 14 -- Fall 2023 - -**Motivation:** We covered a _lot_ this year! When dealing with data driven projects, it is crucial to thoroughly explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance. As this is our final project for the semester, its primary purpose is survey based. You will answer a few questions mostly by revisiting the projects you have completed. - -**Context:** We are on the last project where we will revisit our previous work to consolidate our learning and insights. 
This reflection also help us to set our expectations for the upcoming semester - -**Scope:** R, Jupyter Lab, Anvil - - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - - -=== Question 1 (1 pt) - -.. Reflecting on your experience working with different datasets, which one did you find most enjoyable, and why? Discuss how this dataset's features influenced your analysis and visualization strategies. Illustrate your explanation with an example from one question that you worked on, using the dataset. - -=== Question 2 (1 pt) - -.. Reflecting on your experience working with different commands, functions, and packages, which one is your favorite, and why do you enjoy learning about it? Please provide an example from one question that you worked on, using this command, function, or package. - - -=== Question 3 (1 pt) - -.. Reflecting on data visualization questions that you have done, which one do you consider most appealing? Please provide an example from one question that you completed. You may refer to the question, and screenshot your graph. - -=== Question 4 (2 pts) - -.. While working on the projects, including statistics and testing, what steps did you take to ensure that the results were right? Please illustrate your approach using an example from one problem that you addressed this semester. - -=== Question 5 (1 pt) - -.. Reflecting on the projects that you completed, which question(s) did you feel were most confusing, and how could they be made clearer? Please use a specific question to illustrate your points. - -=== Question 6 (2 pts) - -.. Please identify 3 skills or topics related to the R language that you want to learn. For each, please provide an example that illustrates your interests, and the reason that you think they would be beneficial. - - -Project 14 Assignment Checklist -==== -* Jupyter Lab notebook with your answers and examples. You may just use markdown format for all questions. - ** `firstname-lastname-project14.ipynb` -* Submit files through Gradescope -==== - -WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-projects.adoc deleted file mode 100644 index 7fd078360..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/10100/10100-2023-projects.adoc +++ /dev/null @@ -1,45 +0,0 @@ -= TDM 10100 - -xref:fall2023/logistics/office_hours_101.adoc[[.custom_button]#TDM 101 Office Hours#] -xref:fall2023/logistics/101_TAs.adoc[[.custom_button]#TDM 101 TAs#] -xref:fall2023/logistics/syllabus.adoc[[.custom_button]#Syllabus#] - -== Project links - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. 
-==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$10100-2023-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -=== Sign up - -https://piazza.com/purdue/fall2022/tdm10100[https://piazza.com/purdue/fall2023/tdm10100] - -=== Link - -https://piazza.com/purdue/fall2022/tdm10100/home[https://piazza.com/purdue/fall2023/tdm10100/home] - -== Syllabus - -See xref:fall2023/logistics/syllabus.adoc[here]. diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project01.adoc deleted file mode 100644 index d475e7357..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project01.adoc +++ /dev/null @@ -1,388 +0,0 @@ -= TDM 20100: Project 1 -- 2023 - -**Motivation:** It’s been a long summer! Last year, you got some exposure to both R and Python. This semester, we will venture away from R and Python, and focus on UNIX utilities like `sort`, `awk`, `grep`, and `sed`. While Python and R are extremely powerful tools that can solve many problems — they aren’t always the best tool for the job. UNIX utilities can be an incredibly efficient way to solve problems that would be much less efficient using R or Python. In addition, there will be a variety of projects where we explore SQL using `sqlite3` and `MySQL/MariaDB`. - -We will start slowly, however, by remembering how to work with Jupyter Lab. In this project we will become re-familiarized with our development environment, review some, and prepare for the rest of the semester. - -**Context:** This is the first project of the semester! We will start with some review, and set the "scene" to learn about some powerful UNIX utilities, and SQL the rest of the semester. - -**Scope:** Jupyter Lab, R, Python, Anvil, markdown - -.Learning Objectives -**** -- Read about and understand computational resources available to you. -- Learn how to run R code in Jupyter Lab on Anvil. -- Review R and Python. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/1991.csv` - -== Setting Up to Work - -++++ - -++++ - - -This year we will be using Jupyter Lab on the Anvil cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster. 
- -Navigate and login to https://ondemand.anvil.rcac.purdue.edu using your ACCESS credentials (including 2-factor authentication using Duo Mobile). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward. - -[TIP] -==== -If you did not (yet) set up your 2-factor authentication credentials with Duo, you can set up the credentials here: https://the-examples-book.com/starter-guides/anvil/access-setup -==== - -Towards the middle of the top menu, click on the item labeled btn:[My Interactive Sessions]. (Depending on the size of your browser window, there might only be an icon; it is immediately to the right of the menu item for The Data Mine.) On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Bioinformatics, Interactive Apps, and The Data Mine. Under "The Data Mine" section, near the bottom of your screen, click on btn:[Jupyter Notebook]. (Make sure that you choose the Jupyter Notebook from "The Data Mine" section.) - -If everything was successful, you should see a screen similar to the following. - -image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"] - -Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 1 CPU core and 1918 MB of memory. - -[NOTE] -==== -As you can see in the screenshot above, each core is associated with 1918 MB of memory. If you know how much memory your project will need, you can use this value to choose how many cores you want. In this and most of the other projects in this class, 1-2 cores is generally enough. -==== - -[NOTE] -==== -Please use 4 cores for this project. This is _almost always_ excessive, but for this project in question 3 you will be reading in a rather large dataset that will very likely crash your kernel without at least 3-4 cores. -==== - -We use the Anvil cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine. - -After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on this button to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following. - -image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"] - -There are 2 primary options that you will need to know about. - -seminar:: -The `seminar` kernel runs Python code but also has the ability to run R code or SQL queries in the same environment. - -[TIP] -==== -To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page]. -==== - -seminar-r:: -The `seminar-r` kernel is intended for projects that **only** use R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell. - -For now, let's focus on the `seminar` kernel. Click on btn:[seminar], and a fresh notebook will be created for you. 
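As a small preview of the line magic that Question 3 below walks through in more detail, a cell in the `seminar` kernel can run R code simply by starting the cell with `%%R`. The tiny vector below is just a hypothetical sanity check that the R bridge works; it is not part of any question.

[source,r]
----
%%R

# hypothetical sanity check: run a little R inside the (Python-based) seminar kernel
myvec <- c(2, 4, 6, 8)
cat("mean of myvec:", mean(myvec), "\n")
----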
- - -The first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`). - -Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`. - -There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain comments about your work). - -Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A. - -[TIP] -==== -Make sure to read about and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -==== - -== Questions - -=== Question 1 (1 pt) -[upperalpha] -.. How many cores and how much memory (in GB) does Anvil's sub-cluster A have? (0.5 pts) -.. How many cores and how much memory (in GB) does your personal computer have? (0.5 pts) - -For this course, projects will be solved using the https://www.rcac.purdue.edu/compute/anvil[Anvil computing cluster]. - -Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (often called RAM, or Random Access Memory). Use the information on the provided webpages to calculate how many cores and how much memory is available _in total_ for Anvil's "sub-cluster A". - -Take a minute and figure out how many cores and how much memory is available on your own computer. If you do not have a computer of your own, work with a friend to see how many cores there are, and how much memory is available, on their computer. - -[TIP] -==== -Information about the core and memory capacity of Anvil "sub-clusters" can be found https://www.rcac.purdue.edu/compute/anvil[here]. - -Information about the core and memory capacity of your computer is typically found in the "About this PC" section of your computer's settings. -==== - -.Items to submit -==== -- A sentence (in a markdown cell) explaining how many cores and how much memory is available to Anvil sub-cluster A. -- A sentence (in a markdown cell) explaining how many cores and how much memory is available, in total, for your own computer. -==== - -=== Question 2 (1 pt) -[upperalpha] -.. Using Python, what is the name of the node on Anvil you are running on? -.. Using Bash, what is the name of the node on Anvil you are running on? -.. Using R, what is the name of the node on Anvil you are running on? - -Our next step will be to test out our connection to the Anvil Computing Cluster! Run the following code snippets in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on (in three different languages!). What is the name of the node on Anvil that you are running on? - -[source,python] ----- -import socket -print(socket.gethostname()) ----- - -[source,r] ----- -%%R - -system("hostname", intern=TRUE) ----- - -[source,bash] ----- -%%bash - -hostname ----- - -[TIP] -==== -To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu. -==== - -Check the results of each code snippet to ensure they all return the same hostname. 
Do they match? You may notice that `R` prints some extra "junk" output, while `bash` and `Python` do not. This is nothing to be concerned about as different languages can handle output differently, but it is good to take note of. - -.Items to submit -==== -- Code used to solve this problem, along with the output of running that code. -==== - -=== Question 3 (1 pt) -[upperalpha] -.. Run each of the example code snippets below, and include them and their output in your submission to get credit for this question. - -++++ - -++++ - - -[TIP] -==== -Remember, in the upper right-hand corner of your notebook you will see the current kernel for the notebook, `seminar`. If you click on this name you will have the option to swap kernels out -- no need to do this now, but it is good to know! -==== - -In this course, we will be using Jupyter Lab with multiple different languages. Often, we will center a project around a specific language and choose the kernel for that langauge appropriately, but occasionally we may need to run a language in a kernel other than the one it is primarily built for. The solution to this is using line magic! - -Line magic tells our code interpreter that we are using a language other than the default for our kernel (i.e. The `seminar` kernel we are currently using is expecting Python code, but we can tell it to expect R code instead.) - -Line magic works by having the very first line in a code cell formatted like so: - -`%%language` - -Where `language` is the language we want to use. For example, if we wanted to run R code in our `seminar` kernel, we would use the following line magic: - -`%%R` - -Practice running the following examples, which include line magic where needed. - -python:: -[source,python] ----- -import pandas as pd -df = pd.read_csv('/anvil/projects/tdm/data/flights/subset/1991.csv') ----- - -[source,python] ----- -df[df["Month"]==12].head() # get all flights in December ----- - -SQL:: -[source, ipython] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -[source, sql] ----- -%%sql - --- get all episodes called "Finale" -SELECT * -FROM episodes AS e -INNER JOIN titles AS t -ON t.title_id = e.episode_title_id -WHERE t.primary_title = 'Finale' -LIMIT 5; ----- - -bash:: -[source,bash] ----- -%%bash - -names="John Doe;Bill Withers;Arthur Morgan;Mary Jane;Rick Ross;John Marston" -echo $names | cut -d ';' -f 3 -echo $names | cut -d ';' -f 6 ----- - - -[NOTE] -==== -In the above examples you will see lines such as `%%R` or `%%sql`. These are called "Line Magic". They allow you to run non-Python code in the `seminar` kernel. In order for line magic to work, it MUST be on the first line of the code cell it is being used in (before any comments or any code in that cell). - -In the future, you will likely stick to using the kernel that matches the project language, but we wanted you to have a demonstration about "line magic" in Project 1. Line magic is a handy trick to know! - -To learn more about how to run various types of code using the `seminar` kernel, see https://the-examples-book.com/projects/templates[our template page]. -==== - -.Items to submit -==== -- Code from the examples above, and the outputs produced by running that code. -==== - -=== Question 4 (1 pt) -[upperalpha] -.. How many code cells are there in the default template? (0.5 pts) -.. How many markdown cells are there in the default template? 
(0.5 pts) - -As we mentioned in the `Setting Up` section of this project, there are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain markdown text which you can render into nicely formatted text). How many cells of each type are there in this template by default? - -.Items to submit -==== -- The number of cells of each type in the default template, in a markdown cell. -==== - -=== Question 5 (1 pt) -[upperalpha] -.. Create an unordered list of at least 3 of your favorite interests. Italicize at least one of these. (0.5 pts) -.. Create an ordered list of at least 3 of your favorite interests. Embolden at least one of these, and make at least one other item formatted like `code`. (0.5 pts) - -Markdown is well worth learning about. You may already be familiar with it, but more practice never hurts, and there are plenty of niche tricks you may not know! - -[TIP] -==== -For those new to Markdown, please review this https://www.markdownguide.org/cheat-sheet/[cheat sheet]! -==== - -Create a Markdown cell in your notebook. For this question, we would like you to create two lists as follows. - -Firstly, create an _unordered_ list of at least 3 of your favorite interests (some examples could include sports, animals, music, etc.). Within this list, _italicize_ at least one item. - -Secondly, create an _ordered_ list that orders the items in your previous list, from most favorite to least favorite. In this list, **embolden** at least one item, and make at least one other item formatted like `code`. - -[TIP] -==== -Don't forget to "run" your markdown cells by clicking the small "Play" button in the notebook menu. Running a markdown cell will render the text in the cell with all of the formatting you specified. Your unordered lists will be bulleted and your ordered lists will be numbered. -==== - -.Items to submit -==== -- Unordered list of 3+ items with at least one _italicized_ item. -- Ordered list of 3+ items with at least one **emboldened** item and at least one `code` item. -==== - -=== Question 6 (1 pt) -[upperalpha] -.. Write your own LinkedIn "About" section using Markdown that includes a header, body text that you would be comfortable adding to your LinkedIn account, and at least one link using Markdown syntax. - -Browse https://www.linkedin.com and read some profiles. Pay special attention to accounts with an "About" section. Write your own personal "About" section using Markdown in a new Markdown cell, with the following features: - -- A header for this section (your choice of size) that says "About". -- The body text of your personal "About" section that you would feel comfortable uploading to LinkedIn. -- In the body text of your "About" section, _for the sake of learning markdown_, include at least 1 link using Markdown's link syntax. - -[TIP] -==== -A Markdown header is a line of text at the top of a Markdown cell that begins with one or more `#`. -==== - -.Items to submit -==== -- A markdown cell containing your LinkedIn "About" entry, as described above. -==== - -=== Question 7 (2 pts) -[upperalpha] -- Create a function in Python to print the median, mean, and standard deviation of the `DepDelay` column in our dataset, along with the shape of the `/anvil/projects/tdm/data/flights/subset/1991.csv` dataset overall. 
(1 pt) -- Create an R function to print the median, mean, and standard deviation of the `DepDelay` column in our dataset, along with the shape of the `/anvil/projects/tdm/data/flights/subset/1991.csv` dataset overall. (1 pt) - -This question may seem a bit difficult at first, but these are all concepts we covered in the 100 level of the class! Remember, your previous projects are still on Anvil (assuming you haven't deleted/overwritten them) and can be a great resource to look back on. You may also look back at the previous 100 level project instructions on The Examples Book. - -Using `pandas` in Python, create a function that takes a dataframe as input and prints the shape of the dataframe along with the mean, median, and standard deviation of the `DepDelay` column of that dataframe. Print your results formatted as follows: - -``` -MyDF Summary Statistics --- -Shape: (rows, columns) -Mean: 123.456 -Median: 123.456 -Standard Deviation: 123.456 ---------------------------- -``` - -Then, recreate your function but this time using R. Remember that you will need to use the `%%R` line magic at the top of your cell to tell the kernel that you are using R code. You should not need to import any libraries in order to do this. - -[TIP] -==== -The `R` equivalent of `print()` is `cat()`. -==== - -[NOTE] -==== -It is not important that your function output is formatted the exact same as ours. What is important, however, is that any printing that occurs in your code is neat and well formatted. If it is hard for the graders to read, you may lose points. Do your best and we will always work together to improve things. -==== - -Make sure your code is complete, and well-commented. Double check that both functions return the same values as a built-in sanity check for your code. - -.Items to submit -==== -- Python Function to print median, mean, and standard deviation of the `DepDelay` column of our dataset, along with the shape of the dataset. -- R Function to print median, mean, and standard deviation of the `DepDelay` column of our dataset, along with the shape of the dataset. -==== - -=== Submitting your Work - -++++ - -++++ - - -Congratulations, you just finished your first assignment for this class! Now that we've written some code and added some markdown cells to explain what we did, we are ready to submit our assignment. For this course, we will turn in a variety of files, depending on the project. - -We will always require a Jupyter Notebook file. Jupyter Notebook files end in `.ipynb`. This is our "source of truth" and what the graders will turn to first when grading. - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this. -==== - -A `.ipynb` file is generated by first running every cell in the notebook (which can be done quickly by pressing the "double play" button along the top of the page), and then clicking the "Download" button from menu:File[Download]. 
- -In addition to the `.ipynb` file, an additional file should be included for each programming language in the project containing all of the code from that langauge that is in the project. A full list of files required for the submission will be listed at the bottom of the project page. - -Let's practice. Take the R code from this project and copy and paste it into a text file with the `.R` extension. Call it `firstname-lastname-project01.R`. Do the same for each programming language, and ensure that all files in the submission requirements below are included. Once complete, submit all files as named and listed below to Gradescope. - -.Items to submit -==== -- `firstname-lastname-project01.ipynb`. -- `firstname-lastname-project01.R`. -- `firstname-lastname-project01.py`. -- `firstname-lastname-project01.sql`. -- `firstname-lastname-project01.sh`. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - -Here is the Zoom recording of the 4:30 PM discussion with students from 21 August 2023: - -++++ - -++++ diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project02.adoc deleted file mode 100644 index 5c819f74c..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project02.adoc +++ /dev/null @@ -1,395 +0,0 @@ -= TDM 20100: Project 2 -- 2023 - -**Motivation:** The ability to navigate a shell, like `bash`, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful UNIX tools, help you navigate a filesystem, and even run UNIX tools from within your Jupyter Lab notebook. - -**Context:** At this point in time, our Jupyter Lab system, using https://ondemand.anvil.rcac.purdue.edu, is new to some of you, and maybe familiar to others. The comfort with which you each navigate this UNIX-like operating system will vary. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within Jupyter Lab. - -**Scope:** bash, Jupyter Lab - -.Learning Objectives -**** -- Distinguish differences in `/home`, `/anvil/scratch`, and `/anvil/projects/tdm`. -- Navigating UNIX via a terminal: `ls`, `pwd`, `cd`, `.`, `..`, `~`, etc. -- Analyzing file in a UNIX filesystem: `wc`, `du`, `cat`, `head`, `tail`, etc. -- Creating and destroying files and folder in UNIX: `scp`, `rm`, `touch`, `cp`, `mv`, `mkdir`, `rmdir`, etc. -- Use `man` to read and learn about UNIX utilities. -- Run `bash` commands from within Jupyter Lab. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data` - -== Questions - -[IMPORTANT] -==== -If you are not a `bash` user and you use an alternative shell like `zsh` or `tcsh`, you will want to switch to `bash` for the remainder of the semester, for consistency. Of course, if you plan on just using Jupyter Lab cells, the `%%bash` magic will use `/bin/bash` rather than your default shell, so you will not need to do anything. -==== - -[NOTE] -==== -While it is not _super_ common for us to push a lot of external reading at you (other than the occasional blog post or article), https://learning.oreilly.com/library/view/learning-the-unix/0596002610[this] is an excellent, and _very_ short resource to get you started using a UNIX-like system. We strongly recommend readings chapters: 1, 3, 4, 5, & 7. It is safe to skip chapters 2, 6, and 8. -==== - -=== Question 1 (1 pt) -[upperalpha] -.. A list of length >=2 of modifications you made to your environment, in a markdown cell. - -Let's ease into this project by taking some time to adjust the environment you will be using the entire semester, to your liking. Begin by launching your Jupyter Lab session from https://ondemand.anvil.rcac.purdue.edu. - -Open your settings by navigating to menu:Settings[Advanced Settings Editor]. - -Explore the settings, and make at least 2 modifications to your environment, and list what you've changed. - -Here are some settings Kevin likes: - -- menu:Theme[Selected Theme > JupyterLab Dark] -- menu:Document Manager[Autosave Interval > 30] -- menu:File Browser[Show hidden files > true] -- menu:Notebook[Line Wrap > on] -- menu:Notebook[Show Line Numbers > true] -- menu:Notebook[Shut down kernel > true] - -Dr. Ward does not like to customize his own environment, but he _does_ use the Emacs key bindings. Jackson _loves_ to customize his own environment, but he _despises_ Emacs bindings. Feel free to choose whatever is most comfortable to you. - -- menu:Settings[Text Editor Key Map > emacs] - -[IMPORTANT] -==== -Only modify your keybindings if you know what you are doing, and like to use Emacs/Vi/etc. -==== - -.Items to submit -==== -- List (using a markdown cell) of the modifications you made to your environment. -==== - -=== Question 2 (1 pt) -[upperalpha] -.. In a markdown cell, what is the absolute path of your home directory in Jupyter Labs? - -In the previous project's question 3, we used a tool called `awk` to parse through a dataset. This was an example of running bash code using the `seminar` kernel. Aside from use the `%%bash` magic from the previous project, there are 2 other straightforward ways to run bash code from within Jupyter Lab. - -The first method allows you to run a bash command from within the same cell as a cell containing Python code. For example, using `ls` can be done like so: - -[source,ipython] ----- -!ls - -import pandas as pd -myDF = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]}) -myDF.head() ----- - -[NOTE] -==== -This does _not_ require you to have other, Python code in the cell. The following is perfectly valid. - -[source,ipython] ----- -!ls -!ls -la /anvil/projects/tdm/ ----- - -With that being said, using this method, each line _must_ start with an exclamation point. -==== - -The second method is to open up a new terminal session. To do this, go to menu:File[New > Terminal]. This should open a new tab and a shell for you to use. You can make sure the shell is working by typing your first command, `man`. 
- -[source,bash] ----- -# man is short for manual, to quit, press "q" -# use "k" or the up arrow to scroll up, or "j" or the down arrow to scroll down. -man man ----- - -Great! Now that you've learned 2 new ways to run `bash` code from within Jupyter Lab, please answer the following question: - -What is the _absolute path_ of the default directory of your `bash` shell? When we say "default directory" we mean the folder that you are "in" when you first run `bash` code in a Jupyter cell or when you first open a Terminal. This is also referred to as the home directory. - -**Relevant topics:** https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/pwd[pwd] - -.Items to submit -==== -- `bash` code to print the full filepath of the default directory (home directory), and the output of running that code. Ex: Kevin's is: `/home/x-kamstut` and Dr Ward's is: `/home/x-mdw`. -==== - -=== Question 3 (1 pt) -[upperalpha] -.. `bash` to navigate to `/anvil/projects/tdm/data` -.. `bash` to print the current working directory -.. `bash` to list the files in the current working directory -.. `bash` to list _all_ of the files in `/anvil/projects/tdm/data/movies_and_tv`, _including_ hidden files -.. `bash` to return to your home directory -.. `bash` to confirm that you are back in your home directory (print your current working directory) - -It is a critical skill to be able to navigate a UNIX-like operating system, and you will very likely need to use UNIX or Linux (or something similar) at some point in your career. For this question, write `bash` code to perform the following tasks in order. In your final submission, please ensure that all of your steps and their outputs are included. - -[WARNING] -==== -For the sake of consistency, please run your `bash` code using the `%%bash` magic. This ensures that we are all using the correct shell (there are many shells), and that your work is displayed properly for your grader. -==== - -. Navigate to the directory containing the datasets used in this course: `/anvil/projects/tdm/data`. -. Print the current working directory. Is the result what you expected? -. Output the `$PWD` variable, using the `echo` command. -. List the files within the current working directory (excluding subfiles). -. Without navigating out of `/anvil/projects/tdm/data`, list _all_ of the files within the the `movies_and_tv` directory, _including_ hidden files. -. Return to your home directory. -. Write a command to confirm that you are back in your home directory. - -[NOTE] -==== -`/` is commonly referred to as the root directory in a UNIX-like system. Think of it as a folder that contains _every_ other folder in the computer. `/home` is a folder within the root directory. `/home/x-kamstut` is the _absolute path_ of Kevin's home directory. -==== - -**Relevant topics:** - -https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/pwd[pwd], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/cd[cd], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/ls[ls] - -.Items to submit -==== -- `bash` to navigate to `/anvil/projects/tdm/data`, and print the current working directory -- `bash` to list the primary files in the current working directory -- `bash` to list _all_ of the files in `/anvil/projects/tdm/data/movies_and_tv`, _including_ hidden files -- `bash` to return to your home directory and confirm you are there. 
-==== - -=== Question 4 (1 pt) -[upperalpha] -.. Write a single command to navigate to the modulefiles directory: `/anvil/projects/tdm/opt/lmod`, then confirm that you are in the correct directory using the `echo` command. (0.5 pts) -.. Write a single command to navigate back to your home directory, using _relative_ paths, then confirm that you are in the correct directory using the 'echo' command. (0.5 pts) - -When running the `ls` command (specifically the `ls` command that showed hidden files and folders), you may have noticed two oddities that appeared in the output: `.` and `..`. `.` represents the directory you are currently in, or, if it is a part of a path, it means "this directory". For example, if you are in the `/anvil/projects/tdm/data` directory, the `.` refers to the `/anvil/projects/tdm/data` directory. If you are running the following bash command, the `.` is redundant and refers to the `/anvil/projects/tdm/data/yelp` directory. - -[source,bash] ----- -ls -la /anvil/projects/tdm/data/yelp/. ----- - -`..` represents the parent directory, relative to the rest of the path. For example, if you are in the `/anvil/projects/tdm/data` directory, the `..` refers to the parent directory, `/anvil/projects/tdm`. - -Any path that contains either `.` or `..` is called a _relative path_ (because it is _relative_ to the directory you are currently in). Any path that contains the entire path, starting from the root directory, `/`, is called an _absolute path_. - -For this question, perform the following operations in order. Each operation should be a single command. In your final submission, please ensure that all of your steps and their outputs are included. - -. Write a single command to navigate to our modulefiles directory: `/anvil/projects/tdm/opt/lmod`. -. Confirm that you are in the correct directory using the `echo` command. -. Write a single command to navigate back to your home directory, however, rather than using `cd`, `cd ~`, or `cd $HOME` without the path argument, use `cd` and a _relative_ path. -. Confirm that you are in the corrrect directory using the `echo` command. - -[NOTE] -==== -If you don't fully understand the text above, _please_ take the time to understand it. It will be incredibly helpful to you, not only in this class, but in your career. You can also come to seminar or visit TA office hours to get assistance. We love to talk to students, and everyone benefits when we all collaborate. -==== - -**Relevant topics:** https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/pwd[pwd], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/cd[cd], https://the-examples-book.com/starter-guides/tools-and-standards/unix/other-topics/special-symbols[special-symbols] - -.Items to submit -==== -- Single command to navigate to the modulefiles directory. -- Single command to navigate back to your home directory using _relative_ paths. -- Commands confirming your navigation steps were successful. -==== - - -=== Question 5 (1 pt) -[upperalpha] -.. Navigate to your scratch directory using environment variables. -.. Run `tokei` on your home directory (use an environment variable). -.. Output the first 5 lines and last 5 lines of `/anvil/datasets/training/anvil-101/batch-test/batch-test-README`. Make sure it is clear which lines are the first 5 and which are the last 5. -.. Output the number of lines in `/anvil/datasets/training/anvil-101/batch-test/batch-test-README` -.. 
Output the size, in bytes, of `/anvil/datasets/training/anvil-101/batch-test/batch-test-README` -.. Output the location of the `tokei` program we used earlier. - -[NOTE] -==== -`$SCRATCH` and `$USER` are referred to as _environment variables_. You can see what they are by typing `echo $SCRATCH` and `echo $USER`. `$SCRATCH` contains the absolute path to your scratch directory, and `$USER` contains the username of the current user. We will learn more about these in the rest of this question. -==== - -Your `$HOME` directory is your default directory. You can navigate to your `$HOME` directory using any of the following commands. - -[source,bash] ----- -cd -cd ~ -cd $HOME -cd /home/$USER ----- - -This is typically where you will work, and where you will store your work (for instance, your completed projects). - -The `/anvil/projects/tdm` space is a directory created for The Data Mine. It holds our datasets (in the `data` directory), as well as data for many of our corporate partners' projects. - -There is one more important location on each cluster, `scratch`. Your `scratch` directory is located at `/anvil/scratch/$USER`, or, even shorter, `$SCRATCH`. `scratch` is meant for use with _really_ large chunks of data. The quota on Anvil is currently 100TB and 1 million files. You can see your quota and usage on Anvil by running the following command. - -[source,bash] ----- -myquota ----- - -[NOTE] -==== -Doug Crabill is one of the Data Mine's extraordinarily wise computer wizards, and he has kindly collated a variety of useful scripts to be publicly available to students. These can be found in `/anvil/projects/tdm/bin`. Feel free to explore this directory and learn about these scripts in your free time. -==== - -One of the helpful scripts we have at our disposal is `tokei`, a code analysis tool. We can use this tool to quickly determine the language makeup of a project. An in-depth explanation of tokei can be found https://github.com/XAMPPRocky/tokei[here], but for now, you can use it like so: - -[source,bash] ----- -tokei /path/to/project ----- - -Sometimes, you may want to know what the first or last few lines of your file look like. https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/head[head] and https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/tail[tail] can help us do that. Take a look at their documentation to learn more. - -One goal of our programs is often to be size-efficient. If we have a very simple program, but it is enormous, it may not be worth our time to download and use. The https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/wc[wc] tool can help us determine the size of our file. Take a look at its documentation for more information. - -[CAUTION] -==== -Be careful. We want the size of the script, not the disk usage. -==== - -Finally, we may often know that a program exists, but we don't know where it is. https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/which[which] can help us find the location of a program. Take a look at its documentation for more information, and use it to solve the last part of this question. - -[TIP] -==== -Commands often have _options_. _Options_ are features of the program that you can trigger specifically. You can see the options of a command in the DESCRIPTION section of the man pages. - -[source,bash] ----- -man wc ----- - -You can see -m, -l, and -w are all options for `wc`.
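-If you just want to skim the page for one particular option, a quick trick (a sketch, assuming the usual `man` and `grep` utilities are available) is to pipe the rendered man page through `grep`: - -[source,bash] ----- -# print only the lines of the wc man page that mention "-l" -# (the "--" tells grep that "-l" is a pattern, not an option) -man wc | grep -- "-l" -----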
Then, to test the options out, you can try the following examples. - -[source,bash] ----- -# using the default wc command. "/anvil/projects/tdm/data/flights/1987.csv" is the first "argument" given to the command. -wc /anvil/projects/tdm/data/flights/1987.csv - -# to count the lines, use the -l option -wc -l /anvil/projects/tdm/data/flights/1987.csv - -# to count the words, use the -w option -wc -w /anvil/projects/tdm/data/flights/1987.csv - -# you can combine options as well -wc -w -l /anvil/projects/tdm/data/flights/1987.csv - -# some people like to use a single "tack" `-` -wc -wl /anvil/projects/tdm/data/flights/1987.csv - -# order doesn't matter -wc -lw /anvil/projects/tdm/data/flights/1987.csv ----- -==== - -**Relevant topics:** https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/pwd[pwd], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/cd[cd], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/head[head], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/tail[tail], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/wc[wc], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/which[which], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/type[type] - -.Items to submit -==== -- Navigate to your scratch directory, and run tokei on your home directory, using only environment variables. -- Print out the first 5 lines and last 5 lines of the `/anvil/datasets/training/anvil-101/batch-test/batch-test-README` file. -- Print out the number of lines in the `/anvil/datasets/training/anvil-101/batch-test/batch-test-README` file. -- Print out the size in bytes of the `/anvil/datasets/training/anvil-101/batch-test/batch-test-README` file. -- Print out the location of the `tokei` program we used earlier in this question. -==== - -=== Question 6 (2 pts) -[upperalpha] -.. Navigate to your scratch directory. -.. Copy the file `/anvil/projects/tdm/data/movies_and_tv/imdb.db` to your current working directory. -.. Create a new directory called `movies_and_tv` in your current working directory. -.. Move the file, `imdb.db`, from your scratch directory to the newly created `movies_and_tv` directory (inside of scratch). -.. Use `touch` to create a new, empty file called `im_empty.txt` in your scratch directory. -.. Remove the directory, `movies_and_tv`, from your scratch directory, including _all_ of the contents. -.. Remove the file, `im_empty.txt`, from your scratch directory. - -Now that we know how to navigate a UNIX-like system, let's learn how to create, move, and delete files and folders. For this question, perform the following operations in order. Each operation should be a single command. In your final submission, please ensure that all of your steps and their outputs are included. - -First, let's review the `cp` command. `cp` is short for copy, and it is used to copy files and folders. The syntax is as follows: - -[source,bash] ----- -cp <source> <destination> ----- - -Next, let's take a look at the `rm` command. `rm` is short for remove, and it is used to remove files and folders. The syntax is as follows: - -[source,bash] ----- -rm <file> -rm -r <directory> ----- - -[WARNING] -==== -Be **very** careful when using this command. If you use `rm` on a file or directory, you very likely will not be able to recover it. There is no "taking it out of the recycling bin".
It is gone. Forever. If you are unsure, please ask for help. -==== - -Finally, let's learn about `touch` and `mkdir`. `touch` is used to create new files, whereas `mkdir` creates new directories. The basic syntax for these is as follows: - -[source,bash] ----- -touch <filename> -mkdir <directory_name> ----- - -With that, you should have all of the knowledge you need to work on this question! Remember, each command has its own unique flags and syntax. When in doubt, use `man` to learn more about a command and its flags before using it haphazardly. - -**Relevant topics:** https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/cp[cp], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/rm[rm], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/touch[touch], https://the-examples-book.com/starter-guides/tools-and-standards/unix/standard-utilities/cd[cd] - -=== Question 7 (1 pt) -[upperalpha] -.. Use terminal autocompletion to print the contents of `hello_there.txt`, and put the contents in a markdown cell in your notebook. - -[IMPORTANT] -==== -This question should be performed by opening a terminal window. menu:File[New > Terminal]. Enter the result/content in a markdown cell in your notebook. -==== - -Tab completion is a feature in shells that allows you to tab through options when providing an argument to a command. It is a _really_ useful feature that you may not know is there unless you are told! - -Here is the way it works, in the most common case -- using `cd`. Have a destination in mind, for example `/anvil/projects/tdm/data/flights/`. Type `cd /anvil/`, and press tab. You should be presented with a small list of options -- the folders in the `anvil` directory. Type `p`, then press tab, and it will complete the word for you. Type `t`, then press tab. Finally, press tab again, but this time press it repeatedly until you've selected `data`. You can then continue to type and press tab as needed. - -Below is an image of the absolute path of a file in Anvil. Use `cat` and tab completion to print the contents of that file. - -image::figure03.webp[Tab completion, width=792, height=250, loading=lazy, title="Tab completion"] - -.Items to submit -==== -- The contents of the file, `hello_there.txt`, in a markdown cell in your notebook. -==== - -=== Submitting your Work -Congratulations, you've finished Project 2! Make sure that all of the below files are included in your submission, and feel free to come to seminar, post on Piazza, or visit some office hours if you have any further questions. - -.Items to submit -==== -- `firstname-lastname-project02.ipynb`. -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in Gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting.
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - -Here is the Zoom recording of the 4:30 PM discussion with students from 28 August 2023: - -++++ - -++++ diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project03.adoc deleted file mode 100644 index c715ca7b6..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project03.adoc +++ /dev/null @@ -1,206 +0,0 @@ -= TDM 20100: Project 3 -- 2023 - -**Motivation:** The need to search files and datasets based on text is common during various parts of the data wrangling process. As an example, `grep` is a powerful UNIX tool that allows you to search text using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated. https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/[(Even professionals can make critical mistakes.)] With that being said, learning some of the basics will come in handy, regardless of the language in which you are using regular expressions. - - -[NOTE] -==== -Regular expressions are not something you will be able to completely escape from. They exist in some way, shape, and form in all major programming languages. Even if you are less-interested in UNIX tools, you should definitely take the time to learn regular expressions. -==== - -**Context:** We've just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, `grep`, and experiment with regular expressions using `grep`, R, and later on, Python. - -**Scope:** `grep`, regular expression basics, utilizing regular expression tools in R and Python - -.Learning Objectives -**** -- Use `grep` to search for patterns within a dataset. -- Use `cut` to section off and slice up data from the command line. -- Use `wc` to count the number of lines of input. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the files in this directory: - -- `/anvil/projects/tdm/data/consumer_complaints/` - -and, in particular, several questions will focus on the data in this file: - -- `/anvil/projects/tdm/data/consumer_complaints/processed.csv` - - -[NOTE] -==== -`grep` stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate `grep`, we will be using it with textual data. - -Let's assume for a second that we _didn't_ provide you with the location of this projects dataset, and you didn't know the name of the file either. With all of that being said, you _do_ know that it is the only dataset with the text `"That's the sort of fraudy fraudulent fraud that Wells Fargo defrauds its fraud-victim customers with. Fraudulently."` in it. You may use 'grep' command to search for the dataset. (Make sure that the single quotation mark in your quote is not a curly quote; you might have to manually re-type it.) 
- -You can start in the `/anvil/projects/tdm/data` directory to reduce the amount of text being searched. In addition, use a wildcard (*) to reduce the directories we search to only directories that start with a `con` inside the `/anvil/projects/tdm/data` directory such as -[source,bash] -/anvil/projects/tdm/data/con* - -Just know that you'd _eventually_ find the file without using the wildcard, but we don't want to waste your time. -==== -[NOTE] -==== -Use `man` to read about some of the options with `grep`. For example, you'll want to search _recursively_ through the entire contents of the directories starting with a `con` with option -R or -r. - -[source, bash] - -grep -Rin 'fraudy fraudulent fraud' /anvil/projects/tdm/data/con* - -- -R: This flag tells grep to search recursively, meaning it will traverse through the specified directory and all of its subdirectories, looking for the pattern in every file it encounters. -- -i: This flag makes the search case-insensitive. So, "FRAUDY", "Fraudy", and "fraudy" would all match. -- -n: With this flag, grep will also display the line numbers in the files where the matches are found. -- 'fraudy fraudulent fraud': This is the pattern grep is looking for. It will search for the exact phrase "fraudy fraudulent fraud" in files. -- /anvil/projects/tdm/data/con*: This is the path where grep should start its search. Specifically, it tells grep to look in the /anvil/projects/tdm/data/ directory and search in all files and directories starting with con. -==== -[TIP] -==== -When you search for this sentence in the file, make sure that you type the single quote in `"That's"` so that you get a regular ASCII single quote. Otherwise, you will not find this sentence. Or, just use a unique _part_ of the sentence that will likely not exist in another file. -==== - -++++ - -++++ - - -=== Question 1 (1 pt) - -[upperalpha] -.. Write a `grep` command that finds the dataset containing the text "朝阳区", searching all directories that start with `air` inside the `/anvil/projects/tdm/data` directory. As with the example given above, your search should be case-insensitive, and your output needs to display the line numbers for the location of the text. - - -=== Question 2 (1.5 pts) - -++++ - -++++ - - -[upperalpha] -.. Use the `head` command to print out the first line _only_ from the file `/anvil/projects/tdm/data/consumer_complaints/processed.csv`. - -+ - -[TIP] -==== -Using the `head` command, we can (in general) quickly print out the first _n_ lines of a file. A csv file typically has a header row to explain what data each column holds. - -[source, bash] - -head -n numberoflines filename -==== -//[arabic] -+ -[start=b] - -.. Print out the first 5 lines from 3 columns, namely: `Date Received`, `Issue` and `Company response to consumer` from the file `/anvil/projects/tdm/data/consumer_complaints/processed.csv` -+ -[TIP] -==== -Use the `cat` command to view all file contents, the `head` command to control the number of rows, and the `cut` command to select columns. - -[source, bash] - -cat filename | head -n rowNumbers | cut -d 'delimiterhere' -f field1,field2,... - -==== -//[arabic] -+ -[start=c] -.. For the _single_ line where we heard about the `"That's the sort of fraudy fraudulent fraud"`, print out these 4 columns: `Date Received`, `Issue`, `Consumer complaint narrative`, and `Company response to consumer`. (Make sure that the single quotation mark in your quote is not a curly quote; you might have to manually re-type it.)
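-If the `cut` field-selection syntax is still new to you, here is a toy illustration on made-up input (not the real file): - -[source,bash] ----- -# pick the 2nd and 4th comma-separated fields; this prints: b,d -echo 'a,b,c,d' | cut -d',' -f2,4 -----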
- -[TIP] -==== -Use `cat`, `head`, `tail`, and `cut` commands to isolate the 4 columns and the _single_ line. - -You can find the exact line from the file where the "fraudy fraudulent fraud" occurs, by using the `-n` option from `grep`. That will tell you the line number, which can then be used with `head` and `tail` to isolate the single line. - -[source, bash] - -cat filename | grep 'patternhere' | cut -d 'delimiterhere' -f field1,field2,field3,field4 -==== - - -=== Question 3 (2 pts) - -++++ - -++++ - -++++ - -++++ - -//[arabic] -[upperalpha] - -.. From the file `/anvil/projects/tdm/data/consumer_complaints/processed.csv`, use a one line statement to create a _new_ dataset called `midwest.csv` that has the following requirements: - - * it will only contain the data for these five states: - Indiana (IN), Ohio (OH), Illinois (IL), Wisconsin (WI), and Michigan (MI) - * it will only contain these five columns: `Date Received`, `Issue`, `Consumer complaint narrative`, `Company response to consumer`, and `state` -+ -[TIP] -==== -- Be careful that you don't accidentally get lines with a word like "AGILE" in them (IL is the state code of Illinois and is present in the word "AGILE"). -- Use the `>` redirection operator to create the new file, e.g., -[source, bash] -createthefile > midwest.csv - -==== -//[arabic] -[start=b] -.. Please describe how many rows of data are in the new file, and find the size of the new file in megabytes. - -[TIP] -==== -- Use `wc` to count rows -- Use `cut` to isolate _just_ the data we ask for. For example, _just_ print the number of rows, and _just_ print the value (in Mb) of the size of the file: - -[source, bash] - -cut -d 'delimiterhere' -f positionofrequestedfield -==== - -.output like this ----- -520953 ----- - -.output not like this ----- -520953 /home/x-nzhou1/midwest.csv ----- - -=== Question 4 (1.5 pt) - -//[arabic] -[upperalpha] -.. Use the `grep` command to get information from the _new_ data set `midwest.csv` to find the number of rows that contain one (or more) of the following words (the search is case-insensitive): "improper", "struggling", or "incorrect". - - -=== Question 5 (2 pts) - -[upperalpha] -.. In the file `/anvil/projects/tdm/data/consumer_complaints/processed.csv`, which date appears the most in the `Date received` column? -.. In the file `/anvil/projects/tdm/data/consumer_complaints/processed.csv`, for each category of `Product`, how many times does that type of product appear in the data set? - -Project 03 Assignment Checklist -==== -- Code used to solve questions 1 to 5 -- Output from running the code -- Copy the code and outputs to a new Jupyter notebook - * `firstname-lastname-project03.ipynb`. -- Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project04.adoc deleted file mode 100644 index 09534c0e5..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project04.adoc +++ /dev/null @@ -1,171 +0,0 @@ -= TDM 20100: Project 4 -- 2023 - - -**Motivation:** Becoming comfortable piping commands in a chain, and getting used to navigating files in a terminal, are important skills for every data scientist to learn. These skills will give you the ability to quickly understand and manipulate files in a way which is not possible using tools like Microsoft Office, Google Sheets, etc. You may find that these UNIX tools are really useful for analyzing data. - -**Context:** We've been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called 'piping'. - -**Scope:** grep, regular expression basics, UNIX utilities, redirection, piping - -.Learning Objectives -**** -- Use `cut` to section off and slice up data from the command line. -- Use `|` piping to string UNIX commands together. -- Use `sort` and its options to sort data in different ways. -- Use `head` to isolate n lines of output. -- Use `wc` to summarize the number of lines in a file or in output. -- Use `uniq` to filter out non-unique lines. -- Use `grep` to search files effectively. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/stackoverflow/unprocessed/*` -- `/anvil/projects/tdm/data/stackoverflow/processed/*` -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` - -== Questions - -[WARNING] -==== -For this project, please submit a `.sh` text file with all of your `bash` code written inside of it. This should be submitted _in addition to_ your notebook (the `.ipynb` file). Failing to submit the accompanying `.sh` file may result in points being removed from your final submission. Thanks! -==== - - -=== Question 1 (2 pts) - -++++ - -++++ - - -[NOTE] -==== -The following statement will check how many columns are found in this csv file: - -[source,bash] -cat /anvil/projects/tdm/data/stackoverflow/unprocessed/2011.csv | tr ',' '\n' | wc -l - -BUT this file is a little bit strange, because it only has 1 large line. (In fact, there is no line ending at the end of the line, so `wc` says that the file has 0 lines!) - -[source,bash] -wc /anvil/projects/tdm/data/stackoverflow/unprocessed/2011.csv - -In the question below, we want to turn the commas in this file into newline characters, and then count the number of words in the file. - -* In a csv file, the number of columns is usually 1 larger than the number of commas. -* `cat` prints the file -* `head -n10` prints the first 10 lines of the file -* `tr ',' '\n'` replaces all commas with the newline character `\n` -* `wc` counts the number of lines, words, and characters. -==== - -[upperalpha] -..
Please use the commands `head`, `tr`, and `wc` to find out how many words occur in the first 10 lines of the file `/anvil/projects/tdm/data/stackoverflow/unprocessed/2011.csv` - - -=== Question 2 (2 pts) - -++++ - -++++ - -[NOTE] -==== -As you can see, csv files are not always so straightforward to parse. For this particular set of questions, we want to focus on using some other UNIX tools that are more useful on semi-clean datasets, e.g. `awk` - -The following statement outputs the number of columns in each of the first 10 lines of the file: -[source, bash] -head /anvil/projects/tdm/data/stackoverflow/processed/2011.csv | awk -F";" '{print NF}' - -* `awk` can be used for simple data manipulation tasks that involve pattern matching, field extraction, arithmetic, and string operations - - ** -F";": Set the field separator to ;. - ** {print NF}: Print the number of fields in each line. - -We are just starting to introduce `awk`, a little utility that allows us to analyze each line of the data. The main part of an awk command runs once on each line of the data set. - -==== -[upperalpha] - -.. Let's turn our attention to a different file. Use `awk` to find out how many columns appear in the fifth row of the file `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` - -=== Question 3 (2 pts) - -++++ - -++++ - -++++ - -++++ - -[NOTE] -==== -With appropriate commands, the following statement finds the 5 largest orders, in terms of the number of `Bottles Sold` -[source, bash] -cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | cut -d';' -f21 | sort -nr | head -n 5 - -* `cat` is used to display the entire content of the file -* `cut` is a UNIX command used to remove or "cut out" certain sections of each line from a file or the output of a command. -** -d ';' specifies that the delimiter (or separator) between fields is the semicolon (;). -** -f21 tells cut to only retrieve the 21st field/column (`Bottles Sold` column) based on the semicolon delimiter. So, after this command, you'll get only the `Bottles Sold` values from the 21st column of the file `iowa_liquor_sales_cleaner.txt`. -* `sort` arranges lines of text alphabetically or numerically. -** -n means "numeric sort", so the values are treated as numbers and not as strings. -** -r means "reverse", so the output will be in descending order -* `head` is used to display only the first 5 lines - -==== -[upperalpha] -.. Use UNIX commands to find out what are the 6 highest 'state bottle retail' prices from the file `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` and what are the analogous item descriptions for these 6 items? (Some are repeated, and that is OK.) - -[TIP] -==== -* column 16 is for 'item description' and column 20 is for 'state bottle retail' price -==== - -=== Question 4 (2 pts) - -++++ - -++++ - -[NOTE] -==== -Here is another example. We can pipeline `cat`, `cut`, `sort`, and `uniq` to display how many times each unique bottle volume appears in the file -[source,bash] -cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | cut -d';' -f18 | sort -n | uniq -c - -* column 18 (-f18) is for 'Bottle Volume (ml)' -* `uniq` with the `-c` option finds the number of occurrences of each outcome -==== -[upperalpha] - -..
Please find out how many times each bottle volume appears in the file - -[TIP] -==== -* column 18 indicates the bottle volume -==== - - - - -Project 04 Assignment Checklist -==== -* Jupyter Lab notebook with your code and comments for the assignment - ** `firstname-lastname-project04.ipynb`. -* A `.sh` text file with all of your `bash` code and comments written inside of it - ** bash code and comments used to solve questions 1 through 4 -* Submit files through Gradescope -==== -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project05.adoc deleted file mode 100644 index a618b249b..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project05.adoc +++ /dev/null @@ -1,180 +0,0 @@ -= TDM 20100: Project 5 -- 2023 - -**Motivation:** `awk` is a utility designed for text processing. While Python and R definitely have their place in the data science world, awk is a handy way to process data with just one line of analysis. - -**Context:** `awk` is a powerful tool that can be used to perform a variety of the tasks for which we previously used other UNIX utilities. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner, in pipelines of tools. - -**Scope:** awk, UNIX utilities - -.Learning Objectives -**** -- Use awk to process and manipulate textual data. -- Use piping and redirection within the terminal to pipe the output data from one tool to become the input data for the next tool. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/stackoverflow/unprocessed/2011.csv` - -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` - -[NOTE] -==== -While the UNIX tools we've used up to this point are very useful, `awk` enables many new capabilities, and can even replace major functionality of other tools. - -`awk` is a text-processing tool in Unix. It scans and processes text based on patterns. `awk` is a versatile tool, ideally used for tasks in data that is organized in columns. It provides pattern matching, field-based calculations, and file formatting tasks, all performed efficiently from the command line. - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -Here is an example that uses `awk` to find the number of people in each salary range in the data set `/anvil/projects/tdm/data/stackoverflow/processed/2011.csv` - -[source,bash] -cat /anvil/projects/tdm/data/stackoverflow/processed/2011.csv | awk -F";" '{print $16}' | sort | uniq -c | sort -n - -The `cat` command prints the entire file, - -but instead of outputting the entire file, we send it to the awk command.
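-If the `awk` field syntax is new to you, here is a toy illustration of that field-printing step on made-up input (not the real data): - -[source,bash] ----- -# -F';' sets the field separator; $2 is the second field, so this prints: medium -echo 'low;medium;high' | awk -F';' '{print $2}' ----- - -The real pipeline works the same way, just with the 16th field of a much bigger file.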
- -In the awk command, we use the semicolon as a separator, and we print the 16th field, which contains the salary information. - -Then we sort this data, so that all entries that are the same are next to each other. - -Then we find how many values of each type occur. - -Finally, we sort the responses according to how many times they occur. - -To make this example more interesting, we can simply add the 14th field as well, and then we are classifying responses according to the salary range and according to the person's favorite operating system. - -[source,bash] -cat /anvil/projects/tdm/data/stackoverflow/processed/2011.csv | awk -F";" '{print $16, $14}' | sort | uniq -c | sort -n - -==== - -Here is another example: - -[NOTE] -==== -The prices of the purchases for this file are in the 19th field: - -[source,bash] -cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | awk -F";" '{print $19}' | head - -We can add all of the prices as follows: - -[source,bash] -cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | awk -F";" '{myprices += $19} END{print myprices}' - -There are 283 million dollars of sales altogether! - -We can find the amount of sales of BOURBON like this: - -[source,bash] -cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | grep "BOURBON" | awk -F";" '{myprices += $19} END{print myprices}' - -or like this: - -[source,bash] -cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | awk -F";" '{if ($0 ~ /BOURBON/) {myprices += $19}} END{print myprices}' - -Either way, bourbon accounts for 24 million dollars of the sales. - -Champagne sales, on the other hand, are only 10206 dollars together: - -[source,bash] -cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | awk -F";" '{if ($0 ~ /CHAMPAGNE/) {myprices += $19}} END{print myprices}' - -or equivalently: - -[source,bash] -cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | grep "CHAMPAGNE" | awk -F";" '{myprices += $19} END{print myprices}' - -==== - - -== Questions - -=== Question 1 (1 pt) - -++++ - -++++ - -[upperalpha] -.. What is the total cost of purchases with `WHISKIES` in the title? -.. What is the total cost of all purchases from `CEDAR RAPIDS` (not just `WHISKIES`; consider all purchases)? - -=== Question 2 (2 pts) - -++++ - -++++ - -[upperalpha] -.. What `Store Name` had the largest number of purchases (not the largest total cost, but the largest number of purchases; please consider each line to be 1 purchase)? -.. Using the `Store Name` identified in Question 2A, what was the total cost of all purchases from this `Store Name`? - -=== Question 3 (2 pt) - -++++ - -++++ - -[upperalpha] -.. Please compute the total volume (in liters) of all purchases sold in the file `iowa_liquor_sales_cleaner.txt` -.. Please compute the total volume (in liters) of `VODKA 80 PROOF` sold in the file `iowa_liquor_sales_cleaner.txt` - -=== Question 4 (2 pts) - -++++ - -++++ - -[upperalpha] -.. When looking at which location has the largest number of purchases, if we use the address (instead of the store name), we should include the `Address`, `City`, and `Zip Code`. Using these three variables (together), what location has the largest number of purchases? -.. Does your answer to Question 4A agree with your answer to Question 2A? How do you know? (Please explain why, and/or use some analysis to justify your answer.) - -=== Question 5 (1 pt) - -++++ - -++++ - -[upperalpha] -..
`awk` is powerful, and this liquor dataset is pretty interesting! We haven't covered everything `awk` (and we won't). Look at the dataset and ask yourself an interesting question about the data. Use `awk` to solve your problem (or, at least, get you closer to answering the question). Optionally: You can explore various stackoverflow questions about `awk` and `awk` guides online. Try to incorporate an `awk` function you haven't used, or a `awk` trick you haven't seen. While this last part is not required, it is highly encouraged and can be a fun way to learn something new. - -Please be sure to put a brief explanation about your work in Question 5 using awk to study something interesting that *YOU FOUND* in the data in Question 5. - -[NOTE] -==== -You do not need to limit yourself to _just_ use `awk`, but try to do as much using just `awk` as you are able. -==== - -Project 05 Assignment Checklist -==== -* Jupyter Lab notebook with your code and comments for the assignment - ** `firstname-lastname-project05.ipynb`. -* A `.sh` text file with all of your `bash` code and comments written inside of it - ** bash code and comments used to solve questions 1 through 5 -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project06.adoc deleted file mode 100644 index a7e62b0eb..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project06.adoc +++ /dev/null @@ -1,156 +0,0 @@ -= TDM 20100: Project 6 -- 2023 - -**Motivation:** `awk` is a programming language designed for text processing. It can be a quick and efficient way to quickly parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation extremely quickly using something like `awk`. - -**Context:** This is the second of three projects where we introduce `awk`. `awk` is a powerful tool that can be used to perform a variety of the tasks that we've previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner. - -**Scope:** awk, UNIX utilities - -.Learning Objectives -**** -- Use awk to process and manipulate textual data. -- Use piping and redirection within the terminal to pass around data between utilities. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/restaurant/orders.csv` -- `/anvil/projects/tdm/data/whin/observations.csv` - -== Questions - -=== Question 1 (1 pt) - -++++ - -++++ - -++++ - -++++ - -[loweralpha] - -.. How many columns and rows are in the following dataset: `/anvil/projects/tdm/data/restaurant/orders.csv`. 
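-As a generic sketch (on a hypothetical comma-delimited file `toy.csv` with a header row, not necessarily the layout of the real dataset), one common way to get these two counts is: - -[source,bash] ----- -# number of rows (including the header row) -wc -l toy.csv -# number of columns, based on the first line -head -n 1 toy.csv | awk -F',' '{print NF}' -----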
The following is example output - -.output ----- -rows: 12345 -columns: 12345 ----- - -=== Question 2 (1 pt) - -++++ - -++++ - -++++ - -++++ - -[loweralpha] - -.. Please list all possible values of "Location Type" in the file - -`/anvil/projects/tdm/data/restaurant/orders.csv` - -and how many times each value occurs. - -Your output should give each location type, followed by the number of orders for that Location Type. Use `awk` to answer this question. Make sure to format the output as follows: - -.output ----- -Location Type Number of Orders --------------- ---------------- -AAA 12345 -bb 99999 ----- - -=== Question 3 (2 pts) - -++++ - -++++ - -[loweralpha] - -.. What is the year range for the data in the dataset: - -`/anvil/projects/tdm/data/restaurant/orders.csv`? - - - -=== Question 4 (2 pts) - -++++ - -++++ - - -[loweralpha] -.. What is the sum of the order amounts for each year in the data set - -`/anvil/projects/tdm/data/restaurant/orders.csv`? - -Please make sure the output format is the following: - -.output ----- -Year Summary of Orders in dollars -2019 $PUT THE TOTAL DOLLAR AMOUNT HERE ----- - -NOTE: It is totally OK if you put the dollar amount in scientific notation (that will probably happen by default when you add up the dollar amounts, because there were a lot of restaurant orders!) - -ANOTHER NOTE: There is only 1 year (namely, 2019) in this data set. - -=== Question 5 (2 pts) - -++++ - -++++ - - -[loweralpha] -.. Please extract both the years and months from the file: - -`/anvil/projects/tdm/data/whin/observations.csv` - -and count how many times each year-and-month pair occurs. - -Your output should give each year-and-month value, followed by the number of times that this year-and-month appears. Use `awk` to answer this question. You likely will need to use awk twice in a pipeline. Make sure to format the output as follows: - -.output ----- -Month and Year Number of Occurrences --------------- -------------------- -2020-06 12345 -2020-07 99999 ----- - - - -Project 06 Assignment Checklist -==== -* Jupyter notebook with your code, comments and output for questions 1 to 5 - ** `firstname-lastname-project06.ipynb`. -* A `.sh` text file with all of your `bash` code and comments written inside of it - ** bash code and comments used to solve questions 1 through 5 - -* Submit files through Gradescope -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project07.adoc deleted file mode 100644 index 574ab7fb9..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project07.adoc +++ /dev/null @@ -1,152 +0,0 @@ -= TDM 20100: Project 7 -- 2023 -:page-mathjax: true - -**Motivation:** `awk` is a programming language designed for text processing. It can be a quick and efficient way to parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation extremely quickly using something like `awk`.
- -**Context:** This is the third of three projects where we introduce `awk`. `awk` is a powerful tool that can be used to perform a variety of the tasks that we've previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner. - -**Scope:** awk, awk arrays, UNIX utilities - -.Learning Objectives -**** -- Use awk arrays to efficiently store sets of data -- Use awk and functions to process and manipulate data. -- Use piping and redirection within the terminal to pass around data between utilities. -**** -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` -- `/anvil/projects/tdm/data/election/escaped2020sample.txt` -- `/anvil/projects/tdm/data/flights/1990.csv` - -[NOTE] -==== -In `awk`, arrays are associative, meaning you can store data with key-value pairs. This makes it very efficient to manage data, especially for large files. We can index our data in easier ways than in many other programming languages. -Awk arrays are versatile and powerful tools: you can index them using strings and numbers. -==== - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - - -=== Question 1 (2 pts) - -Consider the dataset `iowa_liquor_sales_cleaner.txt`. - -In Project 05's Question 2, you were asked to find out which Store Name had the most purchases, and once you obtained the store name, you calculated the total cost from that store. Now let's use `awk associative arrays` to list each of the top 10 stores' total cost of purchases (per store). - -(In Project 5 we used column 19, which was the `State Bottle Cost`. This time, instead, let's focus on column 22.) - -For this question, please use column 22, which is the `Sale (Dollars)`. - -[loweralpha] -.. Find the total cost of the purchases for each store. For the output, show only the top 10 stores, in terms of the total cost of the purchases. List each of these top 10 stores and the total cost of the purchases of each. - -[HINT] -==== -Use column 4, which is the Store Name, for the index in the associative array. -Use column 22, which is the Sale (Dollars), for the values to add up. -When you print your results, it would help to print `mytotal[i]` first, and then print `i` second, so that you can sort your results numerically, using `sort -n`. -==== - -[HINT] -==== -You might want to use `sort -g` or `sort -gr` so that your numbers are sorted with the scientific notation allowed. The `-g` option specifically tells `sort` that some of your numbers may be in scientific notation. -==== - -=== Question 2 (2 pts) - -Let's look at the dataset `escaped2020sample.txt`. - -[loweralpha] -.. This dataset contains the CITY (column 9), STATE (column 10), and TRANSACTION_AMT (column 15) for each donation. Please calculate the total transaction amounts for each city/state pair from this dataset (for instance, West Lafayette, IN). For your answer, list the top 10 city/state pairs with the largest total transaction amounts. Please use awk associative arrays for this question. - -[WARNING] -==== -The values in columns 9, 10, and 15 all have double quotes on them. This will make it hard to add the values in column 15.
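-Here is a tiny illustration of the problem on toy input (not the election data): when a field still starts with a double quote, `awk` treats it as non-numeric and adds 0. - -[source,bash] ----- -# the quoted values sum to 0, because awk coerces "5" and "7" to 0 -printf '"5"\n"7"\n' | awk '{total += $1} END {print total}' - -# after stripping the quotes, the same sum is 12 -printf '"5"\n"7"\n' | sed 's/"//g' | awk '{total += $1} END {print total}' -----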
You might want to add this short sequence into your pipeline, to remove the double quotes: - -`sed 's/"//g'` (the "s" is for substitute and the "g" is for global, and we are removing the double quotes and replacing them with nothing!) - -Compare these two lines, for instance: - -`cat /anvil/projects/tdm/data/election/escaped2020sample.txt | awk -F"|" '{print $9, $10, $15}' | head` - -versus this line: - -`cat /anvil/projects/tdm/data/election/escaped2020sample.txt | sed 's/"//g' | awk -F"|" '{print $9, $10, $15}' | head` - -I hope that helps! By removing the double quotes, you will be able to add the values in the 15th column. (If you have double quotes present, then awk will not be able to add up the values in the 15th column properly, so you need to remove the double quotes.) -==== - -[TIP] -==== -* Since there are some cities that share the same city name but are in different states, we need to combine the city and state into a city/state pair, to differentiate the locations. This is demonstrated with the year,month pair, in the fourth introductory video, at the start of the project, and is demonstrated again with the year,month,day triple near the end of the fourth video. You can do this in a similar way, using the city,state pair just like you used the year,month pair. -==== - -=== Question 3 (2 pts) - -Now let us take a look at the dataset `1990.csv`. - -[loweralpha] -.. You may have noticed that the "FlightDate" column (6th column) contains dates formatted as "1990-01-31". Please write an awk command to extract the year and month (not the day) from this column and then reformat them as (for instance) "01/1990". For your output, print each of the twelve months from 1990, and the number of flights that occur during each of those months. - -[TIP] -==== -You do NOT need associative arrays for this question. You just need to use `cut` or `awk` to extract the year,month pair. -==== - -=== Question 4 (2 pts) - -[loweralpha] -.. Use `awk` to create a new dataset from `1990.csv` called `1990_flight_info.csv`. This new file should include the following columns: the flight_month_year (MM/YYYY) just the same as you created in question 3, and also the total number of flights during that month_year. Order the results according to the number of flights in the month_year, from smallest to largest. The header of this file called `1990_flight_info.csv` should look like: - -.columns ----- -flight_month_year;total_number_of_flights ----- - -[TIP] -==== -You do NOT need associative arrays for this question. We are just learning how to store the results of a bash pipeline into a file, using the `>` symbol. -==== - -[TIP] -==== -Use `>` to _redirect_. You can output from the `awk` command to a new file with this operator. If you were to replace `>` by `>>` it would _append_ instead of _replace_. In other words, if you use a single `>` it will first erase the output file before adding the results of the `awk` command to the file. If you use `>>`, it will append the results. -==== - -[NOTE] -==== -Make sure to submit the file `1990_flight_info.csv` when you upload your files to Gradescope. -==== - -Project 07 Assignment Checklist -==== -* Jupyter notebook with your code, comments and output for questions 1 to 4 - ** `firstname-lastname-project07.ipynb`.
-* A `.sh` text file with all of your `bash` code and comments written inside of it - ** bash code and comments used to solve questions 1 through 4 -* The output file from question 4, called: 1990_flight_info.csv -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project08.adoc deleted file mode 100644 index d67c6240e..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project08.adoc +++ /dev/null @@ -1,173 +0,0 @@ -= TDM 20100: Project 8 -- 2023 - -**Motivation:** Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. Learning SQL is well worth your time! - -**Context:** There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we've spent much of this semester in the terminal, we will start in the terminal using SQLite. - -**Scope:** SQL, SQlite - -.Learning Objectives -**** -- Explain the advantages and disadvantages of using a database -- Describe basic database concepts like: RDBMS, tables, fields, query, join, clause. -- Basic clauses: select, limit, where, from, etc. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -For this project, we will be using the `lahman` sqlite database. This database contains the data in the directory - -- `/anvil/projects/tdm/data/lahman` - -You may get some more `lahman` database information from this youtube video http://youtube.com/watch?v=tS_-oTbsDzs -[2023 SABR Analytics:Sean Lahman, "introduction to Baseball Databases"] - -To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database. - -[source,ipython] ----- -%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db ----- - -For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells. - -== Questions - -=== Question 1 (2 pts) - -++++ - -++++ - -Get started by taking a look at the available tables in the Lahman database. - -[loweralpha] -.. What tables are available in the Lahman database? - -[TIP] -==== -You'll want to prepend `%%sql` to the top of the cell -- it should be the very first line of the cell (no comments or _anything_ else before it). - -[source,ipython] ----- -%%sql - --- Query here ----- - -In SQLite, you can show the tables using the following query: - -[source, sql] ----- -.tables ----- - -Unfortunately, SQLite-specific functions can't be run in a Jupyter Lab cell like that. Instead, we need to use a different query. 
- -[source, sql] ----- -SELECT tbl_name FROM sqlite_master WHERE type='table'; ----- -==== - -=== Question 2 (2 pts) - -++++ - -++++ - -[loweralpha] -.. It's always a good idea to learn what your table(s) looks like. A good way to do this is to get the first 5 rows of data from the table(s). Write and run queries that return the first 5 rows of data for the `people` table, the `batting` table, the `fielding` table, the `managers` table, and 2 more tables of your choice (you can pick any 2 more tables to consider). - -.. To get a better idea of the size of the data, you can use the `count` clause to get the number of rows in each table. Write and run 6 queries that return the number of rows in each of these 6 tables. - -[TIP] -==== -Run each query in a separate cell, and remember to limit the query to return only 5 rows each. - -You can use the `limit` clause to limit the number of rows returned. -==== - -=== Question 3 (1 pt) - -++++ - -++++ - -Okay, let's dig into the `people` table a little bit. Run the following query. - -[source, sql] ----- -SELECT * FROM people LIMIT 5; ----- - -As you can see, every row has a `playerID` for each player. It is a unique identifier or key for the `people` table. In Question 2, you checked several tables, so you might already notice that a few tables contain this `playerID` such as in table `batting`, `fielding`, `managers` etc. The `playerID` relates data from those tables to the specific player. -[loweralpha] -.. Let us find information about a famous baseball player named `Mike Trout` from the `people` table. - -[TIP] -==== -The `WHERE` clause can be used to filter the results of a query. -Use table fields `nameLast` and `nameFirst` for the query. -==== - - -=== Question 4 (1 pt) - -++++ - -++++ - -Now you understand what the `playerID` means _inside_ the database. - -[source, sql] ----- -SELECT * FROM batting where playerID ='troutmi01' ----- - -The query will output all fields of data for Mike Trout from table `batting` -[loweralpha] -.. First use Mike Trout's `playerID` (from Question 3) to find the number of his home runs in each season. -.. Now make a second query that only displays Mike Trout's data for the year `2022` but includes the playerID, teamID, and number of home runs. - -[TIP] -==== -The `HR` field contains the number of home runs. -==== - -=== Question 5 (2 pts) - -++++ - -++++ - -Now pick a different baseball player (your choice!) and find that baseball player's information in the database. - -[loweralpha] - -.. For this baseball player, please find the baseball player's information from the `people` table -.. Please use the `playerID` to get this player's number of home runs in the year 2022. -.. Please join the `people` table and the `batting` table, to display information from the fields of `nameLast`, `nameFirst`, `weight`, `height`, `birthYear`, and number of home runs in the year 2022, along with the `teamID`, and `yearID`. - -[TIP] -==== -You may refer to the following website for SQLite table join examples https://www.sqlitetutorial.net/sqlite-join/ - -Use `yearID` from the `batting` table for the Year. -==== - -Project 08 Assignment Checklist -==== -* Jupyter notebook with your code, comments and output for questions 1 to 5 - ** `firstname-lastname-project08.ipynb` -* Submit files through Gradescope -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project09.adoc deleted file mode 100644 index 41dada817..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project09.adoc +++ /dev/null @@ -1,127 +0,0 @@ -= TDM 20100: Project 9 -- 2023 - -**Motivation:** Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a "bread-and-butter" skill for anyone working with data. - -**Context:** We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like `AVG`, `COUNT`, and `MAX`. - -**Scope:** SQL, sqlite - -.Learning Objectives -**** -- Describe basic database concepts like: RDBMs, tables, fields, query, clause,etc. -- Basic clauses: select, order by, limit, desc, asc, count, where, from, group by, etc. -- Utilize SQL functions like max, avg, sum, count, cast,round etc. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -For this project, we will be using the `lahman` sqlite database. This database contains the data in the directory - -- `/anvil/projects/tdm/data/lahman` - -You may get some more `lahman` database information from this youtube video http://youtube.com/watch?v=tS_-oTbsDzs[2023 SABR Analytics:Sean Lahman, "introduction to Baseball Databases"] - -To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database. - -[source,ipython] ----- -%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db ----- - -For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells. - -== Questions - -[NOTE] -In previous projects, we used `awk` to parse through and summarize data. Using `SQL` gives us more ways to analyze and summarize data. - -[IMPORTANT] -==== -Make sure all queries limit output to only 100 rows. You may refer to https://www.sqlitetutorial.net/sqlite-limit/[SQLite-Limit Syntax] -If you want the headers to be more descriptive, you can use aliases.You may refer to https://www.tutorialspoint.com/sqlite/sqlite_alias_syntax.htm[SQLite-aliases Syntax] - -==== - -=== Question 1 (2 pts) - -++++ - -++++ - -.. Write a query to find out who won the 2022 World Baseball Series from table `seriespost`? -.. For this champion team, please find out the home runs (hr) rate and runs batted in (rbi) rate in year 2022 from table `batting`. Round the rates to 2 decimals. You may get rates by - * hr_rate = sum of home runs / sum of hits - * rbi_rate = sum of runs batted in / sum of hits - -[TIP] -==== -Use the `sum` aggregate function to calculate the totals, and division to figure out the percentages(rates). - -`cast` is useful to convert integer to real number to do calculation, e.g. 
-[source, sql] -select cast (HR AS REAL) from batting - -Try to do the calculation without `cast`. What do you get? -Also, `round` is useful to round to a decimal. -==== - -=== Question 2 (2 pts) - -++++ - -++++ - -.. For the champion team from question 1, please write a query that counts the number of RBIs for each athlete in the champion team during year 2022, using the `batting` table. Display your output in ascending order. -.. Run the query again, but this time, display the output in descending order. -.. Which athlete has the highest RBIs in this question? Please provide the player's `playerID`, along with their first name and last name, from the `people` table - -[TIP] -==== -* Use `group by` to group for each athlete -* Use `order by` to sort the output -==== - - -=== Question 3 (2 pts) - -++++ - -++++ - -.. Write a query that finds how many times the athlete from question 2 attended All Star Games. -.. Write a query to find out who is the athlete that attended most All Star Games in the entire data set. - - -=== Question 4 (1 pt) - -++++ - -++++ - -.. Write a query that gets the average `salary` for each athlete in the database. Display your output in descending order. Limit the output to 100 rows (i.e., to the top 100 salaries). - -=== Question 5 (1 pt) - -++++ - -++++ - -Now create your own query about a topic that you are interested in. Use at least one of the aggregation functions, such as `min`, `max`, `count`, or `sum`. Be sure to use `group by` and display the results in order with `order by`. - -Project 09 Assignment Checklist -==== -* Jupyter notebook with your code, comments and output for questions 1 to 5 - ** `firstname-lastname-project09.ipynb` -* Submit files through Gradescope -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project10.adoc deleted file mode 100644 index a3e6ad7d6..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project10.adoc +++ /dev/null @@ -1,159 +0,0 @@ -= TDM 20100: Project 10 -- 2023 - -**Motivation:** Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like `MIN`, `MAX`, and `AVG` in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values. - -**Context:** We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values! - -**Scope:** SQL, SQL in R - -.Learning Objectives -**** -- Demonstrate the ability to interact with popular database management systems within R. -- Solve data-driven problems using a combination of SQL and R. -- Basic clauses: SELECT, ORDER BY, LIMIT, DESC, ASC, COUNT, WHERE, FROM, etc. -- Showcase the ability to filter, alias, and write subqueries. 
-- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. Explain when to use having, and when to use where. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -For this project, we will be using the `lahman` sqlite database. This database contains the data in the directory - -- `/anvil/projects/tdm/data/lahman` - -You may get some more `lahman` database information from this youtube video http://youtube.com/watch?v=tS_-oTbsDzs -[2023 SABR Analytics:Sean Lahman, "introduction to Baseball Databases"] - -To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database. - -[source,python] ----- -%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db ----- - -For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells. - -== Questions - -=== Question 1 (1 pt) - -++++ - -++++ - -[loweralpha] -.. Let's say we are interested in the total number of baseball players for each year, from year 2018 to year 2022, respectively. Please write a query to count the total number of players in the appearances table (by year), and display these totals in, descending order, by year. - -.output -year num_of_players - -2022 1495... - -2021 1798... - -2020 1857... - -2019 1870... - -2018 1918... - -Dr Ward just made up some representative numbers here; these are not the exact numbers! - - -[TIP] -==== -* In the query, give an alias to the `yearID` from the `appearances` table, so that `yearID` appears listed as `year`. Similarly, name the counting of distinct players as an `alias` called `num_of_players`. The `alias` is a great way to not only make the headers look good, but aliases can also be used to reduce the text in a query, by giving some intermediate results a shorter name. The following is the basic syntax of column alias. You may get more information from https://www.tutorialspoint.com/sqlite/sqlite_alias_syntax.htm [alias] - -SELECT column_name AS alias_name -FROM table_name -WHERE [condition] - -==== - - -=== Question 2 (2 pts) - -++++ - -++++ - -++++ - -++++ - -Now, let's look into the `teams` table. The `attendance` column provides the total number of audiences that attended a team's home games. We may say that a team is more popular if it has more attendance at its home games. - -.. Please find out what is the average attendance number for *each* team in the `teams` table, during games from 2022 (only). You should have one average attendance number per team. - -.. Now use a subquery to compute the average attendance across all teams and games. Then modify question 2a, to only include teams whose average attendance for the team is larger than the average across all teams and games. - Using an alias, change the attendance column in your query to appear as "average_attendance". - -[TIP] -The `AVG` function will be useful to calculate average attendance - -[TIP] -We can achieve this using a _subquery_. A subquery is a query that is used to embed a query within another query. - - -=== Question 3 (1 pt) - -++++ - -++++ - -If you answered question (2) correctly, you should find that team `Los Angeles Dodgers`, with team ID 'LAN`, had the highest average attendance. We can consider this team as the most popular team in 2022. - -.. 
Please calculate the winning percentage for this team in 2022, using the fields 'W' and 'L' from the `teams` table with the formula: - -[source] ----- -winning_percentage = W/(W+L) ----- - -Use the name `winning_per` for the resulting column. - -[IMPORTANT] -==== -Some of you might get a `0` in your output and wonder why the most popular baseball team had a `0` win percentage! What's happening here? How can you fix this? -==== - -=== Question 4 (2 pts) - -++++ - -++++ - -You now know 2 different applications of the `AS` keyword, and you also know how to use a query as a subquery. Great! - -In the previous project, we were introduced to aggregate functions. We know we can use the `WHERE` clause to filter our results, but what if we wanted to filter our results based on an aggregated column? - -.. Update the query from question (3) to print all teams that have winning percentage from year 2012 to 2022 (inclusive) greater than 55%. You should get 3 teams. Display the results, by win percentage, in descending order. - -[TIP] -==== -See https://www.geeksforgeeks.org/having-vs-where-clause-in-sql/[this article] for more information on the `HAVING` and `WHERE` clauses. -==== - - -=== Question 5 (2 pts) - -.. Now let's look at `allstarfull` table. Please list all players who attended 20 or more All Star games. List the players in descending order, by the number of All Star games that they attended. -.. Please explore the tables in the database and write a query about some information that you are interested in. Please make sure to use aliasing, a subquery, and at least one aggregate function. - - Project 10 Assignment Checklist -==== -* Jupyter notebook with your code, comments and output for questions 1 to 5 - ** `firstname-lastname-project10.ipynb` -* Submit files through Gradescope -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project11.adoc deleted file mode 100644 index d0099fadf..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project11.adoc +++ /dev/null @@ -1,105 +0,0 @@ -= TDM 20100: Project 11 -- 2023 - -**Motivation:** Databases are (usually) comprised of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so, we perform "joins"! In this project we will explore, learn about, and practice using joins on our database. The database has many tables, so the benefit of using joins will become obvious. - -**Context:** We've introduced a variety of SQL commands that let you filter and extract information from a database in an systematic way. In this project we will introduce joins, a powerful method to combine data from different tables. - -**Scope:** SQL, sqlite, joins - -.Learning Objectives -**** -- Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem. 
-- Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. -- Showcase the ability to filter, alias, and write subqueries. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - - -For this project, we will be using the `lahman` sqlite database. This database contains the data in the directory - -- `/anvil/projects/tdm/data/lahman` - -You may get some more `lahman` database information from this youtube video http://youtube.com/watch?v=tS_-oTbsDzs -[2023 SABR Analytics:Sean Lahman, "introduction to Baseball Databases"] - -To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database. - -[source,python] ----- -%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db ----- - -For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells. - -== Questions - -=== Question 1 (2 pts) - -In the previous project, you already learned how to get data from a single table. - -Now that we are learning about _joins_, so that we will have the ability to make much more interesting queries! - -[NOTE] -==== -You may get more information on joins here: https://the-examples-book.com/programming-languages/SQL/joins -==== - -[NOTE] -==== -Table `batting` contains a field H (hits) and a field AB (at-bats). We can calculate the batting average (BA) by the formula - -AVG = H/AB - -A batting average is an indicator that shows a batter's ability to produce offensively. - -You may get more batting average information from Wikipedia: https://en.wikipedia.org/wiki/Batting_average - -==== - -.. Please find the 10 players with the lowest batting average for the year 2022. Use the batting table and INNER JOIN with the people table to get players' first name and last name. The output will contain following fields: playerID, player's first name, player's last name, and their battingAverage. - - -=== Question 2 (2 pts) - -When considering the batting average, pitchers often have a significantly lower bating average, because they are not trained hitters. To focus on regular batters, pitchers need to be excluded. - -.. Use the `appearances` table, to find out the players who are pitchers for the year 2022. - -.. Return to the query from Question 1, but this time, use a subquery to exclude the pitchers from Question 2a. - - -[TIP] -Pitchers have field G_p>0 in appearances table - -=== Question 3 (2 pts) - -In question 2, instead of using a sub query, we can use a left join to accomplish the same task. - -.. Modify your query from question 2, to use a left join (instead of a sub query). The goal is the same as question 2b, namely: to get the 10 players (who are not pitchers!) with lowest batting average. - - -=== Question 4 (2 pts) - - -.. Write another query, to find out what is the average batting average for all players (exclude pitchers) in year 2022. - - -Project 11 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project11.ipynb` -* Submit files through Gradescope -==== - - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project12.adoc deleted file mode 100644 index 717839a29..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project12.adoc +++ /dev/null @@ -1,254 +0,0 @@ -= TDM 20100: Project 12 -- 2023 - -**Motivation:** In the previous projects, you've gained experience writing all types of queries, touching on the majority of the main concepts. One critical concept that we _haven't_ yet done is creating your _own_ database. While typically database administrators and engineers will typically be in charge of large production databases, it is likely that you may need to prop up a small development database for your own use at some point in time (and _many_ of you have had to do so this year!). In this project, we will walk through all of the steps to prop up a simple sqlite database for one of our datasets. - -**Context:** We will (mostly) be using the https://www.sqlite.org/[sqlite3] command line tool to interact with the database. - -**Scope:** sql, sqlite, unix - -.Learning Objectives -**** -- Create a sqlite database schema. -- Populate the database with data using `INSERT` statements. -- Populate the database with data using the command line interface (CLI) for sqlite3. -- Run queries on a database. -- Create an index to speed up queries. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The project will use the following datasets: - -* `/anvil/projects/tdm/data/restaurant/orders.csv` -* `/anvil/projects/tdm/data/lahman/lahman.db` - -To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database. For example - -[source,python] ----- -%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db ----- - -For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells. - -To prepare for this project, create a new Jupyter Notebook called `firstname-lastname-project12.ipynb`. You will put the text of your solutions in this notebook. Next, in Jupyter Lab, open a fresh terminal window. We will be able to run the `sqlite3` command line tool from the terminal window. - -Okay, once completed, the first step is schema creation. First, it is important to note. **The goal of this project is to put the data in `/anvil/projects/tdm/data/restaurant/orders.csv` into a sqlite database ** - -With that in mind, run the following (in your terminal) to get a sample of the data. - -[source,bash] ----- -head /anvil/projects/tdm/data/restaurant/orders.csv ----- - -Review the output data. An SQL schema is a set of text or code that defines how the database is structured and how each piece of data is stored. In a lot of ways it is similar to how a data.frame has columns with different types -- just more "set in stone" than the very easily changed data.frame. - -Each database handles schemas slightly differently. 
In sqlite, the database will contain a single schema table that describes all included tables, indexes, triggers, views, etc. Specifically, each entry in the `sqlite_schema` table will contain the type, name, tbl_name, root page, and sql for the database object. - -[NOTE] -==== -For sqlite, the "database object" could refer to a table, index, view, or trigger. -==== - -This detail is more than is needed for right now. If you are interested in learning more, the sqlite documentation is very good, and the relevant page to read about this is https://www.sqlite.org/schematab.html[here]. - -For _our_ purposes, when I refer to "schema", what I _really_ mean is the set of commands that will build our tables, indexes, views, and triggers. sqlite makes it particularly easy to open up a sqlite database and get the _exact_ commands to build the database from scratch _without_ the data itself. For example, take a look at our `lahman.db` database by running the following in your terminal. - -[source,bash] ----- -sqlite3 /anvil/projects/tdm/data/lahman/lahman.db ----- - -This will open the command line interface (CLI) for sqlite3. It will look similar to: - -[source,bash] ----- -sqlite> ----- - -Type `.schema` to see the "schema" for the database. - -[NOTE] -==== -Any command you run in the sqlite CLI that starts with a dot (`.`) is called a "dot command". A dot command is exclusive to sqlite and the same functionality cannot be expected to be available in other SQL tools like Postgresql, MariaDB, or MS SQL. You can list all of the dot commands by typing `.help`. -==== - -After running `.schema`, you should see a variety of legitimate SQL commands that will create the structure of your database _without_ the data itself. This is an extremely useful self-documenting tool that is particularly useful. - -So, now let's study the sample of our `orders.csv` dataset to create a markdown list of key:value pairs for each column in the dataset. Each _key_ should be the title of the column, and each _value_ should be the _type_ of data that is stored in that column. - -++++ - -++++ - - -== Questions - -=== Question 1 (2 pts) - -++++ - -++++ - - -.. Create a markdown list of key:value pairs for each column in the `orders.csv` dataset. Each _key_ should be the title of the column, and each _value_ should be the _type_ of data that is stored in that column. - -For example, your solution might be given like this: - -- akeed_order_id: INTEGER -- customer_id: TEXT -- etc., etc. - -where the _value_ is one of the 5 "affinity types" (INTEGER, TEXT, BLOB, REAL, NUMERIC) in sqlite. See section "3.1.1" https://www.sqlite.org/datatype3.html[here]. - -We just showed akeed_order_id and customer_id to give examples about how the first two variables in the data set should be classified. - - -As a side note: Okay, you may be asking, "what is the difference between INTEGER, REAL, and NUMERIC?". Great question. In general (for other SQL RDBMSs), there are _approximate_ numeric data types and _exact_ numeric data types. What you are most familiar with is the _approximate_ numeric data types. In R or Python for example, try running the following: - -[source,r] ----- -(3 - 2.9) <= 0.1 ----- - -.Output ----- -FALSE ----- - -[source,python] ----- -(3 - 2.9) <= 0.1 ----- - -.Output ----- -False ----- - -Under the hood, the values are stored as a very close approximation of the real value. This small amount of error is referred to as floating point error. 
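-
-The same issue is visible in SQLite itself, since values with REAL affinity are stored as approximate floating point numbers too. As an optional illustration (not required for this question), you could run something like the following in the `sqlite3` CLI; the exact value printed by the first query depends on your SQLite version, but the comparison should come back false either way:
-
-[source, sql]
-----
-SELECT 3 - 2.9;           -- stored as an approximation of 0.1, not exactly 0.1
-SELECT (3 - 2.9) <= 0.1;  -- returns 0 (false), matching the FALSE/False output above
-----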
There are some instances where it is _critical_ that values are stored as exact values (for example, in finance). In those cases, you would need to use special data types to handle it. In sqlite, this type is NUMERIC. So, for _our_ example, store text as TEXT, numbers _without_ decimal places as INTEGER, and numbers with decimal places as REAL -- our example dataset doesn't have a need for NUMERIC. - - - - -=== Question 2 (2 pts) - -++++ - -++++ - - -.. Create a database named "orders.db" and a table named "orders" by following the instructions below - -[NOTE] -==== -Let's put together our `CREATE TABLE` statement that will create our table in the database. - -See https://www.sqlitetutorial.net/sqlite-create-table/[here] for some good examples. Realize that the `CREATE TABLE` statement is not so different from any other query in SQL, and although it looks messy and complicated, it is not so bad. Name your table `orders`. - -Once you've written your `CREATE TABLE` statement, create a new, empty database by running the following in a terminal: `sqlite3 $HOME/orders.db`. Copy and paste the `CREATE TABLE` statement into the sqlite CLI. Upon success, you should see the statement printed when running the dot command `.schema`. Fantastic! You can also verify that the table exists by running the dot command `.tables`. - -Congratulations! To finish things off, please paste the `CREATE TABLE` statement into a markdown cell in your notebook. In addition, include a screenshot of your `.schema` output after your `CREATE TABLE` statement was run. -==== - - -=== Question 3 (2 pts) - -++++ - -++++ - -The next step in the project is to add the data! After all, it _is_ a _data_ base. You may get how to insert data into table from https://www.sqlitetutorial.net/sqlite-insert/[here] - -.. Please populate the data from `orders.csv` into your `orders` table -.. Connect to "orders.db" and run a query to get the first 5 rows from "orders" table. - - -[TIP] -==== -You could programmatically generate a `.sql` file with the `INSERT INTO` statement, hook the database up with Python or R and insert the data that way, _or_ you could use the wonderful dot commands sqlite like following: - -[source,bash] ----- -.mode csv -.import --skip 1 /anvil/projects/tdm/data/restaurant/orders.csv orders ----- -==== - - -[TIP] -==== -To connect to the database: - -[source,python] ----- -%sql sqlite:///$HOME/orders.db ----- -==== - -[TIP] -==== -To select data from the table: - -[source,python] ----- -%sql SELECT * FROM orders LIMIT 5 ----- -==== - - -=== Question 4 (2 pts) - -++++ - -++++ - -Woohoo! You've successfully created a database and populated it with data from a dataset -- pretty cool! Connect to your database from inside a terminal. - -[source,bash] ----- -sqlite3 $HOME/orders.db ----- - -Now, run the following dot command in order to _time_ our queries: `.timer on`. This will print out the time it takes to run each query. For example, try the following: - -[source, sql] ----- -SELECT * FROM orders LIMIT 5; ----- - -Cool! Time the following query. - -[source, sql] ----- -SELECT * FROM orders ORDER BY created_at LIMIT 10; ----- - -.Output ----- -Run Time: real 0.021 user 0.000261 sys 0.004553 ----- - -Running time is often critical, particularly during large-scale database searches. Let's explore some techniques to enhance performance through the use of indexing in tables. You may get more information about index here: https://www.sqlitetutorial.net/sqlite-index/ - -.. Create an index for column "created_at". 
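-
-[TIP]
-====
-A minimal sketch of what the index creation could look like is shown below. The index name `idx_orders_created_at` is just an illustrative choice -- you can name your index anything you like:
-
-[source, sql]
-----
-CREATE INDEX idx_orders_created_at ON orders (created_at);
-----
-
-After creating the index, run `.schema` to confirm that it appears, then re-run the timed `ORDER BY created_at` query and compare the `Run Time` output.
-====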
- - -Project 12 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project12.ipynb` - -* Sql file 'orders.db' (this file should be approximately 22 MB) -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project13.adoc deleted file mode 100644 index 5b82e6c9f..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project13.adoc +++ /dev/null @@ -1,175 +0,0 @@ -= TDM 20100: Project 13 -- 2023 - -**Motivation:** We've covered a lot about SQL in a relatively short amount of time, but we still haven't touched on some other important SQL topics. In this project, we will touch on some other important SQL topics. - -**Context:** In the previous project, you had the opportunity to take the time to insert data into a `sqlite3` database. There are still many common tasks that you may need to perform using a database: triggers, views, transaction, and even a few `sqlite3`-specific functionalities that may prove useful. - -**Scope:** SQL - -.Learning Objectives -**** -- Create a trigger on your `sqlite3` database and demonstrate that it works. -- Create one or more views on your `sqlite3` database and demonstrate that they work. -- Describe and use a database transaction. Rollback a transaction. -- Optionally, use the `sqlite3` "savepoint", "rollback to", and "release" commands. -- Optionally, use the `sqlite3` "attach" and "detach" commands to execute queries across multiple databases. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -For this project, we will be using the `lahman` sqlite database. This database contains the data in the directory - -- `/anvil/projects/tdm/data/lahman` - -You may get some more `lahman` database information from this youtube video: http://youtube.com/watch?v=tS_-oTbsDzs[2023 SABR Analytics:Sean Lahman, "introduction to Baseball Databases"] - -For every following cell where you want to run a SQL query, prepend `%%sql` to the top of the cell -- just like we do for R or bash cells. - -== Questions - -=== Question 1 (2 pts) - -.. Following the instructions to create a new column and a trigger for table "teams" -.. Update the table "teams" and display the updated information - -[NOTE] -==== -Begin by copying the database from the previous project to your `$HOME` directory. Open up a terminal and run the following. - -[source,bash] ----- -cp /anvil/projects/tdm/data/lahman/lahman.db $HOME ----- - -Go ahead and launch `sqlite3` and connect to the database from your home directory. - -[source,bash] ----- -sqlite3 $HOME/lahman.db ----- - -From within `sqlite3`, test things out to make sure the data looks right. 
- -[source, sql] ----- - -SELECT * FROM teams LIMIT 5; ----- - - -With any luck, things should be working just fine. - -Let's go ahead and create a trigger. A trigger is what it sounds like, given a specific action, _do_ a specific action. This is a powerful tool. One of the most common uses of a trigger that you will see in the wild is the "updated_at" field. This is a field that stores a datetime value, and uses a _trigger_ to automatically update to the current date and time anytime a record in the database is updated. - -First, we need to create a new column called "updated_at", and set the default value to something. In our case, lets set it to January 1, 1970 at 00:00:00. - -[source, sql] ----- -ALTER TABLE teams ADD COLUMN updated_at DATETIME DEFAULT '1970-01-01 00:00:00'; ----- - -If you query the table now, you will see all of the values have been properly added, great! - -[source, sql] ----- -SELECT * FROM teams LIMIT 5; ----- - -Now add a trigger called "update_teams_updated_at" that will update the "updated_at" column to the current date and time whenever a record is updated. Check out the official documentation https://www.sqlite.org/lang_createtrigger.html[here] for examples of triggers. - -Once your trigger has been written, go ahead and test it out by updating the following record. - -[source, sql] ----- -UPDATE teams SET teamRank = 3 WHERE YearID = 2022 AND TEAMID ='ARI'; ----- - -[source, sql] ----- -SELECT * FROM TEAMS WHERE YearID = 2022 AND TEAMID ='ARI' ; ----- - -If it worked right, your `updated_at` column should have been updated to the current date and time, cool! -==== - -=== Question 2 (2 pts) - -[NOTE] -==== -Next, we will touch on _views_. A view is essentially a virtual table that is created from some query and given a name. Why would you want to create such a thing? Well, there could be many reasons. - -Maybe you have a complex query that you need to run frequently, and it would just be easier to see the final result with a click? Maybe the database has horrible naming conventions and you want to rename things in a view to make it more readable and/or queryable? - -After some thought, it may occur to you that we've had such an instance where a view could be nice using our `lahman.db` database! - -You may get more information about "view" here: https://www.sqlitetutorial.net/sqlite-create-view/ -==== - -.. Create a _view_ called "players_with_awards_2020" that will provide information for a player. It should include the player's name, height, weight, and if the play has an award in 2020; use the year 2020 data, joining the "people" and "awardsplayers" tables. -.. Display 5 records from the view "players_with_awards_2020" -[TIP] -==== -- use "playerID" to join two tables -==== - -=== Question 3 (2 pts) - - -Read the official `sqlite3` documentation for transactions https://www.sqlite.org/lang_transaction.html[here]. As you will read, you've already been using transactions each time you run a query! What we will focus on is how to use transactions to _rollback_ changes, as this is probably the most useful use case you'll run into. - -Connect to our "lahman.db" database from question (1), start a _deferred_ transaction, and update a row, similar to what we did before, using the following query. - -[source, sql] ----- -UPDATE teams SET teamRank = 30 WHERE yearID = 2022 AND teamID = 'ARI'; ----- - -Now, query the record to see what it looks like. 
- -[source, sql] ----- -SELECT * FROM teams WHERE yearID = 2022 AND teamID ='ARI' and teamRank = 30; ----- - -[NOTE] -==== -You'll notice our _trigger_ from before is still working, cool! -==== - -This is pretty great, until you realized that the teamRank was not right! Oh no! Well, at this stage you haven't committed your transaction yet, so you can just _rollback_ the changes and everything will be back to normal. Give it a try (again, following the official documentation). - -After rolling back, run the following query. - -[source, sql] ----- -SELECT * FROM teams WHERE yearID = 2022 AND teamID = 'ARI' ; ----- - -As you can see, the data changed back to the original one! As you can imagine, this is pretty powerful stuff, especially if you are writing to a database and want to make sure things look right before _committing_ the changes. - - -=== Question 4 (2 pts) - -SQL and `sqlite3` are powerful tools, and we've barely scratched the surface. Check out the https://www.sqlite.org/docs.html[offical documentation], and demonstrate another feature of `sqlite3` that we haven't yet covered. - -Some suggestions, if you aren't interested in browsing the documentation: https://www.sqlite.org/windowfunctions.html#biwinfunc[window functions], https://www.sqlite.org/lang_mathfunc.html[math functions], https://www.sqlite.org/lang_datefunc.html[date and time functions], and https://www.sqlite.org/lang_corefunc.html[core functions] (there are many we didn't use!) - -Please make sure the queries you run are run from an sql cell in your Jupyter notebook. - - -Project 13 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project13.ipynb` -* Submit the copy of the `lahman.db` file that you made in your home directory. -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project14.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project14.adoc deleted file mode 100644 index fb8d78eb3..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-project14.adoc +++ /dev/null @@ -1,60 +0,0 @@ -= TDM 20100: Project 14 -- Fall 2023 - -**Motivation:** We covered a _lot_ this year! When dealing with data driven projects, it is crucial to thoroughly explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance. As this is our final project for the semester, its primary purpose is survey based. You will answer a few questions mostly by revisiting the projects you have completed. - -**Context:** We are on the last project where we will revisit our previous work to consolidate our learning and insights. 
This reflection also help us to set our expectations for the upcoming semester - -**Scope:** Unix, SQLite, R, Python, Jupyter Lab, Anvil - - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - - -=== Question 1 (1 pt) - -.. Reflecting on your experience working with different datasets, which one did you find most enjoyable, and why? Discuss how this dataset's features influenced your analysis and visualization strategies. Illustrate your explanation with an example from one question that you worked on, using the dataset. - -=== Question 2 (1 pt) - -.. Reflecting on your experience working with different commands, functions, modules, and packages, which one is your favorite, and why do you enjoy learning about it? Please provide an example from one question that you worked on, using this command, function, module, or package. - -=== Question 3 (2 pts) - -.. While working on the projects, including statistics and testing, what steps did you take to ensure that the results were right? Please illustrate your approach using an example from one problem that you addressed this semester. - -=== Question 4 (2 pts) - -.. Reflecting on the projects that you completed, which question(s) did you feel were most confusing, and how could they be made clearer? Please use a specific question to illustrate your points. - -=== Question 5 (2 pts) - -.. Please identify 3 skills or topics in data science areas you are interested in, you may choose from the following list or create your own list. Please briefly explain the reason you think the topics will be beneficial, with examples. - -- database optimization -- containerization -- machine learning -- generative AI -- deep learning -- cloud computing -- DevOps -- GPU computing -- data visualization -- time series and spatial statistics -- predictive analytics -- (if you have other topics that you want Dr Ward to add, please feel welcome to post in Piazza, and/or just add your own topics when you answer this question) - -Project 14 Assignment Checklist -==== -* Jupyter Lab notebook with your answers and examples. You may just use markdown format for all questions. - ** `firstname-lastname-project14.ipynb` -* Submit files through Gradescope -==== - -WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-projects.adoc deleted file mode 100644 index c9f2169fd..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/20100/20100-2023-projects.adoc +++ /dev/null @@ -1,45 +0,0 @@ -= TDM 20100 - -xref:fall2023/logistics/office_hours_201.adoc[[.custom_button]#TDM 201 Office Hours#] -xref:fall2023/logistics/201_TAs.adoc[[.custom_button]#TDM 201 TAs#] -xref:fall2023/logistics/syllabus.adoc[[.custom_button]#Syllabus#] - -== Project links - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. 
-==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$20100-2023-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -=== Sign up - -https://piazza.com/purdue/fall2023/tdm20100[Sign Up] - -=== Link - -https://piazza.com/purdue/fall2023/tdm20100/home[Homepage] - -== Syllabus - -See xref:fall2023/logistics/syllabus.adoc[here]. diff --git a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project01.adoc deleted file mode 100644 index 0b0df17c9..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project01.adoc +++ /dev/null @@ -1,434 +0,0 @@ -= TDM 30100: Project 1 -- 2023 - -**Motivation:** It's been a long summer! Last year, you got some exposure command line tools, SQL, Python, and other fun topics like web scraping. This semester, we will continue to work primarily using Python with data. Topics will include things like: documentation using tools like sphinx, or pdoc, writing tests, sharing Python code using tools like pipenv, poetry, and git, interacting with and writing APIs, as well as containerization. Of course, like nearly every other project, we will be be wrestling with data the entire time. - -We will start slowly, however, by learning about Jupyter Lab. In this project we are going to jump head first into The Data Mine. We will load datasets into the R environment, and introduce some core programming concepts like variables, vectors, types, etc. As we will be "living" primarily in an IDE called Jupyter Lab, we will take some time to learn how to connect to it, configure it, and run code. - -.Insider Knowledge -[%collapsible] -==== -IDE stands for Integrated Developer Environment: software that helps us program cleanly and efficiently. -==== - -**Context:** This is our first project as a part of The Data Mine. We will get situated, configure the environment we will be using throughout our time with The Data Mine, and jump straight into working with data! - -**Scope:** R, Jupyter Lab, Anvil - -.Learning Objectives -**** -- Read about and understand computational resources available to you. -- Learn how to run R code in Jupyter Lab on Anvil. -- Read and write basic (.csv) data using R. 
-**** - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/1991.csv` -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` -- `/anvil/projects/tdm/data/disney/flight_of_passage.csv` - -== Setting Up to Work - - -++++ - -++++ - - -This year we will be using Jupyter Lab on the Anvil cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster. - -Navigate and login to https://ondemand.anvil.rcac.purdue.edu using your ACCESS credentials (including 2-factor authentication using Duo Mobile). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward. - -[TIP] -==== -If you did not (yet) set up your 2-factor authentication credentials with Duo, you can set up the credentials here: https://the-examples-book.com/starter-guides/anvil/access-setup -==== - -Towards the middle of the top menu, click on the item labeled btn:[My Interactive Sessions]. (Depending on the size of your browser window, there might only be an icon; it is immediately to the right of the menu item for The Data Mine.) On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Bioinformatics, Interactive Apps, and The Data Mine. Under "The Data Mine" section, near the bottom of your screen, click on btn:[Jupyter Notebook]. (Make sure that you choose the Jupyter Notebook from "The Data Mine" section.) - -If everything was successful, you should see a screen similar to the following. - -image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"] - -Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 1 CPU core and 1918 MB of memory. - -[NOTE] -==== -As you can see in the screenshot above, each core is associated with 1918 MB of memory. If you know how much memory your project will need, you can use this value to choose how many cores you want. In this and most of the other projects in this class, 1-2 cores is generally enough. -==== - -[NOTE] -==== -Please use 4 cores for this project. This is _almost always_ excessive, but for this project in question 3 you will be reading in a rather large dataset that will very likely crash your kernel without at least 3-4 cores. -==== - -We use the Anvil cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine. - -After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on this button to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following. - -image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"] - -There are 2 primary options that you will need to know about. - -seminar:: -The `seminar` kernel runs Python code but also has the ability to run R code or SQL queries in the same environment. - -[TIP] -==== -To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/templates[our template page]. 
-==== - -seminar-r:: -The `seminar-r` kernel is intended for projects that **only** use R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell. - -For now, let's focus on the `seminar` kernel. Click on btn:[seminar], and a fresh notebook will be created for you. - - -The first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/_attachments/project_template.ipynb[our project template] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`). - -Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`. - -There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain comments about your work). - -Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A. - -[TIP] -==== -Make sure to read about and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -==== - - -== Questions - -=== Question 1 (1 pt) -[upperalpha] -.. How many cores and how much memory (in GB) does Anvil's sub-cluster A have? (0.5 pts) -.. How many cores and how much memory (in GB) does your personal computer have? - - -++++ - -++++ - - -For this course, projects will be solved using the https://www.rcac.purdue.edu/compute/anvil[Anvil computing cluster]. - -Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (RAM). Use the information on the provided webpages to manually calculate how many cores and how much memory is available for Anvil's "sub-cluster A". - -Take a minute and figure out how many cores and how much memory is available on your own computer. If you do not have a computer of your own, work with a friend to see how many cores there are, and how much memory is available, on their computer. - -[TIP] -==== -Information about the core and memory capacity of Anvil "sub-clusters" can be found https://www.rcac.purdue.edu/compute/anvil[here]. - -Information about the core and memory capacity of your computer is typically found in the "About this PC" section of your computer's settings. -==== - -.Items to submit -==== -- A sentence (in a markdown cell) explaining how many cores and how much memory is available to Anvil sub-cluster A. -- A sentence (in a markdown cell) explaining how many cores and how much memory is available, in total, for your own computer. -==== - -=== Question 2 (1 pt) -[upperalpha] -.. Using Python, what is the name of the node on Anvil you are running on? -.. Using Bash, what is the name of the node on Anvil you are running on? -.. Using R, what is the name of the node on Anvil you are running on? - -++++ - -++++ - -Our next step will be to test out our connection to the Anvil Computing Cluster! Run the following code snippets in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on (in three different languages!). What is the name of the node on Anvil that you are running on? 
- -[source,python] ----- -import socket -print(socket.gethostname()) ----- - -[source,r] ----- -%%R - -system("hostname", intern=TRUE) ----- - -[source,bash] ----- -%%bash - -hostname ----- - -[TIP] -==== -To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu. -==== - -Check the results of each code snippet to ensure they all return the same hostname. Do they match? You may notice that `R` prints some extra "junk" output, while `bash` and `Python` do not. This is nothing to be concerned about as different languages can handle output differently, but it is good to take note of. - -.Items to submit -==== -- Code used to solve this problem, along with the output of running that code. -==== - -=== Question 3 (1 pt) -[upperalpha] -.. Run each of the example code snippets below, and include them and their output in your submission to get credit for this question. - -++++ - -++++ - - -[TIP] -==== -Remember, in the upper right-hand corner of your notebook you will see the current kernel for the notebook, `seminar`. If you click on this name you will have the option to swap kernels out -- no need to do this now, but it is good to know! -==== - -In this course, we will be using Jupyter Lab with multiple different languages. Often, we will center a project around a specific language and choose the kernel for that langauge appropriately, but occasionally we may need to run a language in a kernel other than the one it is primarily built for. The solution to this is using line magic! - -Line magic tells our code interpreter that we are using a language other than the default for our kernel (i.e. The `seminar` kernel we are currently using is expecting Python code, but we can tell it to expect R code instead.) - -Line magic works by having the very first line in a code cell formatted like so: - -`%%language` - -Where `language` is the language we want to use. For example, if we wanted to run R code in our `seminar` kernel, we would use the following line magic: - -`%%R` - -Practice running the following examples, which include line magic where needed. - -python:: -[source,python] ----- -import pandas as pd -df = pd.read_csv('/anvil/projects/tdm/data/flights/subset/1991.csv') ----- - -[source,python] ----- -df[df["Month"]==12].head() # get all flights in December ----- - -SQL:: -[source, ipython] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -[source, sql] ----- -%%sql - --- get all episodes called "Finale" -SELECT * -FROM episodes AS e -INNER JOIN titles AS t -ON t.title_id = e.episode_title_id -WHERE t.primary_title = 'Finale' -LIMIT 5; ----- - -bash:: -[source,bash] ----- -%%bash - -names="John Doe;Bill Withers;Arthur Morgan;Mary Jane;Rick Ross;John Marston" -echo $names | cut -d ';' -f 3 -echo $names | cut -d ';' -f 6 ----- - -[NOTE] -==== -In the above examples you will see lines such as `%%R` or `%%sql`. These are called "Line Magic". They allow you to run non-Python code in the `seminar` kernel. In order for line magic to work, it MUST be on the first line of the code cell it is being used in (before any comments or any code in that cell). - -In the future, you will likely stick to using the kernel that matches the project language, but we wanted you to have a demonstration about "line magic" in Project 1. Line magic is a handy trick to know! 
- -To learn more about how to run various types of code using the `seminar` kernel, see https://the-examples-book.com/projects/templates[our template page]. -==== - -.Items to submit -==== -- Code from the examples above, and the outputs produced by running that code. -==== - -=== Question 4 (2 pts) -[upperalpha] -.. Using Python, calculate how how much memory (in bytes) the A sub-cluster of Anvil has. Calculate how much memory (in TB) the A sub-cluster of Anvil has. (1 pt) -.. Using R, calculate how how much memory (in bytes) the A sub-cluster of Anvil has. Calculate how much memory (in TB) the A sub-cluster of Anvil has. (1 pt) - - -++++ - -++++ - - -[NOTE] -==== -"Comments" are text in code cells that are not "run" as code. They serve as helpful notes on how your code works. Always comment your code well enough that you can come back to it after a long amount of time and understand what you wrote. In R and Python, single-line comments can be made by putting `#` at the beginning of the line you want commented out. -==== - -[NOTE] -==== -Spacing in code is sometimes important, sometimes not. The two things you can do to find out what applies in your case are looking at documentation online and experimenting on your own, but we will also try to stress what spacing is mandatory and what is a style decision in our videos. -==== - -In question 1 we answered questions about cores and memory for the Anvil clusters. This time, we want you to convert your GB memory amount from question 1 into bytes and terabytes. Instead of using a calculator (or paper, or mental math for you good-at-mental-math folks), write these calculations using R _and_ Python, in separate code cells. - -[TIP] -==== -A Gigabyte is 1,000,000,000 bytes. -A Terabyte is 1,000 Gigabytes. -==== - -[TIP] -==== -https://www.datamentor.io/r-programming/operator[This link] will point you to resources about how to use basic operators in R, and https://www.tutorialspoint.com/python/python_basic_operators.htm[this one] will teach you about basic operators in Python. -==== - -.Items to submit -==== -- Python code to calculate the amount of memory in Anvil sub-cluster A in bytes and TB, along with the output from running that code. -- R code to calculate the amount of memory in Anvil sub-cluster A in bytes and TB, along with the output from running that code. -==== - -=== Question 5 (2 pts) -[upperalpha] -.. Load the "flight_of_passage.csv" data into an R dataframe called "dat". -.. Take the head of "dat" to ensure your data loaded in correctly. -.. Change the name of "dat" to "flight_of_passage", remove the reference to "dat", and then take the head of "dat" and "flight of passage" in order to ensure that your actions were successful. - - -++++ - -++++ - - -In the previous question, we ran our first R and Python code (aside from _provided_ code). In the fall semester, we will focus on learning R. In the spring semester, we will learn some Python. Throughout the year, we will always be focused on working with data, so we must learn how to load data into memory. Load your first dataset into R by running the following code. - -[source,ipython] ----- -%%R - -dat <- read.csv("/anvil/projects/tdm/data/disney/flight_of_passage.csv") ----- - -Confirm that the dataset has been read in by passing the dataset, `dat`, to the `head()` function. The `head` function will return the first 5 rows of the dataset. 
- -[source,r] ----- -%%R - -head(dat) ----- - -[IMPORTANT] -==== -Remember -- if you are in a _new_ code cell on the , you'll need to add `%%R` to the top of the code cell, otherwise, Jupyter will try to run your R code using the _Python_ interpreter -- that would be no good! -==== - -`dat` is a variable that contains our data! We can name this variable anything we want. We do _not_ have to name it `dat`; we can name it `my_data` or `my_data_set`. - -Run our code to read in our dataset, this time, instead of naming our resulting dataset `dat`, name it `flight_of_passage`. Place all of your code into a new cell. Be sure there is a level 2 header titled "Question 5", above your code cell. - -[TIP] -==== -In markdown, a level 2 header is any line starting with 2 hashtags. For example, `Question X` with two hashtags beforehand is a level 2 header. When rendered, this text will appear much larger. You can read more about markdown https://guides.github.com/features/mastering-markdown/[here]. -==== - -[NOTE] -==== -We didn't need to re-read in our data in this question to make our dataset be named `flight_of_passage`. We could have re-named `dat` to be `flight_of_passage` like this. - -[source,r] ----- -flight_of_passage <- dat ----- - -Some of you may think that this isn't exactly what we want, because we are copying over our dataset. You are right, this is certainly _not_ what we want! What if it was a 5GB dataset, that would be a lot of wasted space! Well, R does copy on modify. What this means is that until you modify either `dat` or `flight_of_passage` the dataset isn't copied over. You can therefore run the following code to remove the other reference to our dataset. - -[source,r] ----- -rm(dat) ----- -==== - -.Items to submit -==== -- Code to load the data into a dataframe called `dat` and take the head of that data, and the output of that code. -- Code to change the name of `dat` to `flight_of_passage` and remove the variable `dat`, and to take the head of `flight_of_passage` to ensure the name-change worked. -==== - -=== Question 6 (1 pt) - -++++ - -++++ - -Review your Python, R, and bash skills. For each language, choose at least 1 dataset from `/anvil/projects/tdm/data`, and analyze it. Both solutions should include at least 1 custom function, and at least 1 graphic output. - -[NOTE] -==== -Your `bash` solution can be both plotless and without a custom function. -==== - -Make sure your code is complete, and well-commented. Include a markdown cell with your short analysis (1 sentence is fine), for each language. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Submitting your Work - - -++++ - -++++ - -Congratulations, you just finished your first assignment for this class! Now that we've written some code and added some markdown cells to explain what we did, we are ready to submit our assignment. For this course, we will turn in a variety of files, depending on the project. - -We will always require a Jupyter Notebook file. Jupyter Notebook files end in `.ipynb`. This is our "source of truth" and what the graders will turn to first when grading. - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. 
See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this. -==== - -A `.ipynb` file is generated by first running every cell in the notebook (which can be done quickly by pressing the "double play" button along the top of the page), and then clicking the "Download" button from menu:File[Download]. - -In addition to the `.ipynb` file, an additional file should be included for each programming language in the project containing all of the code from that langauge that is in the project. A full list of files required for the submission will be listed at the bottom of the project page. - -Let's practice. Take the R code from this project and copy and paste it into a text file with the `.R` extension. Call it `firstname-lastname-project01.R`. Do the same for each programming language, and ensure that all files in the submission requirements below are included. Once complete, submit all files as named and listed below to Gradescope. - -.Items to submit -==== -- `firstname-lastname-project01.ipynb`. -- `firstname-lastname-project01.R`. -- `firstname-lastname-project01.py`. -- `firstname-lastname-project01.sql`. -- `firstname-lastname-project01.sh`. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - -Here is the Zoom recording of the 4:30 PM discussion with students from 21 August 2023: - -++++ - -++++ diff --git a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project02.adoc deleted file mode 100644 index 5bfc6d1f8..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project02.adoc +++ /dev/null @@ -1,329 +0,0 @@ -= TDM 30100: Project 2 -- 2023 -:page-mathjax: true - -**Motivation:** Documentation is one of the most critical parts of a project. https://notion.so[There] https://guides.github.com/features/issues/[are] https://confluence.atlassian.com/alldoc/atlassian-documentation-32243719.html[so] https://docs.github.com/en/communities/documenting-your-project-with-wikis/about-wikis[many] https://www.gitbook.com/[tools] https://readthedocs.org/[that] https://bit.ai/[are] https://clickhelp.com[specifically] https://www.doxygen.nl/index.html[designed] https://www.sphinx-doc.org/en/master/[to] https://docs.python.org/3/library/pydoc.html[help] https://pdoc.dev[document] https://github.com/twisted/pydoctor[a] https://swagger.io/[project], and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can't go wrong with tools like https://www.sphinx-doc.org/en/master/[Sphinx], or https://pdoc.dev[pdoc]. - -**Context:** This is the first project in a 3-project series where we explore thoroughly documenting Python code, while solving data-driven problems. 
- -**Scope:** Python, documentation - -.Learning Objectives -**** -- Use Sphinx to document a set of Python code. -- Use pdoc to document a set of Python code. -- Write and use code that serializes and deserializes data. -- Learn the pros and cons of various serialization formats. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/apple/health/watch_dump.xml` - -== Questions - -In this project we will work with `pdoc` to build some simple documentation, review some Python skills that may be rusty, and learn about a serialization and deserialization of data -- a common component to many data science and computer science projects, and a key topics to understand when working with APIs. - -For the sake of clarity, this project will have more deliverables than the "standard" `.ipynb` notebook, `.py` file containing Python code, and PDF. In this project, we will ask you to submit an additional PDF showing the documentation webpage that you will have built by the end of the project. How to do this will be made clear in the given question. - -[WARNING] -==== -Make sure to select 3 cores of memory for this project, otherwise you may get an issue reading the dataset in question 3. -==== - -=== Question 1 (2 pts) -[upperalpha] -.. Create a module-level docstring as described below. You will submit the Python file containing this docstring at the end of this project. - -Let's start by navigating to https://ondemand.anvil.rcac.purdue.edu, and launching a Jupyter Lab instance. In the previous project, you learned how to run various types of code in a Jupyter notebook (the `.ipynb` file). Jupyter Lab is actually _much_ more useful. You can open terminals on Anvil (the cluster), as well as open an editor for `.R` files, `.py` files, or any other text-based file. - -Give it a try. In the "Other" category in the Jupyter Lab home page, where you would normally select the "seminar" kernel, instead select the "Python File" option. Upon clicking the square, you will be presented with a file called `untitled.py`. Rename this file to `firstname-lastname-project02.py` (where `firstname` and `lastname` are your first and last name, respectively). - -[TIP] -==== -Make sure you are in your `$HOME` directory when clicking the "Python File" square. Otherwise you may get an error stating you do not have permissions to create the file. -==== - -Read the https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings["3.8.2 Modules" section] of Google's Python Style Guide. Each individual `.py` file is called a Python "module". It is good practice to include a module-level docstring at the top of each module. Create a module-level docstring for your new module. Rather than giving an explanation of the module, and usage examples, instead include a short description (in your own words, 3-4 sentences) of the terms "serialization" and "deserialization". In addition, list a few (at least 2) examples of different serialization formats, and include a brief description of the format, and some advantages and disadvantages of each. Lastly, if you could break all serialization formats into 2 broad categories, what would those categories be, and why? - -[TIP] -==== -Any good answer for the "2 broad categories" will be accepted. 
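If you want something concrete to look at while you think, here is a small, purely illustrative sketch (the example dictionary is invented) that serializes the same object with two different standard-library modules:

[source,python]
----
import json
import pickle

data = {"device": "watch", "heart_rate": [72, 75, 71]}

print(json.dumps(data))    # readable text: {"device": "watch", "heart_rate": [72, 75, 71]}
print(pickle.dumps(data))  # raw bytes beginning with b'\x80...' -- not meant for human eyes
----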
With that being said, a hint would be to think of what the **serialized** data _looks_ like (if you tried to open it in a text editor, for example), or how it is _read_. -==== - -Save your module. - -**Relevant topics:** xref:programming-languages:python:pdoc.adoc[pdoc], xref:programming-languages:python:sphinx.adoc[Sphinx], xref:programming-languages:python:docstrings-and-comments.adoc[Docstrings & Comments] - -.Items to submit -==== -- A module-level docstring with all of the above requirements. -==== - -=== Question 2 (2 pts) -[upperalpha] -.. Write a line of `bash` to generate documentation using pdoc. - -Now, in Jupyter Lab, open a new notebook using the "seminar" kernel. - -[TIP] -==== -You can have _both_ the Python file _and_ the notebook open in separate Jupyter Lab tabs for easier navigation. -==== - -Fill in a code cell for question 1 with a Python comment. - -[source,python] ----- -# See firstname-lastname-project02.py ----- - -For this question, read the xref:programming-languages:python:pdoc.adoc[pdoc section], and run a `bash` command to generate the documentation for your module that you created in the previous question, `firstname-lastname-project02.py`. To do this, look at the example provided in the book. Everywhere in the example in the pdoc section of the book where you see "mymodule.py" replace it with _your_ module's name -- `firstname-lastname-project02.py`. - -As an optional step, you can write bash code to create a documentation directory for this project. This is not required, but is good practice in order to keep your projects organized. An example of how to do this is below, using the `-p` flag to only create the directory if it does not already exist: - -[source,bash] ----- -mkdir -p $HOME/project2/docs ----- - -[WARNING] -==== -Use `python3` **not** `python` in your command. - -We are expecting you to run the command in a `bash` cell (likely using line magic). Additionally, ensure that your documentation directory is empty, as you will submit it at the end of this project and it should only have files for this project in it. Below are a few hints on how to get started writing this command. - -[source,bash] ----- -python3 -m pdoc [other commands here] ----- -==== - -[TIP] -==== -Use the `-o` flag to specify the output directory -- I would _suggest_ making it somewhere in your `$HOME` directory to avoid permissions issues. - -For example, I used `$HOME/project2/docs`. -==== - -[TIP] -==== -You can use the `d` flag to specify the type of of docstring you are using. For example, you can use `-d numpy` to specify that you are using numpy-style docstrings. -==== - -Once complete, on the left-hand side of the Jupyter Lab interface, navigate to your documentation directory. You should see something called `firstname-lastname-project02.html`. To view this file in your browser, right click on the file, and select btn:[Open in New Browser Tab]. A new browser tab should open with your freshly made documentation. Pretty cool! - -[IMPORTANT] -==== -Ignore the `index.html` file -- we are looking for the `firstname-lastname-project02.html` file. -==== - -[TIP] -==== -You _may_ have noticed that the docstrings are (partially) markdown-friendly. Try introducing some markdown formatting in your docstring for more appealing documentation. 
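For example, a module-level docstring along these lines (a made-up snippet, not a required format) renders with bold text and a bulleted list in the generated page:

[source,python]
----
"""Utilities for TDM 30100 project 2.

**Serialization formats discussed:**

- JSON -- human-readable text
- MessagePack -- compact binary
"""
----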
-==== - -**Relevant topics:** xref:programming-languages:python:pdoc.adoc[pdoc], xref:programming-languages:python:sphinx.adoc[Sphinx], xref:programming-languages:python:docstrings-and-comments.adoc[Docstrings & Comments] - -.Items to submit -==== -- Any and all `bash` used to generate your documentation. -==== - -=== Question 3 (2 pts) -[upperalpha] -.. Write a function called `get_records_for_date` with the functionality described below. -.. Write a Google-style docstring for the function, and regenerate your documentation. - -[NOTE] -==== -Any references to "watch data" just mean the dataset for this project. -==== - -In your `firstname-lastname-project02.py` file, write a function called `get_records_for_date` that accepts an `lxml` etree (of our watch data, via `etree.parse`), and a `datetime.date`, and returns a list of Record Elements, for a given date. Raise a `TypeError` if the date is not a `datetime.date`, or if the etree is not an `lxml.etree`. This should be included in both your `.ipynb` and `.py` files. - -Use the https://google.github.io/styleguide/pyguide.html#383-functions-and-methods[Google Python Style Guide's "Functions and Methods" section] to write the docstring for this function. Be sure to include type annotations for the parameters and return value. - -Re-generate your documentation. How does the updated documentation look? You may notice that the formatting is pretty ugly and things like "Args" or "Returns" are not really formatted in a way that makes it easy to read. - -Use the `-d` flag to specify the format as "google", and re-generate your documentation. How does the updated documentation look? - -[TIP] -==== -The following code should help get you started. - -[source,python] ----- -import lxml -import lxml.etree -from datetime import datetime, date - -def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list[lxml.etree._Element]: - # docstring goes here - - # test if `tree` is an `lxml.etree._ElementTree`, and raise TypeError if not - - # test if `for_date` is a `datetime.date`, and raise TypeError if not - - # loop through the records in the watch data using the xpath expression `/HealthData/Record` - - # how to see a record, in case you want to. (DO NOT PUT WITHIN THE FOR LOOP, OR YOU WILL GET A LOT OF OUTPUT AND POTENTIALLY AN ERROR) - print(lxml.etree.tostring(record)) - - # test if the record's `startDate` is the same as `for_date`, and append to a list if it is - - # return the list of records - -# how to test this function -tree = lxml.etree.parse('/anvil/projects/tdm/data/apple/health/watch_dump.xml') -chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date() -my_records = get_records_for_date(tree, chosen_date) -my_records ----- - -.output ----- -[, - , - , - , - , - , - , - , - , - , - , - , - .... ----- -==== - -[TIP] -==== -The following is some code that will be helpful to test the types. - -[source,python] ----- -from datetime import datetime, date - -isinstance(some_date_object, date) # test if some_date_object is a date -isinstance(some_xml_tree_object, lxml.etree._ElementTree) # test if some_xml_tree_object is an lxml.etree._ElementTree ----- -==== - -[TIP] -==== -To loop through records, you can use the `xpath` method. - -[source,python] ----- -for record in tree.xpath('/HealthData/Record'): - # do something with record ----- -==== - -[TIP] -==== -The `attrib` method will allow you to access a specific attribute of a record. For example, `record.attrib['endDate']` will return the `endDate` attribute of a record. 
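For instance, a minimal sketch of pulling one attribute out of a single record (the printed value is only illustrative; your first record's timestamp will differ):

[source,python]
----
import lxml.etree

tree = lxml.etree.parse('/anvil/projects/tdm/data/apple/health/watch_dump.xml')
record = tree.xpath('/HealthData/Record')[0]   # grab a single record to inspect
print(record.attrib['endDate'])                # a plain string, e.g. '2019-01-01 09:08:07 -0400'
----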
However, this simply returns a string and not a datetime `date` object. If you are having trouble figuring out how to appropriately make the comparison between `for_date` and the date of a record, take a look back at the above code for testing your function. It _may_ include some functions to help you out. -==== - -**Relevant topics:** xref:programming-languages:python:pdoc.adoc[pdoc], xref:programming-languages:python:sphinx.adoc[Sphinx], xref:programming-languages:python:docstrings-and-comments.adoc[Docstrings & Comments] - -.Items to submit -==== -- get_records_for_date function as described above, with an appropriate, Google-style docstring. -==== - -=== Question 4 (2 pts) -[upperalpha] -.. Modify your module so you do not need to pass the `-d` flag in order to let pdoc know that you are using Google-style docstrings. -.. Write a new function called `quad` that calculates the roots of a given quadratic equation. -.. Write a docstring for this function that includes math formulas, and render them appropriately using pdoc. -.. Add a logo to your documentation. - -This was _hopefully_ a not-too-difficult project that gave you some exposure to tools in the Python ecosystem, as well as chipped away at any rust you may have had with writing Python code. - -To end things off, investigate the https://pdoc.dev/docs/pdoc.html[official pdoc documentation] in order to answer the rest of this question. - -You will notice that there is a way to specify the docstring format in your module, so that you do not need to pass the `-d` flag when generating your documentation. Modify your module so that you do not need to pass the `-d` flag when generating your documentation. - -Next, write a function called `quad` that accepts 3 parameters representing the coefficients of a quadratic equation, `a`, `b`, and `c`, and prints the roots of the equation. Raise a `TypeError` if any of the parameters are not `int` or `float`. Raise a `ValueError` if `a` is 0. Each root should be separated by a comma. Write a docstring for this function that includes math formulas, and render them appropriately using pdoc. Ensure that this function appears in both your `.ipynb` and `.py` files. - -[NOTE] -==== -Below is the quadratic formula you should implement for this question: - -$x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}$ - -Lastly, add a logo to your documentation. You can use the Purdue logo, or any other logo you would like (as long as it is appropriate). -==== - -[NOTE] -==== -At the time of this project's writing, the Purdue logo can be found at https://upload.wikimedia.org/wikipedia/commons/3/35/Purdue_Boilermakers_logo.svg[this link]. -==== - -**Relevant topics:** xref:programming-languages:python:pdoc.adoc[pdoc], xref:programming-languages:python:sphinx.adoc[Sphinx], xref:programming-languages:python:docstrings-and-comments.adoc[Docstrings & Comments] - -.Items to submit -==== -- Modified module to specify your docstring format. -- New `quad` function as described above. -- Appropriate docstring for `quad` function, including properly rendered math formula. -- Documentation with logo. -==== - - -=== Submitting your Work -[WARNING] -==== -The submission requirements for this project are a bit complicated. Please take care to read this section carefully to ensure you recieve full credit for the work you did. 
-==== - -.Items to submit -==== -For this project, please submit the following files: - -- The `.ipynb` file with: - - a simple comment for question 1, - - a `bash` cell for question 2 with code that generates your `pdoc` html documentation, - - a code cell with your `get_records_for_date` function (for question 3) - - a code cell with the results of running - + -[source,python] ----- -# read in the watch data -tree = lxml.etree.parse('/anvil/projects/tdm/data/apple/health/watch_dump.xml') - -chosen_date = datetime.strptime('2019/01/01', '%Y/%m/%d').date() -my_records = get_records_for_date(tree, chosen_date) -my_records ----- - - a `bash` code cell with the code that generates your `pdoc` documentation as described in question 4. - - a code cell with your `quad` function (for question 4) - - a code cell with the results of running - + -[source,python] ----- -quad(3, -11, 4) ----- -- An `.html` file with your newest set of documention (including your question 4 modifications) -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - -Here is the Zoom recording of the 4:30 PM discussion with students from 28 August 2023: - -++++ - -++++ diff --git a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project03.adoc deleted file mode 100644 index fc0056a8d..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project03.adoc +++ /dev/null @@ -1,437 +0,0 @@ -= TDM 30100: Project 3 -- 2023 - -**Motivation:** Documentation is one of the most critical parts of a project. https://notion.so[There] https://guides.github.com/features/issues/[are] https://confluence.atlassian.com/alldoc/atlassian-documentation-32243719.html[so] https://docs.github.com/en/communities/documenting-your-project-with-wikis/about-wikis[many] https://www.gitbook.com/[tools] https://readthedocs.org/[that] https://bit.ai/[are] https://clickhelp.com[specifically] https://www.doxygen.nl/index.html[designed] https://www.sphinx-doc.org/en/master/[to] https://docs.python.org/3/library/pydoc.html[help] https://pdoc.dev[document] https://github.com/twisted/pydoctor[a] https://swagger.io/[project], and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can't go wrong with tools like https://www.sphinx-doc.org/en/master/[Sphinx], or https://pdoc.dev[pdoc]. - -**Context:** This is the second project in a 3-project series where we explore thoroughly documenting Python code, while solving data-driven problems. - -**Scope:** Python, documentation - -.Learning Objectives -**** -- Use Sphinx to document a set of Python code. -- Use pdoc to document a set of Python code. -- Write and use code that serializes and deserializes data. -- Learn the pros and cons of various serialization formats. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/apple/health/watch_dump.xml` - -== Questions - -[WARNING] -==== -Please use Firefox for this project. While other browsers like Chrome and Edge may work, we are providing instructions that are specific to Firefox and you may need to do a bit of research before getting another browser to work. - -Before you begin, open Firefox, and where you would normally put a URL, type the following, followed by enter/return. - -``` -about:config -``` - -Search for `network.cookie.sameSite.laxByDefault`, and change the value to `false`, and close the tab. (This was set to `false` in my browser, so don't be concerned if yours isn't `true` by default. Just ensure it is set to `false` before starting the project.) -==== - -=== Question 1 (2 pt) -[upperalpha] -.. Create a new directory in your `$HOME` directory called `project03`: `$HOME/project03` -.. Create a new copy of the project template in a Jupyter notebook in your project03 folder called project03.ipynb. -.. Create a module called `firstname_lastname_project03.py` in your `$HOME/project03` directory, with the contents of the previous project. -.. Write a module-level docstring for your project03 module. -.. Write a function-level docstring for the `get_records_for_date` function. - -[IMPORTANT] -==== -You may be concerned that this project will leave your Jupyter notebook looking empty. This is intended, as the majority of the deliverables for this project will be the documentation generated by bash code you will write soon. Additionally, we will explicity specify what the deliverables are step-by-step in each question, so you will know exactly what to submit. -==== - -First, start by creating your new directory and copying in the template. While the deliverables say this has to have a path of `$HOME/project03`, you can put it anywhere you want, just note that you will have to update your code to reflect the location you choose and your final submission should not contain files unrelated to this specific project. - -Next, copy the code you wrote in the previous project into a new python file in your project 3 directory called `firstname_lastname_project03.py`. If you didn't finish the previous project, feel free to copy in the below code to get up-to-date. Then fill in a module-level docstring for the module along with a function-level docstring for the `get_records_for_date` function, both using Google style docstrings. - -[NOTE] -==== -Make sure you change "firstname" and "lastname" to _your_ first and last name. -==== - -[NOTE] -==== -This is simply the code from the previous project that you wrote, along with all the docstrings you wrote. If you did not complete the previous project or get things working for whatever reason, feel free to use the code below. Otherwise, copy and paste your code from the previous project. -==== - -[source,python] ----- -""" -This module is for project 3 for TDM 30100. - -**Serialization:** Serialization is the process of taking a set or subset of data and transforming it into a specific file format that is designed for transmission over a network, storage, or some other specific use-case. - -**Deserialization:** Deserialization is the opposite process from serialization where the serialized data is reverted back into its original form. 
- -The following are some common serialization formats: - -- JSON -- Bincode -- MessagePack -- YAML -- TOML -- Pickle -- BSON -- CBOR -- Parquet -- XML -- Protobuf - -**JSON:** One of the more wide-spread serialization formats, JSON has the advantages that it is human readable, and has a excellent set of optimized tools written to serialize and deserialize. In addition, it has first-rate support in browsers. A disadvantage is that it is not a fantastic format storage-wise (it takes up lots of space), and parsing large JSON files can use a lot of memory. - -**MessagePack:** MessagePack is a non-human-readable file format (binary) that is extremely fast to serialize and deserialize, and is extremely efficient space-wise. It has excellent tooling in many different languages. It is still not the *most* space efficient, or *fastest* to serialize/deserialize, and remains impossible to work with in its serialized form. - -Generally, each format is either *human-readable* or *not*. Human readable formats are able to be read by a human when opened up in a text editor, for example. Non human-readable formats are typically in some binary format and will look like random nonsense when opened in a text editor. - -""" -import lxml -import lxml.etree -from datetime import datetime, date - - -def get_records_for_date(tree: lxml.etree._ElementTree, for_date: date) -> list: - """ - insert function-level docstring here - """ - - if not isinstance(tree, lxml.etree._ElementTree): - raise TypeError('tree must be an lxml.etree') - - if not isinstance(for_date, date): - raise TypeError('for_date must be a datetime.date') - - results = [] - for record in tree.xpath('/HealthData/Record'): - if for_date == datetime.strptime(record.attrib.get('startDate'), '%Y-%m-%d %X %z').date(): - results.append(record) - - return results ----- - -Next, in a `bash` cell in your `project03.ipynb` notebook, run the following, replacing "Firstname Lastname" with your name. This code will initialize a new Sphinx project inside your `project03` directory, and we will explore the actual contents and purpose of the files generated throughout this project. Before moving on though, be sure to read through https://www.sphinx-doc.org/en/master/man/sphinx-quickstart.html[this page of the official Sphinx documentation] to understand exactly what all of the arguments in this command do. - -[source,ipython] ----- -%%bash - -cd $HOME/project03 -python3 -m sphinx.cmd.quickstart ./docs -q -p project03 -a "Firstname Lastname" -v 1.0.0 --sep ----- - -[NOTE] -==== -What do all of these arguments do? Check out https://www.sphinx-doc.org/en/master/man/sphinx-quickstart.html[this page of the official documentation]. -==== - -You should be left with a newly created `docs` directory within your `project03` directory: `$HOME/project03/docs`. The directory structure should look similar to the following. 
- -.contents ----- -project03<1> -├── 39000_f2021_project03_solutions.ipynb<2> -├── docs<3> -│   ├── build <4> -│   ├── make.bat -│   ├── Makefile <5> -│   └── source <6> -│   ├── conf.py <7> -│   ├── index.rst <8> -│   ├── _static -│   └── _templates -└── kevin_amstutz_project03.py<9> - -5 directories, 6 files ----- - -<1> Our module (named `project03`) folder -<2> Your project notebook (probably named something like `firstname_lastname_project03.ipynb`) -<3> Your documentation folder -<4> Your empty build folder where generated documentation will be stored (inside `docs`) -<5> The Makefile used to run the commands that generate your documentation (inside `docs`) -<6> Your source folder. This folder contains all hand-typed documentation (inside `docs`) -<7> Your conf.py file. This file contains the configuration for your documentation. (inside `source`) -<8> Your index.rst file. This file (and all files ending in `.rst`) is written in https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html[reStructuredText] -- a Markdown-like syntax. (inside `source`) -<9> Your module. This is the module containing the code from the previous project, with nice, clean docstrings. (also given above) - -Please make the following modifications: - -. To Makefile: -+ -[source,bash] ----- -# replace -SPHINXOPTS ?= -SPHINXBUILD ?= sphinx-build -SOURCEDIR = source -BUILDDIR = build - -# with the following -SPHINXOPTS ?= -SPHINXBUILD ?= python3 -m sphinx.cmd.build -SOURCEDIR = source -BUILDDIR = build ----- -+ -. To conf.py: -+ -[source,python] ----- -# CHANGE THE FOLLOWING CONTENT FROM: - -# -- Path setup -------------------------------------------------------------- - -# If extensions (or modules to document with autodoc) are in another directory, -# add these directories to sys.path here. If the directory is relative to the -# documentation root, use os.path.abspath to make it absolute, like shown here. -# -# import os -# import sys -# sys.path.insert(0, os.path.abspath('.') - -# TO: - -# -- Path setup -------------------------------------------------------------- - -# If extensions (or modules to document with autodoc) are in another directory, -# add these directories to sys.path here. If the directory is relative to the -# documentation root, use os.path.abspath to make it absolute, like shown here. -# -import os -import sys -sys.path.insert(0, os.path.abspath('../..')) ----- - -Finally, with the modifications above having been made, run the following command in a `bash` cell in Jupyter notebook to generate your documentation. - -[source,bash] ----- -cd $HOME/project03/docs -make html ----- - -After complete, your module folders structure should look something like the following. 
- -.structure ----- -project03 -├── 39000_f2021_project03_solutions.ipynb -├── docs -│   ├── build -│   │   ├── doctrees -│   │   │   ├── environment.pickle -│   │   │   └── index.doctree -│   │   └── html -│   │   ├── genindex.html -│   │   ├── index.html -│   │   ├── objects.inv -│   │   ├── search.html -│   │   ├── searchindex.js -│   │   ├── _sources -│   │   │   └── index.rst.txt -│   │   └── _static -│   │   ├── alabaster.css -│   │   ├── basic.css -│   │   ├── custom.css -│   │   ├── doctools.js -│   │   ├── documentation_options.js -│   │   ├── file.png -│   │   ├── jquery-3.5.1.js -│   │   ├── jquery.js -│   │   ├── language_data.js -│   │   ├── minus.png -│   │   ├── plus.png -│   │   ├── pygments.css -│   │   ├── searchtools.js -│   │   ├── underscore-1.13.1.js -│   │   └── underscore.js -│   ├── make.bat -│   ├── Makefile -│   └── source -│   ├── conf.py -│   ├── index.rst -│   ├── _static -│   └── _templates -└── kevin_amstutz_project03.py - -9 directories, 29 files ----- - -Finally, let's take a look at the results! In the left-hand pane in the Jupyter Lab interface, navigate to `yourpath/project03/docs/build/html/`, and right click on the `index.html` file and choose btn:[Open in New Browser Tab]. You should now be able to see your documentation in a new tab. It should look something like the following. - -image::figure34.webp[Resulting Sphinx output, width=792, height=500, loading=lazy, title="Resulting Sphinx output"] - -[IMPORTANT] -==== -Make sure you are able to generate the documentation before you proceed, otherwise, you will not be able to continue to modify, regenerate, and view your documentation. -==== - -.Items to submit -==== -- Directory for project 3, containing an ipynb file and a python file as described above. -- Module and function level docstrings where appropriate in the python file. -- Documentation generated by Sphinx, as instructed above. -==== - - -=== Question 2 (3 pts) -[upperalpha] -.. Write a function called `get_avg_heart_rate` to get the average heart rate for a given date from our watch data. -.. Write a function called `get_median_heart_rate` to find median heart rate for a given date from our watch data. -.. Write a function called `graph_heart_rate` to create a box-and-whisker plot of heart rate for a given date from our watch data. -.. Give each function an appropriate docstring. -.. Run each function for April 4th, 2019 in your Jupyter notebook to prove they work. Ensure you add them to project03-key.py. -.. Regenerate your documentation, and view the results in a new tab. - -[NOTE] -==== -While you could redefine all of your logic to get data for a given date, it would be much easier to simply reuse the function you wrote in the previous project within your new functions. -==== - -[TIP] -==== -Feel free to use library functions for the above functions (i.e. statistics for mean and median and matplotlib for plotting) -==== - -You can test your code using the following code in your Jupyter notebook: - -[source,python] ----- -date_records = get_records_for_date(tree, for_date) -print(f"Average: {format(get_avg_heart_rate(date_records),'.2f')}") -print(f"Median : {format(get_median_heart_rate(date_records),'.2f')}") -graph_heart_rate(date_records) - -# This should output values in a format similar to the following: -# Average: 86.25 -# Median : 83.00 -# The box and whisker plot should reflect what you see in the average/median measures. 
Feel free to write an extra function to get standard deviations or quartiles for a more accurate way to check your work is correct. ----- - -.Items to submit -==== -- 3 functions, named and as described above, including function-level docstrings. -- Outputs of running the functions on April 4th, 2019. -- Documentation generated by Sphinx, as instructed above. -==== - - -=== Question 3 (3 pts) -[upperalpha] -.. Create your own README.rst file in the `docs/source` folder. -.. regenerate your documentation, and take a picture of the resulting webpage. - -One of the most important documents in any package or project is the `README` file. This file is so important that version control companies like GitHub and GitLab will automatically display it below the repositories contents. This file contains things like instructions on how to install the packages, usage examples, lists of dependencies, license links, etc. Check out some popular GitHub repositories for projects like `numpy`, `pytorch`, or any other repository you've come across that you believe does a good job explaining the project. - -In the `docs/source` folder, create a new file called `README.rst`. Choose 5 of the following "types" of reStructuredText from the https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html[this webpage], and create a fake README. The content can be https://www.lipsum.com/[Lorem Ipsum] type of content as long as it demonstrates 5 of the types of reStructuredText. - -- Inline markup -- Lists and quote-like blocks -- Literal blocks -- Doctest blocks -- Tables -- Hyperlinks -- Sections -- Field lists -- Roles -- Images -- Footnotes -- Citations -- Etc. - -[IMPORTANT] -==== -Make sure to include at least 1 https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#sections[section]. This counts as 1 of your 5 types of reStructuredText. -==== - -Once complete, add a reference to your README to the `index.rst` file. To add a reference to your `README.rst` file, open the `index.rst` file in an editor and add "README" as follows. - -.index.rst -[source,rst] ----- -.. project3 documentation master file, created by - sphinx-quickstart on Wed Sep 1 09:38:12 2021. - You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. - -Welcome to project3's documentation! -==================================== - -.. toctree:: - :maxdepth: 2 - :caption: Contents: - - README - -Indices and tables -================== - -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` ----- - -[IMPORTANT] -==== -Make sure "README" is aligned with ":caption:" -- it should be 3 spaces from the left before the "R" in "README". -==== - -In a new `bash` cell in your notebook, regenerate your documentation. - -[source,ipython] ----- -%%bash - -cd $HOME/project03/docs -make html ----- - -Check out the resulting `index.html` page, and click on the links. Pretty great! - -[TIP] -==== -Things should look similar to the following images. - -image::figure35.webp[Sphinx output, width=792, height=500, loading=lazy, title="Sphinx output"] - -image::figure36.webp[Sphinx output, width=792, height=500, loading=lazy, title="Sphinx output"] -==== - -.Items to submit -==== -- Screenshot labeled "question03_results". Make sure you https://the-examples-book.com/projects/templates#including-an-image-in-your-notebook[include your screenshot correctly]. -- OR a PDF created by exporting the webpage. 
-==== - -.Items to submit -==== -[NOTE] -==== -When you submit your assignment, make sure that the .ipynb is viewable from within Gradescope. If it says something like (Large file hidden), you can submit the screenshots as PNGs (or any image format that works) as separate files on the assignment and then reference their names in the .ipynb. The bottom line is that we should be able to see each screenshot in Gradescope, _without_ having to download your project first. This is because asking our TAs to download hundreds of projects would be a bit rude. Please post any clarifying questions on Piazza and we can answer them. -==== - -For this project, please submit the following files: - -- The `.ipynb` file with: - - all functions throughout the project, demonstrated to be working as excpected. - - every different bash command used to call Sphinx at least once - - screenshots whenever we asked for them in a question - - Screenshots of each section of your webpage documentation (NOT inside your Jupyter notebook). -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project04.adoc deleted file mode 100644 index 785c2ae9e..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project04.adoc +++ /dev/null @@ -1,154 +0,0 @@ -= TDM 30100: Project 4 -- 2023 - - -**Motivation:** Documentation is one of the most critical parts of a project. https://notion.so[There] https://guides.github.com/features/issues/[are] https://confluence.atlassian.com/alldoc/atlassian-documentation-32243719.html[so] https://docs.github.com/en/communities/documenting-your-project-with-wikis/about-wikis[many] https://www.gitbook.com/[tools] https://readthedocs.org/[that] https://bit.ai/[are] https://clickhelp.com[specifically] https://www.doxygen.nl/index.html[designed] https://www.sphinx-doc.org/en/master/[to] https://docs.python.org/3/library/pydoc.html[help] https://pdoc.dev[document] https://github.com/twisted/pydoctor[a] https://swagger.io/[project], and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can't go wrong with tools like https://www.sphinx-doc.org/en/master/[Sphinx], or https://pdoc.dev[pdoc]. - -**Context:** This is the second project in a 2-project series where we explore thoroughly documenting Python code, while solving data-driven problems. - -**Scope:** Python, documentation - -.Learning Objectives -**** -- Use Sphinx to document a set of Python code. -- Use pdoc to document a set of Python code. -- Write and use code that serializes and deserializes data. -- Learn the pros and cons of various serialization formats. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/apple/health/watch_dump.xml` - -=== Setting Up: - -This project will be building off of the work we did last week in Project 03. Please feel free to copy over your code so that you can continue working on it. I would recommend doing this in a new `Project 04` directory so that you have two completely distinct versions of this project, one for each submission you will need for this class. If you did not complete Project 03, please go back to that project and understand it before attempting this one. You are also encouraged to talk (early in the week) with one or more of the TAs if you are confused. - -=== Question 1 (4 pts) -.. Add autodoc configuration to your `conf.py` file, regenerate your documentation, and take a picture of the resulting webpage. - -The `pdoc` package was specifically designed to generate documentation for Python modules using the docstrings _in_ the module. As you may have noticed, this is not "native" to Sphinx. - -Sphinx has https://www.sphinx-doc.org/en/master/usage/extensions/index.html[extensions]. One such extension is the https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html[autodoc] extension. This extension provides the same sort of functionality that `pdoc` provides natively. - -To use this extension, modify the `conf.py` file in the `docs/source` folder. - -[source,python] ----- -# -- General configuration --------------------------------------------------- - -# Add any Sphinx extension module names here, as strings. They can be -# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom -# ones. -extensions = [ - 'sphinx.ext.autodoc' -] ----- - -Next, update your `index.rst` file so autodoc knows which modules to extract data from. - -[source,rst] ----- -.. project4 documentation master file, created by - sphinx-quickstart on Wed Sep 1 09:38:12 2021. - You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. - -Welcome to project4's documentation! -==================================== - -.. automodule:: firstname_lastname_project04 - :members: - -.. toctree:: - :maxdepth: 2 - :caption: Contents: - - README - -Indices and tables -================== - -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` ----- - -In a new `bash` cell in your notebook, regenerate your documentation. Check out the resulting `index.html` page, and click on the links. Not too bad! - -.Items to submit -==== -- Screenshot labeled "question04_results". Make sure you https://the-examples-book.com/projects/current-projects/templates#including-an-image-in-your-notebook[include your screenshot correctly]. -- OR a PDF created by exporting the webpage. -==== - -=== Question 2 (4 pts) -.. Import the appropriate extensions so that Sphinx recognizes Google_style docstrings. -.. Create a new function, `graph_avg_heart_rate`, that graphs the average heart rate for all dates in our watch data. -.. Regenerate your documentation, and take a picture of the resulting webpage. - -Okay, while the documentation looks pretty good, clearly, Sphinx does _not_ recognize Google style docstrings. As you may have guessed, there is an extension for that. - -Add the `napoleon` extension to your `conf.py` file. - -[source,python] ----- -# -- General configuration --------------------------------------------------- - -# Add any Sphinx extension module names here, as strings. 
They can be -# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom -# ones. -extensions = [ - 'sphinx.ext.autodoc', - 'sphinx.ext.napoleon' -] ----- - -Next, we would like you to write a new function called `graph_average_heart_rate` that graphs the average heart rate for all dates in our watch data. The type of graph your function generates is up to you, but it should be a meaningful and well-labeled graphic that demonstrates something about the data (i.e. shape, outliers, etc.). Make sure to include a Google style docstring for your function. - -[TIP] -==== -When writing more complicated functions, think about the steps they need to do. For example, our function needs to do the following: - -for each date in our data: + -- get the records for that date + -- get the average heart rate for that date + -- add the average heart rate to a list of averages + - -Then, finally, graph the list of averages. - -I think simply by looking at this pseudocode in combination with the functions you wrote for previous questions, you should be able to get a good idea of how to structure and write this function. -==== - -In a new `bash` cell in your notebook, regenerate your documentation. Check out the resulting `index.html` page, and click on the links. Much better! Take a final screenshot of your `index.html` page, and include it in this question's submission section - -.Items to submit -==== -- function `graph_avg_heart_rate` with a Google style docstring. -- Regenerated final documentation to recognize Google style docstrings. -- Screenshot labeled "question05_results". Make sure you https://the-examples-book.com/projects/templates#including-an-image-in-your-notebook[include your screenshot correctly]. -==== - -// ==== Question 6 (1 pts) - -.Items to submit -==== -For this project, please submit the following files: - -- The `.ipynb` file with: - - all functions throughout the project, demonstrated to be working as expected. - - every different bash command used to call Sphinx at least once - - screenshots whenever we asked for them in a question - - An `.html` file with your newest set of documentation. -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project05.adoc deleted file mode 100644 index fdf7379d3..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project05.adoc +++ /dev/null @@ -1,193 +0,0 @@ -= TDM 30100: Project 5 -- 2023 - -**Motivation:** Documentation is one of the most critical parts of a project. 
https://notion.so[There] https://guides.github.com/features/issues/[are] https://confluence.atlassian.com/alldoc/atlassian-documentation-32243719.html[so] https://docs.github.com/en/communities/documenting-your-project-with-wikis/about-wikis[many] https://www.gitbook.com/[tools] https://readthedocs.org/[that] https://bit.ai/[are] https://clickhelp.com[specifically] https://www.doxygen.nl/index.html[designed] https://www.sphinx-doc.org/en/master/[to] https://docs.python.org/3/library/pydoc.html[help] https://pdoc.dev[document] https://github.com/twisted/pydoctor[a] https://swagger.io/[project], and each have their own set of pros and cons. Depending on the scope and scale of the project, different tools will be more or less appropriate. For documenting Python code, however, you can't go wrong with tools like https://www.sphinx-doc.org/en/master/[Sphinx], or https://pdoc.dev[pdoc]. - -**Context:** This is the third project in a 3-project series where we explore thoroughly documenting Python code, while solving data-driven problems. - -**Scope:** Python, documentation - -.Learning Objectives -**** -- Use Sphinx to document a set of Python code. -- Use pdoc to document a set of Python code. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_authors.json` -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_series.json` -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json` -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_reviews_dedup.json` - -== Questions - - - -The listed datasets are fairly large, and interesting! They are `json` formatted data. Each _row_ of a single `json` file can be individually read in and processed. Take a look at a single row. - -[source,ipython] ----- -%%bash - -head -n 1 /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json ----- - -This is nice, because you can individually process a single row. Anytime you can do something like this, it is easy to break a problem into smaller pieces and speed up processing. The following demonstrates how you can read in a single line and process it. - -[source,python] ----- -import json - -with open("/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json") as f: - for line in f: - print(line) - parsed = json.loads(line) - print(f"{parsed['isbn']=}") - print(f"{parsed['num_pages']=}") - break ----- - -In this project, the overall goal will be to implement functions that perform certain operations, write the best docstrings you can, and use your choice of `pdoc` or `sphinx` to generate a pretty set of documentation. - -Begin this project by choosing a tool, `pdoc` or `sphinx`, and setting up a `firstname-lastname-project05.py` module that will host your Python functions. In addition, create a Jupyter Notebook that will be used to test out your functions, and generate your documentation. At the end of this project, your deliverable will be your `.ipynb` notebook and either a series of screenshots that captures your documentation, or a PDF created by exporting the resulting webpage of documentation. - -=== Question 1 (1pt) -[upperalpha] -.. 
In your Jupyter Notebook, run the example code above, and make sure you understand the dataset it reads. - -=== Question 2 (2 pts) -[upperalpha] - -.. Write a function called `scrape_image_from_url` that accepts a URL (as a string), and returns a `bytes` object of the data. -.. Write a function called `display_image_from_bytes` that displays the image directly, without saving the image to disk. - -Make sure `scrape_image_from_url` cleans up after itself and doesn't leave any image files on the filesystem. - -==== `scrape_image_from_url` - -. Create a variable with a temporary file name using the `uuid` package. -. Use the `requests` package to get the response. -+ -[TIP] -==== -[source,python] ----- -import requests - -response = requests.get(url, stream=True) - -# then the first argument to copyfileobj will be response.raw ----- -==== -+ -. Open the file and use the `shutil` package's `copyfileobj` function to copy `response.raw` to the file. -. Open the file and read the contents into a `bytes` object. -+ -[TIP] -==== -You can verify that you have a `bytes` object by running: - -[source,python] ----- -type(my_object) ----- - -.output ----- -bytes ----- -==== -+ -. Use `os.remove` to remove the image file. -. Return the bytes object. - -==== `display_image_from_bytes` - -. Convert the byte data into a readable image format using an image processing library. -. Open and display this readable image. - -[TIP] -==== -[source,python] ----- -from PIL import Image -from io import BytesIO -from IPython.display import display ----- -==== - -You can verify your function works by running the following: - -[source,python] ----- -import shutil -import requests -import os -import uuid -import hashlib - -url = 'https://images.gr-assets.com/books/1310220028m/5333265.jpg' -my_bytes = scrape_image_from_url(url) -m = hashlib.sha256() -m.update(my_bytes) -m.hexdigest() -display_image_from_bytes(my_bytes) ----- - -.output ----- -ca2d4506088796d401f0ba0a72dda441bf63ca6cc1370d0d2d1d2ab949b00d02 ----- -(image) - -=== Question 3 (2 pts) -[upperalpha] -.. Write a Python function called `top_reviewers` that reads the file `Goodreads_reviews_parsed.json` and returns the IDs of the top 5 users who have provided the most reviews. - -The following shows how to test the function: - -[source,python] ----- -filename = "Goodreads_reviews_parsed.json" -print(top_reviewers(filename)) ----- - -[NOTE] -==== -.. When you run this code with the provided sample JSON file, the `top_reviewers` function will print out the IDs of the top 5 users with the most reviews. -.. If there are ties in the number of reviews, it will return the users that appear first in the file. -==== - - -=== Question 4 (2 pts) - -[upperalpha] -.. Create a new function that does something interesting with one or more of these datasets. Just like _all_ the previous functions, make sure to include detailed and clear docstrings. - - - -=== Question 5 (1 pt) -[upperalpha] -.. Generate your final documentation, and assemble and submit your deliverables: - -- Screenshots and/or a PDF exported from your resulting documentation web page - - -.Project 05 Assignment Checklist -==== -* Jupyter `.ipynb` file with your code, comments and outputs for the assignment - ** `firstname-lastname-project05.ipynb`. -* Screenshots and/or a PDF exported from your resulting documentation web page to show your outputs. -* An HTML file (`.html`) with your newest set of documentation.
-* Submit files through Gradescope -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project06.adoc deleted file mode 100644 index 27be51bdc..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project06.adoc +++ /dev/null @@ -1,173 +0,0 @@ -= TDM 30100: Project 6 -- 2023 - -**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this next series of projects, particularly the processes. - -**Context:** We are about to dive straight into a series of projects that emphasize working with images (with other fun things mixed in). We will start out with a straightforward task, that will involve lots of visual, manual analyses of images after you modify them to be easier to analyze. Then, in future projects, we will start to use computer vision to do this analysis for of us. - -**Scope:** Python, images, openCV, skimage - -.Learning Objectives -**** -- Use `numpy`, `skimage`, and `openCV` to process images. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/images/ballpit.jpg` - -== Questions - -=== Question 1 (2 pts) -[upperalpha] -.. Write code to read in the image and display it in your notebook. -.. Write code to find the shape of the image. - -Let's ease into things by first taking a look at the image we are going to analyze for this project. First, read up on https://www.geeksforgeeks.org/matplotlib-pyplot-imshow-in-python/[this] matplotlib documentation on image processing, and then write a small snippet of Python code in order to read in our image and display it in our notebook. - -[TIP] -==== -Don't forget to run the below code in order to import the proper libraries for this project. -==== - -[source,python] ----- -import cv2 -import matplotlib.pyplot as plt -import matplotlib.image as mpimg -import numpy as np ----- - -[TIP] -==== -The functions `imread` and `imshow`, from matplotlib.image and matplotlib.pyplot respectively, will be useful for this question. -==== - -If you take a look at `img`, you will find it is simply a multidimensional `numpy.ndarray`, with a somewhat strange shape. We will discuss this shape more in Question 2, but for now you can note that the first two dimensions given are the height and width of the image, in pixels. Keep the third one in mind for now, we will discuss it later. - -For the last part of this question, write some code to print the shape of the image. What are the dimensions of the image? How many pixels wide and tall is it? -.Items to submit -==== -- Code to read in and display our image. -- Code used to print shape, and height and width of our image. -==== - - -=== Question 2 (2 pts) -[upperalpha] -.. 
Using openCV with two different methods, grayscale the image. -.. Find the shape of the grayscale image. -.. Write one to two sentences explaining any differences between this image shape and the shape you identified in the previous question. - -Now that we are familiar with the image we are working with, let's get started modifying it with the end goal of eventually making it easy to manually count how many balls of each color are in our image. - -First off, let's convert our image to grayscale. This is a good first step when analyzing an image, as it can give you an idea of the 'black-white contrast' for an image, which is often very useful in something referred to as 'contour-edge detection'. We will learn more about contour-edge detection, how to perform it, and what it is useful for later on in this course. - -Read through https://www.geeksforgeeks.org/python-grayscaling-of-images-using-opencv/[this short article] on how to do grayscaling of images with openCV. Then, using two different methods, convert the image to grayscale. Note that both of the methods of question are contained in the article provided. - -[TIP] -==== -The functions `imread` and `cvtColor` from openCV will be useful for this question, with the latter in conjunction with the `cv2.COLOR_BGR2GRAY` constant. -==== - -Once you've done this, print the image along with the shape of the image. How does this shape differ from the shape of the original image? What do you think the dimensions of the grayscale image represent? - - -.Items to submit -==== -- Code to grayscale the image, two different ways -- Printed image, grayscaled, -- Shape of grayscaled image, and explanation of what the dimensions represent. -==== - -=== Question 3 (2 pts) -[upperalpha] -.. Code to split the image into red, green, and blue color channels. -.. Code to display each channel (should be grayscale). -.. 1-2 sentences about our grayscale images and their usefulness in determining colors. - -While we are on the topic of color, let's take a look at the color channels of our image and how we can best analyze them individually. After all, outside of edge detection, you will likely want to talk about the different colors present in images you are analyzing. - -Read through https://www.geeksforgeeks.org/python-splitting-color-channels-opencv/[this short article] on how to split an image into its color channels with openCV. Then, write some code to split our image into its red, green, and blue color channels. Then, display each of the channels individually. You should see three grayscale images, each with slight but clearly noticeable differences from the others. - -From these images, do you think that it would be possible to determine which color ball is most common? Write a sentence or two discuss why or why not. - -.Items to submit -==== -- Code to split the image into its RGB color channels, and display each channel. -- 1-2 sentences about our grayscale images and their usefulness in determining colors. -==== - -=== Question 4 (2 pts) -[upperalpha] -.. Code to recolor our images into their respective colors. -.. Code to display each channel (should be colored). -.. Result of running provided code snippet to create red mask. - -[NOTE] -==== -You may notice when you first attempt this question that the colors are not matching up with what you expect. This is due to a difference in formatting between openCV and matplotlib, where openCV uses BGR instead of RGB. 
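For example, an image read with `cv2.imread` and passed straight to `plt.imshow` will show up with its red and blue channels swapped (a quick illustrative sketch, reusing the `ballpit.jpg` path from above):

[source,python]
----
import cv2
import matplotlib.pyplot as plt

# cv2.imread returns the pixel data in BGR order
img_bgr = cv2.imread('/anvil/projects/tdm/data/images/ballpit.jpg')

# matplotlib interprets the array as RGB, so red and blue appear swapped
plt.imshow(img_bgr)
plt.title('BGR data displayed as if it were RGB')
plt.show()
----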
You can fix this by using the `cv2.cvtColor` function with the `cv2.COLOR_BGR2RGB` constant, similar to how you used it to grayscale images in question 2. -==== - -Next, write some code to recolor each of the channels with its respective color, and display the colored images. You should see three images, each with a different color tint. Note that the colors may not be exactly what you expect, but they should be close. This can be done by creating another channel (a simple numpy array) of all zeroes, and then copying your channel into the proper dimension of the numpy array before displaying it with `imshow` as usual. - -Here is an example of how to do this with the red channel, if you're getting stuck: - -[source,python] ----- -blank = 255 * (r_c.copy() * 0) - -# r_c represents the red channel from the last question -red_image = cv2.merge([blank, blank, r_c]) -plt.imshow(plt.imshow(cv2.cvtColor(red_image, cv2.COLOR_BGR2RGB)), plt.title('Red Channel')) ----- - -Finally, run the following code after you have shown your color images. This will create something called a `color mask`, which you will find is much more useful in determing the most common color of ball in our image. - -[source,python] ----- -# Define lower and upper bounds for red color in BGR format -lower_red = np.array([100, 0, 0]) # Lower bound -upper_red = np.array([255, 100, 100]) # Upper bound - -# Create a mask for red pixels -red_mask = cv2.inRange(img, lower_red, upper_red) - -# Apply the red mask to the original image -red_pixels = cv2.bitwise_and(img, img, mask=red_mask) - -plt.figure(figsize=(12, 4)) # Create a larger figure for better visualization -plt.subplot(131), plt.imshow(red_pixels), plt.title('Red Masked') -plt.subplot(132), plt.imshow(img), plt.title('Original Image') ----- - -.Items to submit -==== -- Code to recolor each channel, and display each channel. -- 1-2 sentences about our colored images, their usefulness/shortcomings in analyzing color, and how they could be improved upon. -==== - -=== Submitting your Work -Nicely done, you've made it to the end of Project 6! This is likely a very new topic for many of you, so please take the time to get things right now and learn all of the core concepts before we move on to more advanced topics in the next project. Unlike for most of your other projects, it is actually okay if you get the 'File to large to display' error in Gradescope. We will be excusing it for this project due to the nature of wanting to display a lot of images in our notebook. Just make sure that if you redownload your .ipynb file from Gradescope, it contains everything you expect it to. - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this. -==== - -.Items to submit -==== -- `firstname-lastname-project06.ipynb`. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project07.adoc deleted file mode 100644 index e0e23b8d8..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project07.adoc +++ /dev/null @@ -1,116 +0,0 @@ -= TDM 30100: Project 7 -- 2023 -:page-mathjax: true - -**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this series of projects. - -**Context:** In the previous project, we learned to manipulate image's basic factors by functions from the openCV `cv2` module. In this project, we will understand key image features, detect color dominance, and perform enhancing the image's visual quality by histogram equalization technique - -**Scope:** Python, images, openCV, Histogram equalization - -.Learning Objectives -**** -- Process images using `numpy`, `matplotlib`, and `openCV`. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/images/ballpit.jpg` - - -== Questions - -=== Question 1 (2 pts) - -[loweralpha] - -.. Let's work with our `ballpit.jpg` again. In project 06, we split the image into its color channels (red, green and blue). With outputs for its color channels, please find out the average values of intensity for each channel -.. Display the average values for each channel with a bar chart. Briefly explain what is your finding from the bar chart - -[NOTE] -==== -* The average pixel values for the 3 channels can show the whole brightness of the image, reveal which color is dominant in the image, as well as image temperature - warm (reddish), cool(blueish) -* The average values of intensity of an image is calculated by summing up the intensity values of all pixels and dividing by the total number of pixels. Intensity is the value of a pixel. For a grayscaled image, the intensity has value from Black to white; for a color image in RGB, and each pixel has 3 intensity values, for R,G and B respectively. -==== -[TIP] -==== -* The average value can be calculated using numpy `mean()` -==== - -=== Question 2 (2 pts) - -.. In project 06, you created a red mask for red pixels and applied the red mask to the original image. Please create another 2 masks for green and blue channels. -.. Please identify how many pixels in the image are red, green and blue (respectively), and visualize the number of pixels for the 3 channels using a combined Histogram. Briefly explain what you found from the diagrams. - -[NOTE] -==== -A combined histogram here means a chart with 3 bars for the 3 channels respectively. The x-axis is the 3 channels, and the y-axis is the number of pixels for each channel. -==== - -[NOTE] -==== -* The summaries for each channel state the number of pixels for each color. So if `blue` has largest number, we can say blue is the dominant color of the image. 
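* As a rough sketch of the counting step, a mask per color (built the same way as the red mask from project 06) can be tallied like this -- the bounds shown are purely illustrative:
+
[source,python]
----
import cv2
import numpy as np
import matplotlib.image as mpimg

img = mpimg.imread('/anvil/projects/tdm/data/images/ballpit.jpg')  # RGB order, as in project 06

# purely illustrative bounds -- tune them to your own judgement of each color
red_mask   = cv2.inRange(img, np.array([100, 0, 0]), np.array([255, 100, 100]))
green_mask = cv2.inRange(img, np.array([0, 100, 0]), np.array([100, 255, 100]))
blue_mask  = cv2.inRange(img, np.array([0, 0, 100]), np.array([100, 100, 255]))

# cv2.inRange marks matching pixels with 255, so count the nonzero entries
counts = {color: int(np.count_nonzero(mask))
          for color, mask in zip(['red', 'green', 'blue'], [red_mask, green_mask, blue_mask])}
print(counts)
----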
-* To define lower and upper bounds for 3 colors depends on your personal judgement. You may need to adjust those thresholds value according to the different images and different purpose of your task. -==== -[TIP] -==== -`numpy sum()` can be used to summarize pixels -==== - -=== Question 3 (2 pts) - -[loweralpha] -.. Write a function called `equal_histogram_gray` using the histogram equalized technique: -... The function will accomplish a way to enhance image area that is too dark or light by adjusting the intensity values; it will only consider intensity but not any color information. -... The input argument to the function is an image -... The function returns a tuple of two images: one is the grayscaled image, and the other is a histogram-equalized grayscaled image - -.. Run the function with "ballpit.jpg" as input. Visualize the 2 output images aligning with the original "ballpit.jpg" using a Histogram chart - -[NOTE] -==== -`Histogram equalization` is a technique in `digital image processing`. It is a process where the intensity values of an image are adjusted to create a higher overall contrast. -`Digital Image Processing` is a significant aspect of data science. It is used to enhance and modify images so that their attributes are more easily understand. - -You may refer to more information about `Histogram Equalization` from the following website -https://www.educative.io/answers/what-is-histogram-equalization-in-python - -==== -[TIP] -==== -* The following 2 ways can be used to convert the image "ballpit.jpg" to grayscaled image -[source,python] -import cv2 -import matplotlib.image as mpimg -IMAGE = '/anvil/projects/tdm/data/images/ballpit.jpg' -img = mpimg.imread(IMAGE) -gray_img1 = cv2.imread(IMAGE, 0) -gray_img2= cv2.cvtColor(img.copy(), cv2.COLOR_BGR2GRAY) - -* The `cv2.equalizeHist()` function will be useful to solve the question. -==== - -=== Question 4 (2 pts) - -[loweralpha] -.. Process one of your favorite photos with the function `equal_histogram_gray`. Write 1-2 sentences about your input and output. Make sure to show the result of the images. - -Feel free to use `/anvil/projects/tdm/data/images/coke.jpg` -- the results are pretty neat! - - -Project 07 Assignment Checklist -==== -* Jupyter Lab notebook with your codes, comments and outputs for the assignment - ** `firstname-lastname-project07.ipynb`. - -* Submit files through Gradescope -==== -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project08.adoc deleted file mode 100644 index 14cf47c40..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project08.adoc +++ /dev/null @@ -1,93 +0,0 @@ -= TDM 30100: Project 8 -- 2023 - -**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this series of projects. - -**Context:** In the previous projects, we worked with images and implemented image Histogram Equalization, with some pretty cool results! 
In this project, we will continue to work with images key features, introduce YCbCr color space, and perform enhancing the image's visual quality by histogram equalization technique with colors - -**Scope:** Python, images, openCV, Histogram equalization, YCbCr, image digital fingerprint - -.Learning Objectives -**** -- - Process images using `numpy`, `matplotlib`, and `openCV`. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/images/ballpit.jpg` - -[NOTE] -==== -As in our previous projects, by default, a image is read in as a `RGB` image, where each pixel is represented as a value between 0 and 255, R represents "red", G represents "green", and B represents "blue". While it is natural for display with RGB, image with `YCbCr` format has advantages in many image processing, compression situations etc. - -`YCbCr` is a color space used in image processing. Y stands for "Luminance", Cb stands for "Chrominance Blue", Cr stands for "Chrominance red". You may get more information for `YCbCr` from https://en.wikipedia.org/wiki/YCbCr[YCbCr] - -`YCbCr` can be derived from the RGB color space. There are several Python libraries can be used to do the conversion, in this project we will use cv2 from OpenCV -[source, python] -import cv2 -rgb_img=cv2.imread(('/anvil/projects/tdm/data/images/ballpit.jpg')) -ycbcr_img = cv2.cvtColor(img,cv2.COLOR_BGR2YCrCb) -==== - -== Questions - -=== Question 1 (2 pts) - -[loweralpha] -.. Please split `/anvil/projects/tdm/data/images/ballpit.jpg` into its `YCbCr`components and display them - - -[TIP] -==== -To display the YCbCr Y component, you will need to set the Cb and Cr components to 127. To display the Cb component, you will need to set the Cr and Y components to 127, etc. -==== - -[NOTE] -==== -The human eye is more sensitive to luminance than to color. As you can tell from the previous question, the Y component captures the luminance, and contains the majority of the image detail that is so important to our vision. The other Cb and Cr components are essentially just color components, and our eyes aren't as sensitive to changes in those components. -Luminance shows the brightness of an image. An RGB image can be converted to a YCbCr image. The histogram equalization then can apply to the luminance without impacting the color channels (Cb and Cr channels), which, if histogram equalization directly applies to an RGB image, it may cause image artifacts issues. "Artifacts issues" refers to unwanted distortion in an image. -Let's process some images in the following questions to makes this explicitly clear -==== - -=== Question 2 (2 pts) - -[loweralpha] -.. Please write a function named `equal_hist_rgb` to do Histogram Equalization directly to an image with RGB format. The parameter will be an image. The returns will be a Histogram Equalized colored image. Run the function with input `ballpit.jpg`. Show the output Histogram Equalized colored image. - - -=== Question 3 (2 pts) -[loweralpha] - -.. Please write a function named `equal_hist_YCrCb` that applies Histogram Equalization to an image, so that first the image will be converted from RGB format to YCrCb format, then apply Histogram Equalization. The parameter will be an image. The returns will be a Histogram Equalized colored image. Run the function with image `ballpit.jpg`. 
Show the output Histogram Equalized colored image. - -[TIP] -==== -We can read a 3-chanel RGB image by both `openCV cv2` and `matplotlib.image`. However, please do notice the output for cv2 is in BGR order but for matplotlib.image is in RGB order. - -`cv2.split()` will be useful to split the image to 3 channels -`cv2.equalizeHist()` will be useful to do histogram equalization. -`cv2.merge()` will be useful to combine all channels back to an equalized image - -==== -=== Question 4 (2 pts) - -[loweralpha] -.. Please plot the original image of `ballpit.jpg`, output images of it from question 2 and question 3 as a combined chart. What is your conclusion? - - -Project 08 Assignment Checklist -==== -* Jupyter Lab notebook with your codes, comments and outputs for the assignment - ** `firstname-lastname-project08.ipynb`. - -* Submit files through Gradescope -==== -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project09.adoc deleted file mode 100644 index cd031897f..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project09.adoc +++ /dev/null @@ -1,156 +0,0 @@ -= TDM 30100: Project 9 -- 2023 - -**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this project. - -**Context:** In the previous project, we were able to isolate and display the Y, Cb and Cr channels of our `ballpit.jpg` image, and we applied an image histogram equalization technique to Y and then merged 3 components, to an equalized image. We understood the structure of an image and how the image's luminance (Y) and chrominance (Cb and Cr) contributed to the whole image. The human eye is more sensitive to the Y Channel than color channels Cb & Cr. In this project, we will continue to work with 'YCbCr` images as we delve into some image compression techniques, we will implement a variation of jpeg image compression! - -**Scope:** Python, images, openCV, YCbCr, downsampling, discrete cosine transform, quantization - -.Learning Objectives -**** -- Be able to process images compression utilizing using `openCV` -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/images/ballpit.jpg` - -== Questions - -[NOTE] -==== -Some helpful links that are really useful. - -- https://en.wikipedia.org/wiki/JPEG -- https://en.wikipedia.org/wiki/Quantization_(image_processing) -- https://home.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL17.pdf (if you are interested in Huffman coding) - -JPEG is a _lossy_ compression format and an example of transform based compression. Lossy compression means that you can't retrieve the information that was lost during the compression process. In a nutshell, these methods use statistics to identify and discard redundant data. 
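One quick way to see how much redundant data gets discarded is to compare the size of the decoded pixel array with the size of the JPEG file on disk (a small illustrative sketch using the dataset image above):

[source,python]
----
import os
import cv2

IMAGE = '/anvil/projects/tdm/data/images/ballpit.jpg'

img = cv2.imread(IMAGE)                 # decoded, uncompressed pixels
raw_bytes  = img.nbytes                 # height * width * 3 bytes
jpeg_bytes = os.path.getsize(IMAGE)     # compressed size on disk

print(f'raw: {raw_bytes} bytes, jpeg: {jpeg_bytes} bytes, '
      f'roughly {raw_bytes / jpeg_bytes:.0f}x smaller on disk')
----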
-==== - -[NOTE] -==== -Since the human eye is more sensitive to the Y Channel than color channels, we can reduce the resolution of the color components to achieve image compression. -we will first need to import some libraries -[source,python] -import cv2 -import numpy as np -import matplotlib.pyplot as plt - -To read the image, we will use openCV `cv2` -[source, python] -ballpit_bgr= cv2.imread('/anvil/projects/tdm/data/images/ballpit.jpg') - -Then convert the image from default rgb format to YCrCb format -[source,python] -ballpit_ycrcb = cv2.cvtColor(ballpit_bgr,cv2.COLOR_BGR2YCrCb) -==== -=== Question 1 (2 pts) -[loweralpha] - -First we will use a downsample technique, to compress an image by reducing the resolution of the color channels. It will return a YCrCb image with lower resolution. - -The following statement downsamples the image Cr channel to half (0.5) by using `cv2.resize` -[source,python] -ballpit_reduce = cv2.resize(ballpit_ycrcb[:,:,1],(0,0),fx=0.5,fy=0.5) - -Then we will need to use `cv2.resize()` to upsample the resolution reduced image to the original size by using the original image size's tuple -[source, python] -cv2.resize(ballpit_reduce,(ballpit_ycrcb.shape[1],ballpit_ycrcb.shape[0])) - -.. Please write a function named compress_downsample, it will take 3 arguments, a `jpg` file, a float number (fx) for the width downsampling factor; a float number (fy) for the Height downsampling factor. The returns will be a compressed ( downsampled ) image -.. Visualize the compressed image aligned with original image -.. Calculate the compression ratio - -[TIP] -You may use `cv2.imwrite` to save the compressed image to a file, get the size of it and divide by size of original image file - -=== Question 2 (2 pts) - -Second let's look into the discrete cosine transform technique -[NOTE] -Per https://www.mathworks.com/help/images/discrete-cosine-transform.html[MathWorks], the discrete cosine transform has the property that visually significant information about an image is concentrated in just a few coefficients of the resulting signal data. Meaning, if we are able to capture the majority of the visually-important data from just a few coefficients, there is a lot of opportunity to _reduce_ the amount of data we need to keep. So DCT is a technique allow the important parts of an image separated from the unimportant ones. - -E.g. -We will need to split the previous created `ballpit_ycrcb` into 3 Channels -[source,python] -y_c, cr_c,cb_c = cv2.split(ballpit_ycrcb) - -Next, apply 2D DCT to each channel by `cv2.dct` -[source,python] -y_c_dct = cv2.dct(y_c.astype(np.float32)) -cr_c_dct = cv2.dct(cr_c.astype(np.float32)) -cb_c_dct = cv2.dct(cb_c.astype(np.float32)) - -.. Please find the dimension for the output DCT blocks -.. Please print a 8*8 DCT blocks for each channel separately - -[TIP] -==== -* `.astype` is a method to convert numpy array to a certain data type. -* `np.flfoat32` is a data type of 32-bit floating point numbers array -* `shape` will be useful for the block dimensions -==== - -=== Question 3 (2 pts) - -Now let us try to visualize the output of DCT compression. One common way to do it will be to set value zero to some of the DCT coefficients, such as high-frequency ones at right or downward in the DCT output matrix, for example if we only want to keep top-left of 50*50 block of coefficients. We can set the value to zero to all other areas. 
For example, for the Y channel, -[source, python] -cut_v = 50 -y_c_dct[cut_v:,:]=0 -y_c_dct[:,cut_v:]=0 - -After updating the DCT coefficients, we can do inverse DCT on each channel to change back to its pixel intensities from its frequency representation, for example for Y channel -[source, python] -y_rec = cv2.idct(y_c_dct.astype(np.float32)) - -.. Please create a function named `compress_DCT` to implement image compression with DCT. The arguments are a jpg image, and a number for the coefficient area you would like to keep (we only need to consider same size for horizontal and vertical directions) -.. Visualize the DCT compressed image for ballpit.jpg align with the original one -.. Calculate the compression ratio - -=== Question 4 (2 pts) - -Next, let us try a quantization technique. Quantization reduces the precision of the DCT coefficients based on human perceptual characteristics. This introduces data loss, but reduces image size greatly. You can read more about quantization https://en.wikipedia.org/wiki/Quantization_(image_processing)[here]. Apparently, the human brain is not very good at distinguishing changes in high frequency parts of our data, but good at distinguishing low frequency changes. - -We can use a quantization matrix to filter out the higher frequency data and maintain the lower frequency data. One of the more common quantization matrix is the following. - -[source,python] ----- -q1 = np.array([[16,11,10,16,24,40,51,61], - [12,12,14,19,26,28,60,55], - [14,13,16,24,40,57,69,56], - [14,17,22,29,51,87,80,62], - [18,22,37,56,68,109,103,77], - [24,35,55,64,81,104,113,92], - [49,64,78,87,103,121,120,101], - [72,92,95,98,112,100,103,99]]) - ----- -We can quantize the DCT coefficients by dividing the value from quantization matrix and rounding to integer. For example for Y channel - -[source,python] -np.round(y_c_dct/q1) - -.. Please create a function called `compress_quant` that will use the function from question 3, select a 8*8 block and quantize the DCT coefficients before we do DCT inversion -.. Run the function with image ballpit.jpg, visualize the output compressed image align with original one -.. Calculate the compression ratio - - - -Project 09 Assignment Checklist -==== -* Jupyter Lab notebook with your codes, comments and outputs for the assignment - ** `firstname-lastname-project09.ipynb`. - -* Submit files through Gradescope -==== -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project10.adoc deleted file mode 100644 index 8600d4988..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project10.adoc +++ /dev/null @@ -1,172 +0,0 @@ -= TDM 30100: Project 10 -- 2023 - -**Motivation:** In general, scraping data from websites has always been a popular topic in The Data Mine. In addition, it was one of the requested topics. For the remaining projects, we will be doing some scraping of housing data, and potentially: `sqlite3`, containerization, and analysis work as well. 
- -**Context:** This is the first in a series of web scraping projects with a focus on web scraping that incorporates of variety of skills we've touched on in previous Data Mine courses. For this first project, we will start slow with a `selenium` review with a small scraping challenge. - -**Scope:** selenium, Python, web scraping - -.Learning Objectives -**** -- Use selenium to interact with a web page prior to scraping. -- Use selenium and xpath expressions to efficiently scrape targeted data. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 (2 pts) -[loweralpha] -The following code provides you with both a template for configuring a Firefox browser selenium driver that will work on Anvil, as well as a straightforward example that demonstrates how to search web pages and elements using xpath expressions, and simulate mouse clicks. Take a moment, run the code, and refresh your understanding. - -[source,python] ----- -import time -from selenium import webdriver -from selenium.webdriver.firefox.options import Options -from selenium.webdriver.common.desired_capabilities import DesiredCapabilities -from selenium.webdriver.common.keys import Keys ----- - -[source,python] ----- -firefox_options = Options() -firefox_options.add_argument("--window-size=810,1080") -# Headless mode means no GUI -firefox_options.add_argument("--headless") -firefox_options.add_argument("--disable-extensions") -firefox_options.add_argument("--no-sandbox") -firefox_options.add_argument("--disable-dev-shm-usage") - -driver = webdriver.Firefox(options=firefox_options) ----- - -[source,python] ----- -# navigate to the webpage -driver.get("https://books.toscrape.com") - -# full page source -print(driver.page_source) - -# get html element -e = driver.find_element("xpath", "//html") - -# print html element -print(e.get_attribute("outerHTML")) - -# find the 'Music'link in the homepage -link = e.find_element("xpath", "//a[contains(text(),'Music')]") -# click the link -link.click() -# We can delay the program to allow the page to load -time.sleep(5) -# get new root HTML element -e = driver.find_element("xpath",".//html") - # print html element -print(e.get_attribute("outerHTML")) - ----- - -.. Please use `selenium` to get and display the first book's title and price in the Music books page -.. At same page, try to find book titled "How music works" then `click` this book link and then scrape and print book information: product description, upc and availability - -Take a look at the page source -- do you think clicking the book link was needed in order to scrape that data? Why or why not? - -[NOTE] -==== -You may get more information about `xpath` here: https://www.w3schools.com/xml/xpath_intro.asp [xpath] -==== - - -=== Question 2 (6 pts) - -Okay, Now, let us look into a popular website of housing market. https://zillow.com has extremely rich data on homes for sale, for rent, and lots of land. - -Click around and explore the website a little bit. Note the following. - -. Homes are typically list on the right hand side of the web page in a 21x2 set of "cards", for a total of 40 homes. -+ -[NOTE] -==== -At least in my experimentation -- the last row only held 1 card and there was 1 advertisement card, which I consider spam. -==== -. 
If you want to search for homes for sale, you can use the following link: `https://www.zillow.com/homes/for_sale/{search_term}_rb/`, where `search_term` could be any hyphen separated set of phrases. For example, to search Lafayette, IN, you could use: `https://www.zillow.com/homes/for_sale/lafayette-in_rb` -. If you want to search for homes for rent, you can use the following link: `https://www.zillow.com/homes/for_rent/{search_term}_rb/`, where `search_term` could be any hyphen separated set of phrases. For example, to search Lafayette, IN, you could use: `https://www.zillow.com/for_rent/lafayette-in_rb` -. If you load, for example, https://www.zillow.com/homes/for_rent/lafayette-in_rb and rapidly scroll down the right side of the screen where the "cards" are shown, it will take a fraction of a second for some of the cards to load. In fact, unless you scroll, those cards will not load, and if you were to parse the page contents, you would not find all 40 cards are loaded. This general strategy of loading content as the user scrolls is called lazy loading. - -.. Write a function called `get_properties_info` that, given a `search_term` (zipcode), will return a list of property information include zpid, price, number of bedroom, number of bathroom and square footage (sqft) . The function should both get all of the cards on a page, but cycle through all of the pages of homes for the query. - -[TIP] -==== -The following was a good query that had only 2 pages of results. - -[source,python] ----- -properties_info = get_properties_info("47933") ----- -==== - -[TIP] -==== -You _may_ want to include an internal helper function called `_load_cards` that accepts the driver and scrolls through the page slowly in order to load all of the cards. - -https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python[This] link will help! Conceptually, here is what we did. - -. Get initial set of cards using xpath expressions. -. Use `driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1])` to scroll to the last card that was found in the DOM. -. Find cards again (now that more may have loaded after scrolling). -. If no more cards were loaded, exit. -. Update the number of cards we've loaded and repeat. -==== - -[TIP] -==== -Sleep 5 seconds using `time.sleep(5)` between every scroll or link click. -==== - -[TIP] -==== -After getting the information for each page, use `driver.delete_all_cookies()` to clear off cookies and help avoid captcha. -==== - -[TIP] -==== -If you using the link from the "next page" button to get the next page, instead, use `next_page.click()` to click on the link. Otherwise, you may get a captcha. -==== - -[TIP] -==== -Use something like: - -[source,python] ----- -with driver as d: - d.get(blah) ----- - -This way, after exiting the `with` scope, the driver will be properly closed and quit which will decrease the likelihood of you getting captchas. -==== - -[TIP] -==== -For our solution, we had a `while True:` loop in the `_load_cards` function and in the `get_properties_info` function and used the `break` command in an if statement to exit. -==== - - -Project 10 Assignment Checklist -==== -* Jupyter Lab notebook with your codes, comments and outputs for the assignment - ** `firstname-lastname-project10.ipynb`. - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project11.adoc deleted file mode 100644 index c386089d0..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project11.adoc +++ /dev/null @@ -1,164 +0,0 @@ -= TDM 30100: Project 11 -- 2023 - -**Motivation:** In general, scraping data from websites has always been a popular topic in The Data Mine. In addition, it was one of the requested topics. We will continue to use "books.toscrape.com" to practice scraping skills - -**Context:** This is a second project focusing on web scraping combined with the BeautifulSoup library - -**Scope:** Python, web scraping, selenium, BeautifulSoup - -.Learning Objectives -**** -- Use Selenium and XPath expressions to efficiently scrape targeted data. -- Use BeautifulSoup to scrape data from web pages -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - - -In the previous project, you learned how to get the 'Music' category link in the webpage of "books.toscrape.com", and how to use `Selenium` to scrape books' information. The follow is the sample code for the solution for question 1. - -[source,python] ----- -import time -from selenium import webdriver -from selenium.webdriver.firefox.options import Options -from selenium.webdriver.common.by import By - -firefox_options = Options() -firefox_options.add_argument("--window-size=810,1080") -# Headless mode means no GUI -firefox_options.add_argument("--headless") -firefox_options.add_argument("--disable-extensions") -firefox_options.add_argument("--no-sandbox") -firefox_options.add_argument("--disable-dev-shm-usage") - -driver = webdriver.Firefox(options=firefox_options) - -driver.get("https://books.toscrape.com") -e_t = driver.find_element("xpath",'//article[@class="product_pod"]/h3/a') -e_p = driver.find_element("xpath",'//p[@class="price_color"]') -fst_b_t = e_t.text -fst_b_p =e_p.text - -# find book entitled "how music works" -book_link = driver.find_element(By.LINK_TEXT, "How Music Works") -book_link.click() -time.sleep(5) - -#scrape and print book information : product description, UPC and availability -product_desc=driver.find_element(By.CSS_SELECTOR,'meta[name="description"]').get_attribute('content') -product_desc -table = driver.find_element(By.XPATH, "//table[@class='table table-striped']") -upc= table.find_element(By.XPATH, ".//th[text()='UPC']/following-sibling::td[1]") -upc_value = upc.text -upc_value - -availability = table.find_element(By.XPATH, ".//th[text()='Availability']/following-sibling::td[1]") -availability_value = availability.text -availability_value -driver.quit() ----- -[NOTE] -In this project we will include BeautifulSoup in our webscraping journey. BeautifulSoup is a python library. You can use it to extract data from HTML or XML files. You may find more BeautifulSoup information here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ - -== Questions - -=== Question 1 (2 pts) - -.. 
Please create a function called "get_category" to extract all categories' names in the website. The function does not need any arguments. The function returns a list of categories' names. - -[TIP] -==== -* Use BeautifulSoup for this question -[source,python] -from bs4 import BeautifulSoup -==== -[TIP] -==== -* You can parse the page with BeautifulSoup -[source,python] -bs = BeautifulSoup(driver.page_source,'html.parser') -==== -[TIP] -==== -* Review the page source of the website's homepage, including categories located at the sidebar. The BeautifulSoup "select" method is useful to get names, like this: - -[source,python] -categories = [c.text.strip() for c in bs.select('.nav-list li a')] -==== - -=== Question 2 (2 pts) - -.. Please create a function called "get_all_books" to get first page books for a given category name from question 1. Use "Music_14" to test the function. The argument is a category name. The function returns a list of book objects with book titles, book price and book availability from the first webpage. - -[TIP] -==== -* Review the page source, you may find one "article" tag holds one book information. You may use find_all to find all "article" tags, like - -[source, python] -articles=bs.find_all("article",class_="product_pod") -==== - -[TIP] -==== -* You may create an object to hold the book information, like: -[source,python] -book = { - "title":title, - "price":price, - "availability":availability -} -==== - -[TIP] -==== -* You may use a loop to go through the books, like -[source,python] -for article in articles: - title = article.h3.a.attrs['title'] - price = article.find('p',class_='price_color').text - availability = article.find('p',class_='instock availability').text -# create a book object with the extract information - .... -==== -[TIP] -==== -* You may need a list to hold all book objects, and add all books to it, like -[source,python] -all_books=[] -... -all_books.append(book) -==== -[NOTE] -==== -* You may use different ways to solve the question, like use function "map" etc. -==== - -=== Question 3 (2 pts) - -You may have noticed that some categories like "fantasy_19" have more than one page of books. - -.. Please update the function "get_all_books" from question 2 so that the function can be used to get all books, even if there are multiple pages for the category. - -[TIP] -==== -* Look for pagination link "next" -==== - -=== Question 4 (2 pts) - -.. Look through the website "books.toscrape.com", and pick anything that interests you, and scrape and display those data. - -Project 11 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project11.ipynb` -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project12.adoc deleted file mode 100644 index ba9560f38..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project12.adoc +++ /dev/null @@ -1,184 +0,0 @@ -= TDM 30100: Project 12 -- 2023 - -**Motivation:** In general, scraping data from websites has always been a popular topic in The Data Mine. In addition, it was one of the requested topics. We will continue use https://books.toscrape.com to practice scraping skills, visualize scrapped data, use `sqlite3` to save scrapped data to database - -**Context:** This is a third project focusing on web scraping combined with sqlite3 - -**Scope:** Python, web scraping, selenium, BeautifulSoup, sqlite3 - -.Learning Objectives -**** -- Visualize scraped data. -- Create tables for scraped data -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 (2 pts) - -In the previous project, you have been able to scrape data from https://books.toscrape.com - -Now let's visualize your scrapped data - -.. Please visualize the books' price of music category with a bar plot. Split the prices into three price ranges: below 20, 20-30, above 30 - -[TIP] -==== -You may need to change the price to float, like -[source, python] -prices = [float(book['price'].replace('£','')) for book in books] - -books is the book list from the previous project's function "get_all_books", like this: - -books = get_all_books("Music_14") -==== -[TIP] -==== -You may use sum to group prices, like -[source,python] -price_less_20 = sum(1 for price in prices if price<20) -price_20_30 = sum(1 for price in prices if 30<=price<50) -... -==== -[TIP] -==== -You may use a bar chart, like -[source,python] -price_counts = [price_less_20, price_20_30,price_above_30] -labels = ["1","2","3"] -plt.bar(labels,price_counts,color=['purple','orange','green']) -# More plt settings and display statements -==== - -=== Question 2 (2 pts) - -.. Write `CREATE TABLE` statements to create 2 tables, namely, a `categories` table and a `books` table. - -[TIP] -==== -Check on the website for category information. The categories table may contain following fields -- 'id' a unique identifier for each category, auto increment -- 'category' like 'poetry_23' - -==== -[TIP] -==== -Check on the website for book information. The "books" table may contain following fields -- 'id' a unique identifier for each category, auto increment -- 'title' like 'A light in the Attic" -- 'category' like 'poetry_23' -- 'price' like 51.77 -- 'availability' like 'in stock(22 available)' - -==== - -[TIP] -==== -Use `sqlite3` to create the tables in a database called `$HOME/onlinebooks.db`. You can do all of this from within Jupyter Lab. - -[source,python] ----- -%sql sqlite:///$HOME/onlinebooks.db ----- - -[source,python] ----- -%%sql - -CREATE TABLE ... ----- - -Run the following queries to confirm and show your table schemas. - -[source, sql] ----- -PRAGMA table_info(categories); ----- - -[source, sql] ----- -PRAGMA table_info(books); ----- -==== - - -=== Question 3 (2 pts) - -.. Update the function "get_category" from project 11. After you get the information about categories from the website, populate the "categories" table with that data. -.. 
Run a couple of queries that demonstrate that the data was successfully inserted into the database. - -[TIP] -==== -Here is partial code to assist. - -[source,python] ----- -import sqlite3 -# connect to database -conn = sqlite3.connect('onlinebooks.db') -cur = conn.cursor() -for category in categories: - cur.execute('INSERT INTO CATEGORIES (CATEGORY) VALUES (?)',(category,)) -conn.commit() -conn.close() ----- -==== - -=== Question 4 (2 pts) - -.. Update the function "get_all_books" from project 11. After you get the information about books from from website, populate the "books" table with that data. You may need to scrap new data for a new field of "category" that the book belongs to. -.. Run a couple of queries that demonstrate that the data was successfully inserted into the database. - -[TIP] -==== -In project 11, we used an associate array to hold the book_info like this: - -[source,python] -book_info = { - ....book = { - "title":title, - "price":price, - "category":category_name, - "availability":availability -} -} - -We may need to use a different data structure like tuple to hold book information since we need to insert it to the books table, like this: -[source,python] -book_info =(title,price,category_name,availability) -==== -[TIP] -==== -Here is partial code to assist. - -[source,python] ----- -import sqlite3 - -... -# code to get book information -book_info =(title,price,category_name,availability) -# connect to database -conn = sqlite3.connect('onlinebooks.db') -cur = conn.cursor() -for article in articles: - cur.execute('INSERT INTO BOOKS (title,category,price,availability) VALUES (?,?,?,?)',book_info) -conn.commit() -conn.close() ----- -==== - -Project 12 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project12.ipynb` -* Submit files through Gradescope -==== -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project13.adoc deleted file mode 100644 index a0abb3994..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project13.adoc +++ /dev/null @@ -1,216 +0,0 @@ -= TDM 30100: Project 13 -- 2023 - -**Motivation:** Containers are everywhere and a very popular method of packaging an application with all of the requisite dependencies. This project we will learn some basics of containerization in a virtual environment using Alpine Linux. We first will start a virtual machine on Anvil, then create a simple container in the virtual machine. You may find more information about container and relationship between virtual machine and container here: https://www.redhat.com/en/topics/containers/whats-a-linux-container - -**Context:** The project is to provide very foundational knowledge about containers and virtualization, focusing on theoretical understanding and basic system interactions. 
- -**Scope:** Python, containers, UNIX - -.Learning Objectives -**** -- Improve your mental model of what a container is and why it is useful. -- Use UNIX tools to effectively create a container. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 (1 pt) - -[loweralpha] - -.. Logon to Anvil and use a bash command to find an available port you may use later. You only need to list 1 available port number. - -[TIP] -==== -- You may use the following code to find a port in the range 1025 to 65535. -- You may use a loop around the following code, to find an available port (instead of manually trying one by one), or you can find an available port in a different way, if you prefer. -[source, bash] ----- -if timeout 2 bash -c ">&/dev/tcp/127.0.0.1/1025" 2>/dev/null; then - echo "Port used" -else - echo "Port available" ----- -==== - -=== Question 2 (2 pts) - -.. Launch a virtual machine (VM) on Anvil. (Note that Docker is already pre-installed on Anvil.) Submit the output showing the job id and process id, after you start a virtual machine; it should look like this, for example; - -[source,bash] ----- -.output -[1] 3152048 ----- - -[NOTE] -==== -The most popular containerization tool at the time of writing is likely Docker. We will Launch a virtual machine on Anvil (which already has Docker pre-installed). - -Open up a terminal on Anvil. You may do it from within Jupyter Lab. Run the following code, to ensure that the SLURM environment variables don't alter or effect our SLURM job. - -[source,bash] ----- -for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done; ----- - -Next, let's make a copy of a pre-made operating system image. This image has Alpine Linux and a few basic tools installed, including: nano, vim, emacs, and Docker. - -[source,bash] ----- -cp /anvil/projects/tdm/apps/qemu/images/builder.qcow2 $SCRATCH ----- - -Next, we want to acquire enough resources (CPU and memory) to not have to worry about something not working. To do this, we will use SLURM to launch a job with 4 cores and about 8GB of memory. - -[source,bash] ----- -salloc -A cis220051 -p shared -n 4 -c 1 -t 04:00:00 ----- - -Next, we need to make `qemu` available to our shell. Open a terminal and run the following code - -[source,bash] ----- -module load qemu -# check the module loaded -module list ----- - -Next, let's launch our virtual machine with about 8GB of memory and 4 cores. Replace the "[port]" with the port number that you got from question 1. - -[source,bash] ----- -qemu-system-x86_64 -vnc none,ipv4 -hda $SCRATCH/builder.qcow2 -m 8G -smp 4 -enable-kvm -net nic -net user,hostfwd=tcp::[port]-:22 & ----- - -[IMPORTANT] -==== -- [port] needs to be replaced with your port number -==== - -Next, it is time to connect to our virtual machine. We will use `ssh` to do this. - -[source,bash] ----- -ssh -p [port] tdm@localhost -o StrictHostKeyChecking=no ----- - -If the command fails, try waiting a minute and rerunning the command -- it may take a minute for the virtual machine to boot up. - -When prompted for a password, enter `purdue`. Your username is `tdm` and password is `purdue`. - -Finally, now that you have a shell in your virtual machine, you can do anything you want! You have superuser permissions within your virtual machine! -For this question, submit a screenshot showing the output of `hostname` from within your virtual machine! 
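A minimal sanity check once you are connected over `ssh` might look like the following (the exact hostname reported by your virtual machine may differ):

[source,bash]
----
hostname    # the VM's hostname, for your screenshot
whoami      # should print: tdm
uname -a    # kernel and architecture of the Alpine guest
----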
- -==== - - -=== Question 3 (1 pt) - -.. Exploring the virtual machine File System. Navigate the Alpine Linux file system and list the contents of the root directory. -.. List all running processes in the system -.. Display network configuration and test network connectivity. -[TIP] -==== -- You may refer to the following sample code or create your own ones - -[source, bash] ----- -ls / # list all root files ----- - -[source, bash] ----- -ps aux # system running processes ----- - -[source,bash] ----- -ifconfig # network interface configuration ----- - -==== - -=== Question 4 (2 pts) -.. Write and execute a simple shell script to display a message, like - -[TIP] -==== -[source, bash] ----- -echo 'Hello Your name, You are the Best!!!' > hello.sh ----- - -- run the shell script - -[source, bash] ----- -chmod +x hello.sh -./hello.sh ----- -==== - - -=== Question 5 (2 pts) - -After you complete the previous questions, you can see that you can use the virtual machine just like your own computer. Now let us follow the following step, to use Docker within the virtual machine to create and manage a container. Run all the commands in your terminal, copy the output to your jupyter notebook cells. - -.. List the docker version inside the virtual machine -[source, bash] ----- -docker --version ----- - -.. Pull the "hello-world" image from Docker Hub - -[source, bash] ----- -docker pull hello-world ----- - -..Run a container based on the "hello-world" image - -[source, bash] ----- -docker run hello-world ----- - -[NOTE] -==== -When the command runs, docker will create a container from the 'hello-world' image and run it. The container will display a message confirming that everything worked, and then it will exit. -==== - -.. List the container(s) with following command. It will provide you all the containers that are currently running or that exited already. -[source, bash] ----- -docker ps -a ----- - -.. After you confirm the container ran successfully, you may using following command to remove it - -[source, bash] ----- -docker rm [Container_id] ----- - -[TIP] -==== -Replace [Container_id] with the id that you got from previous question. -==== - -Project 13 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project13.ipynb` -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project14.adoc b/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project14.adoc deleted file mode 100644 index 53ac8fc34..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-project14.adoc +++ /dev/null @@ -1,53 +0,0 @@ -= TDM 30100: Project 14 -- Fall 2023 - -**Motivation:** We covered a _lot_ this year! When dealing with data driven projects, it is crucial to thoroughly explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance. 
As this is our final project for the semester, its primary purpose is survey-based. You will answer a few questions mostly by revisiting the projects you have completed. - -**Context:** We are on the last project, where we will revisit our previous work to consolidate our learning and insights. This reflection also helps us set our expectations for the upcoming semester. - -**Scope:** Unix, SQLite, R, Python, Jupyter Lab, Anvil - - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - - -=== Question 1 (1 pt) - -.. Reflecting on your experience working with different datasets, which one did you find most enjoyable, and why? Discuss how this dataset's features influenced your analysis and visualization strategies. Illustrate your explanation with an example from one question that you worked on, using the dataset. - -=== Question 2 (1 pt) - -.. Reflecting on your experience working with different commands, functions, modules, and packages, which one is your favorite, and why do you enjoy learning about it? Please provide an example from one question that you worked on, using this command, function, module, or package. - - -=== Question 3 (1 pt) - -.. Reflecting on the data visualization questions that you have done, which one do you consider most appealing? Which specific package did you use to create it? Please provide an example from one question that you completed. You may refer to the question, and include a screenshot of your graph. - -=== Question 4 (2 pts) - -.. While working on the projects, including statistics and testing, what steps did you take to ensure that your results were correct? Please illustrate your approach using an example from one problem that you addressed this semester. - -=== Question 5 (1 pt) - -.. Reflecting on the projects that you completed, which question(s) did you feel were most confusing, and how could they be made clearer? Please use a specific question to illustrate your points. - -=== Question 6 (2 pts) - -.. Please identify 3 skills or topics related to data science that you want to learn. For each, please provide an example that illustrates your interest, and explain why you think it would be beneficial. - - -Project 14 Assignment Checklist -==== -* Jupyter Lab notebook with your answers and examples. You may simply use markdown format for all questions. - ** `firstname-lastname-project14.ipynb` -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-projects.adoc deleted file mode 100644 index 1b9affaf5..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/30100/30100-2023-projects.adoc +++ /dev/null @@ -1,45 +0,0 @@ -= TDM 30100 - -xref:fall2023/logistics/office_hours_301.adoc[[.custom_button]#TDM 301 Office Hours#] -xref:fall2023/logistics/301_TAs.adoc[[.custom_button]#TDM 301 TAs#] -xref:fall2023/logistics/syllabus.adoc[[.custom_button]#Syllabus#] - -== Project links - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$30100-2023-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -=== Sign up - -https://piazza.com/purdue/fall2023/tdm30100[Sign Up] - -=== Link - -https://piazza.com/purdue/fall2023/tdm30100/home[Homepage] - -== Syllabus - -See xref:fall2023/logistics/syllabus.adoc[here]. diff --git a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project01.adoc b/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project01.adoc deleted file mode 100644 index 7b4fde3ad..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project01.adoc +++ /dev/null @@ -1,470 +0,0 @@ -= TDM 40100: Project 1 -- 2023 - -**Motivation:** It's been a long summer! Last year, you got some exposure command line tools, SQL, Python, and other fun topics like web scraping. This semester, we will continue to work primarily using Python with data. Topics will include things like: documentation using tools like sphinx, or pdoc, writing tests, sharing Python code using tools like pipenv, poetry, and git, interacting with and writing APIs, as well as containerization. Of course, like nearly every other project, we will be be wrestling with data the entire time. - -We will start slowly, however, by learning about Jupyter Lab. In this project we are going to jump head first into The Data Mine. We will load datasets into the R environment, and introduce some core programming concepts like variables, vectors, types, etc. As we will be "living" primarily in an IDE called Jupyter Lab, we will take some time to learn how to connect to it, configure it, and run code. 
- -.Insider Knowledge -[%collapsible] -==== -IDE stands for Integrated Developer Environment: software that helps us program cleanly and efficiently. -==== - -**Context:** This is our first project as a part of The Data Mine. We will get situated, configure the environment we will be using throughout our time with The Data Mine, and jump straight into working with data! - -**Scope:** R, Jupyter Lab, Anvil - -.Learning Objectives -**** -- Read about and understand computational resources available to you. -- Learn how to run R code in Jupyter Lab on Anvil. -- Read and write basic (.csv) data using R. -**** - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/1991.csv` -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` -- `/anvil/projects/tdm/data/disney/flight_of_passage.csv` - -== Setting Up to Work - - -++++ - -++++ - - -This year we will be using Jupyter Lab on the Anvil cluster. Let's begin by launching your own private instance of Jupyter Lab using a small portion of the compute cluster. - -Navigate and login to https://ondemand.anvil.rcac.purdue.edu using your ACCESS credentials (including 2-factor authentication using Duo Mobile). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward. - -[TIP] -==== -If you did not (yet) set up your 2-factor authentication credentials with Duo, you can set up the credentials here: https://the-examples-book.com/starter-guides/anvil/access-setup -==== - -Towards the middle of the top menu, click on the item labeled btn:[My Interactive Sessions]. (Depending on the size of your browser window, there might only be an icon; it is immediately to the right of the menu item for The Data Mine.) On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Bioinformatics, Interactive Apps, and The Data Mine. Under "The Data Mine" section, near the bottom of your screen, click on btn:[Jupyter Notebook]. (Make sure that you choose the Jupyter Notebook from "The Data Mine" section.) - -If everything was successful, you should see a screen similar to the following. - -image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"] - -Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 1 CPU core and 1918 MB of memory. - -[NOTE] -==== -As you can see in the screenshot above, each core is associated with 1918 MB of memory. If you know how much memory your project will need, you can use this value to choose how many cores you want. In this and most of the other projects in this class, 1-2 cores is generally enough. -==== - -[NOTE] -==== -Please use 4 cores for this project. This is _almost always_ excessive, but for this project in question 3 you will be reading in a rather large dataset that will very likely crash your kernel without at least 3-4 cores. -==== - -We use the Anvil cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine. - -After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on this button to launch your Jupyter Lab instance. 
Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following. - -image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"] - -There are 2 primary options that you will need to know about. - -seminar:: -The `seminar` kernel runs Python code but also has the ability to run R code or SQL queries in the same environment. - -[TIP] -==== -To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/current-projects/templates[our template page]. -==== - -seminar-r:: -The `seminar-r` kernel is intended for projects that **only** use R code. When using this environment, you will not need to prepend `%%R` to the top of each code cell. - -For now, let's focus on the `seminar` kernel. Click on btn:[seminar], and a fresh notebook will be created for you. - - -The first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/current-projects/_attachments/project_template.ipynb[our project template] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`). - -Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`. - -There are 2 main types of cells in a notebook: code cells (which contain code which you can run), and markdown cells (which contain comments about your work). - -Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A. - -[TIP] -==== -Make sure to read about and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -==== - - -== Questions - -=== Question 1 (1 pt) -[upperalpha] -.. How many cores and how much memory (in GB) does Anvil's sub-cluster A have? (0.5 pts) -.. How many cores and how much memory (in GB) does your personal computer have? - - -++++ - -++++ - - -For this course, projects will be solved using the https://www.rcac.purdue.edu/compute/anvil[Anvil computing cluster]. - -Each _cluster_ is a collection of nodes. Each _node_ is an individual machine, with a processor and memory (RAM). Use the information on the provided webpages to manually calculate how many cores and how much memory is available for Anvil's "sub-cluster A". - -Take a minute and figure out how many cores and how much memory is available on your own computer. If you do not have a computer of your own, work with a friend to see how many cores there are, and how much memory is available, on their computer. - -[TIP] -==== -Information about the core and memory capacity of Anvil "sub-clusters" can be found https://www.rcac.purdue.edu/compute/anvil[here]. - -Information about the core and memory capacity of your computer is typically found in the "About this PC" section of your computer's settings. -==== - -.Items to submit -==== -- A sentence (in a markdown cell) explaining how many cores and how much memory is available to Anvil sub-cluster A. -- A sentence (in a markdown cell) explaining how many cores and how much memory is available, in total, for your own computer. -==== - -=== Question 2 (1 pt) -[upperalpha] -.. Using Python, what is the name of the node on Anvil you are running on? -.. 
Using Bash, what is the name of the node on Anvil you are running on? -.. Using R, what is the name of the node on Anvil you are running on? - -++++ - -++++ - - -Our next step will be to test out our connection to the Anvil Computing Cluster! Run the following code snippets in a new cell. This code runs the `hostname` command and will reveal which node your Jupyter Lab instance is running on (in three different languages!). What is the name of the node on Anvil that you are running on? - -[source,python] ----- -import socket -print(socket.gethostname()) ----- - -[source,r] ----- -%%R - -system("hostname", intern=TRUE) ----- - -[source,bash] ----- -%%bash - -hostname ----- - -[TIP] -==== -To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu. -==== - -Check the results of each code snippet to ensure they all return the same hostname. Do they match? You may notice that `R` prints some extra "junk" output, while `bash` and `Python` do not. This is nothing to be concerned about as different languages can handle output differently, but it is good to take note of. - -.Items to submit -==== -- Code used to solve this problem, along with the output of running that code. -==== - -=== Question 3 (1 pt) -[upperalpha] -.. Run each of the example code snippets below, and include them and their output in your submission to get credit for this question. - - -++++ - -++++ - - -[TIP] -==== -Remember, in the upper right-hand corner of your notebook you will see the current kernel for the notebook, `seminar`. If you click on this name you will have the option to swap kernels out -- no need to do this now, but it is good to know! -==== - -Practice running the following examples. - -python:: -[source,python] ----- -my_list = [1, 2, 3] -print(f'My list is: {my_list}') ----- - -SQL:: -[source, sql] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -[source, ipython] ----- -%%sql - -SELECT * FROM titles LIMIT 5; ----- - -bash:: -[source,bash] ----- -%%bash - -awk -F, '{miles=miles+$19}END{print "Miles: " miles, "\nKilometers:" miles*1.609344}' /anvil/projects/tdm/data/flights/subset/1991.csv ----- - -[NOTE] -==== -In the above examples you will see lines such as `%%R` or `%%sql`. These are called "Line Magic". They allow you to run non-Python code in the `seminar` kernel. In order for line magic to work, it MUST be on the first line of the code cell it is being used in (before any comments or any code in that cell). - -In the future, you will likely stick to using the kernel that matches the project language, but we wanted you to have a demonstration about "line magic" in Project 1. Line magic is a handy trick to know! - -To learn more about how to run various types of code using the `seminar` kernel, see https://the-examples-book.com/projects/current-projects/templates[our template page]. -==== - -.Items to submit -==== -- Code from the examples above, and the outputs produced by running that code. -==== - -=== Question 4 (1 pt) -[upperalpha] -.. Using Python, calculate how how much memory (in bytes) the A sub-cluster of Anvil has. Calculate how much memory (in TB) the A sub-cluster of Anvil has. (0.5 pts) -.. Using R, calculate how how much memory (in bytes) the A sub-cluster of Anvil has. Calculate how much memory (in TB) the A sub-cluster of Anvil has. (0.5 pts) - - -++++ - -++++ - - -[NOTE] -==== -"Comments" are text in code cells that are not "run" as code. 
They serve as helpful notes on how your code works. Always comment your code well enough that you can come back to it after a long amount of time and understand what you wrote. In R and Python, single-line comments can be made by putting `#` at the beginning of the line you want commented out. -==== - -[NOTE] -==== -Spacing in code is sometimes important, sometimes not. The two things you can do to find out what applies in your case are looking at documentation online and experimenting on your own, but we will also try to stress what spacing is mandatory and what is a style decision in our videos. -==== - -In question 1 we answered questions about cores and memory for the Anvil clusters. This time, we want you to convert your GB memory amount from question 1 into bytes and terabytes. Instead of using a calculator (or paper, or mental math for you good-at-mental-math folks), write these calculations using R _and_ Python, in separate code cells. - -[TIP] -==== -A Gigabyte is 1,000,000,000 bytes. -A Terabyte is 1,000 Gigabytes. -==== - -[TIP] -==== -https://www.datamentor.io/r-programming/operator[This link] will point you to resources about how to use basic operators in R, and https://www.tutorialspoint.com/python/python_basic_operators.htm[this one] will teach you about basic operators in Python. -==== - -.Items to submit -==== -- Python code to calculate the amount of memory in Anvil sub-cluster A in bytes and TB, along with the output from running that code. -- R code to calculate the amount of memory in Anvil sub-cluster A in bytes and TB, along with the output from running that code. -==== - -=== Question 5 (1 pt) -[upperalpha] -.. Load the "flight_of_passage.csv" data into an R dataframe called "dat". -.. Take the head of "dat" to ensure your data loaded in correctly. -.. Change the name of "dat" to "flight_of_passage", remove the reference to "dat", and then take the head of "dat" and "flight of passage" in order to ensure that your actions were successful. - - -++++ - -++++ - - -In the previous question, we ran our first R and Python code (aside from _provided_ code). In the fall semester, we will focus on learning R. In the spring semester, we will learn some Python. Throughout the year, we will always be focused on working with data, so we must learn how to load data into memory. Load your first dataset into R by running the following code. - -[source,ipython] ----- -%%R - -dat <- read.csv("/anvil/projects/tdm/data/disney/flight_of_passage.csv") ----- - -Confirm that the dataset has been read in by passing the dataset, `dat`, to the `head()` function. The `head` function will return the first 5 rows of the dataset. - -[source,r] ----- -%%R - -head(dat) ----- - -[IMPORTANT] -==== -Remember -- if you are in a _new_ code cell on the , you'll need to add `%%R` to the top of the code cell, otherwise, Jupyter will try to run your R code using the _Python_ interpreter -- that would be no good! -==== - -`dat` is a variable that contains our data! We can name this variable anything we want. We do _not_ have to name it `dat`; we can name it `my_data` or `my_data_set`. - -Run our code to read in our dataset, this time, instead of naming our resulting dataset `dat`, name it `flight_of_passage`. Place all of your code into a new cell. Be sure there is a level 2 header titled "Question 5", above your code cell. - -[TIP] -==== -In markdown, a level 2 header is any line starting with 2 hashtags. For example, `Question X` with two hashtags beforehand is a level 2 header. 
When rendered, this text will appear much larger. You can read more about markdown https://guides.github.com/features/mastering-markdown/[here]. -==== - -[NOTE] -==== -We didn't need to re-read in our data in this question to make our dataset be named `flight_of_passage`. We could have re-named `dat` to be `flight_of_passage` like this. - -[source,r] ----- -flight_of_passage <- dat ----- - -Some of you may think that this isn't exactly what we want, because we are copying over our dataset. You are right, this is certainly _not_ what we want! What if it was a 5GB dataset, that would be a lot of wasted space! Well, R does copy on modify. What this means is that until you modify either `dat` or `flight_of_passage` the dataset isn't copied over. You can therefore run the following code to remove the other reference to our dataset. - -[source,r] ----- -rm(dat) ----- -==== - -.Items to submit -==== -- Code to load the data into a dataframe called `dat` and take the head of that data, and the output of that code. -- Code to change the name of `dat` to `flight_of_passage` and remove the variable `dat`, and to take the head of `flight_of_passage` to ensure the name-change worked. -==== - -=== Question 6 (2 pts) - -++++ - -++++ - -Review your Python, R, and bash skills. For each language, choose at least 1 dataset from `/anvil/projects/tdm/data`, and analyze it. Both solutions should include at least 1 custom function, and at least 1 graphic output. - -[NOTE] -==== -Your `bash` solution can be both plotless and without a custom function. -==== - -Make sure your code is complete, and well-commented. Include a markdown cell with your short analysis (1 sentence is fine), for each language. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 7 (1 pt) - -++++ - -++++ - -The module system, `lmod`, is extremely popular on HPC (High Performance Computing) systems. Anvil is no exception and uses `lmod`! - -In a terminal, take a look at the modules available to you by default. - -[source,bash] ----- -module avail ----- - -Notice that at the very top, you'll have a list named: `/anvil/projects/tdm/opt/lmod`. - -Now run the following commands: - -[source,bash] ----- -module reset -module avail ----- - -Notice how the set of available modules changes! By default, we have it loaded up with some Datamine-specific modules. To manually load up those modules, run the following. - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module avail ----- - -Notice how at the very top, there is a new section named `/anvil/projects/tdm/opt/core` with a single option, `tdm/default`. - -Go ahead and load up `tdm/default`. - -[source,bash] ----- -module load tdm -module avail ----- - -It looks like we are (pretty much) back to where we started off! This is useful to know in case there is ever a situation where you'd like to SSH into Anvil and load up our version of Python with the packages we have ready-made for you to use. - -To finish off this "question", run the following command and note in the result in the jupyter notebook. - -[source,bash] ----- -which python3 ----- - -Okay, now, load up our `python/f2022-s2023` module and run `which python3` once again. What is the result? Surprised by the result? Any ideas what this is doing? If you are curious, feel free to ask in Piazza! Otherwise, congratulations, you've made it through the first project! 
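If it helps to see everything in one place, the full sequence for this question might look roughly like the following when typed into a terminal (a sketch only -- the modules available to you can change over time):

[source,bash]
----
module reset
module use /anvil/projects/tdm/opt/core
module load tdm

# which python3 is found *before* loading the Data Mine Python environment?
which python3

# now load the environment mentioned above and check again
module load python/f2022-s2023
which python3
----

Comparing the two `which python3` results should make it much easier to explain what loading the module actually changes.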
- -.Items to submit -==== -- The output from running `which python3` before and after loading the `python/f2022-s2023` module. -- Any other comments you'd like to include. -==== - -=== Submitting your Work - -++++ - -++++ - -Congratulations, you just finished your first assignment for this class! Now that we've written some code and added some markdown cells to explain what we did, we are ready to submit our assignment. For this course, we will turn in a variety of files, depending on the project. - -We will always require a Jupyter Notebook file. Jupyter Notebook files end in `.ipynb`. This is our "source of truth" and what the graders will turn to first when grading. - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this. -==== - -A `.ipynb` file is generated by first running every cell in the notebook (which can be done quickly by pressing the "double play" button along the top of the page), and then clicking the "Download" button from menu:File[Download]. - -In addition to the `.ipynb` file, an additional file should be included for each programming language in the project containing all of the code from that langauge that is in the project. A full list of files required for the submission will be listed at the bottom of the project page. - -Let's practice. Take the R code from this project and copy and paste it into a text file with the `.R` extension. Call it `firstname-lastname-project01.R`. Do the same for each programming language, and ensure that all files in the submission requirements below are included. Once complete, submit all files as named and listed below to Gradescope. - -.Items to submit -==== -- `firstname-lastname-project01.ipynb`. -- `firstname-lastname-project01.R`. -- `firstname-lastname-project01.py`. -- `firstname-lastname-project01.sql`. -- `firstname-lastname-project01.sh`. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - -Here is the Zoom recording of the 4:30 PM discussion with students from 21 August 2023: - -++++ - -++++ diff --git a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project02.adoc b/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project02.adoc deleted file mode 100644 index 4c72c5e54..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project02.adoc +++ /dev/null @@ -1,392 +0,0 @@ -= TDM 40100: Project 2 -- 2023 - -**Motivation:** The ability to use SQL to query data in a relational database is an extremely useful skill. 
What is even more useful is the ability to build a `sqlite3` database, design a schema, insert data, create indexes, etc. This series of projects is focused around SQL, `sqlite3`, with the opportunity to use other skills you've built throughout the previous years. - -**Context:** In TDM 20100, you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with `sqlite3` -- a powerful database engine. In this project (and following projects), we will branch into SQL and `sqlite3`-specific topics and techniques that you haven't yet had exposure to in The Data Mine. - -**Scope:** `sqlite3`, lmod, SQL - -.Learning Objectives -**** -- Create your own `sqlite3` database file. -- Analyze a large dataset and formulate `CREATE TABLE` statements designed to store the data. -- Insert data into your database. -- Run one or more queries to test out the end result. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_authors.json` -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_series.json` -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json` -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_reviews_dedup.json` - -== Questions - -=== Question 1 (1 pt) -[upperalpha] -.. Write a function to determine how big our 4 files are, given a size to return in. (i.e. bytes, megabytes, etc.) -.. Run your function given bytes, kilobytes, megabytes, and gigabytes in turn. -.. Approximately how many books, reviews, and authors are included in the datasets? -.. Write code to get the size of one of the images in our dataset in bytes, without downloading the image. - -The goodreads dataset contains a variety of files. With that being said there are 4 files which hold the bulk of the data. The rest is _mostly_ derivatives of those 4 files. - -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_authors.json` -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_series.json` -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json` -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_reviews_dedup.json` - -Write a `bash` function that takes an argument that indicates the size unit to return in, and returns the size of a file in that unit by running `du`. For example, if I wanted to know how many **bytes** the `goodreads_books.json` file takes up using your function, I could run the following: - -[source,bash] ----- -get_size b ----- - -[NOTE] -==== -In the kernel on Anvil that we are using, "shells" do not persist across different cells. What this means practically is that any bash function you define is only defined in that cell. When running a function, it must be defined in the same cell that you are running it in. -==== - -[TIP] -==== -`du` is a `bash` command that returns the size of files or directories on the filesystem. If I wanted to, for example, know how many megabytes the 'Summer23' directory is taking up on my filesystem, I could run the following: -[source,bash] ----- -du -BMB $HOME/Summer23 ----- -==== - -Run your function 4 times, passing in bytes, kilobytes, megabytes, and gigabytes in turn. 
Does your output make sense for each size? If not, double check your function to make sure you are using the correct unit of size for each `du` command. Don't forget to try `du --help` if you are having trouble. - -[NOTE] -==== -For this project we will be using the convention that 1MB is 1,000,000 bytes, and 1GB is 1,000,000,000 bytes. We understand that historically this may have differed, but we will be operating off this convention for this project. -==== - -_Approximately_ how many books, reviews, and authors are included in the datasets? - -For this part of the question, I would recommend first looking at the head of each file using `head -n10 filename`, replacing filename with your file. How many books are on each line of the file on average? Once you know this, you can count how many lines are in the file (using `wc`) and extrapolate to find your solution. - -Finally, let's take a look at the first book. - ----- -{"isbn": "0312853122", "text_reviews_count": "1", "series": [], "country_code": "US", "language_code": "", "popular_shelves": [{"count": "3", "name": "to-read"}, {"count": "1", "name": "p"}, {"count": "1", "name": "collection"}, {"count": "1", "name": "w-c-fields"}, {"count": "1", "name": "biography"}], "asin": "", "is_ebook": "false", "average_rating": "4.00", "kindle_asin": "", "similar_books": [], "description": "", "format": "Paperback", "link": "https://www.goodreads.com/book/show/5333265-w-c-fields", "authors": [{"author_id": "604031", "role": ""}], "publisher": "St. Martin's Press", "num_pages": "256", "publication_day": "1", "isbn13": "9780312853129", "publication_month": "9", "edition_information": "", "publication_year": "1984", "url": "https://www.goodreads.com/book/show/5333265-w-c-fields", "image_url": "https://images.gr-assets.com/books/1310220028m/5333265.jpg", "book_id": "5333265", "ratings_count": "3", "work_id": "5400751", "title": "W.C. Fields: A Life on Film", "title_without_series": "W.C. Fields: A Life on Film"} ----- - -If we want to find the size of the image for this book (given by the `image_url` field), we could just download it and run `du`. However, if we wanted to do this on a whole bunch of images it would be extremely slow and take up an enormous amount of space on our filesystem, which is extremely unideal for large datasets. - -Instead, use `curl` in order to get the headers of the image, and `grep` to isolate the `Content-Length` portion of the header. Take the time to figure out how this works and make your solution clean, as we will be building on this idea in the next question and it will be tough if you don't understand it or have a very convoluted solution. - -[TIP] -==== -`curl -ILs ` will return the headers for a given url. -==== - -[NOTE] -==== -It is okay to manually copy/paste the link from the json. -==== - -.Items to submit -==== -- A function to calculate the size of a file in a given size. -- The size of each of the four files listed in bytes, kilobytes, megabytes, and gigabytes. -- The approximate number of books and reviews in the dataset. -- The size of the image for the first book, in bytes. -==== - -=== Question 2 (2 pts) -[upperalpha] -.. Write a `bash` function to return the average size of _n_ random images from our data. -.. Run your function 4 times, passing in 25, 50, 100, and 1000 in turn. -.. Write 2-3 sentences to determine which average is likely closest to the true average of the dataset, and why. Include an explanations for why the values you got may not match the theoretical expectation. 
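[TIP]
====
For reference, the single-image size check from Question 1 condenses into a short sketch like the one below (the URL is simply the `image_url` from the example book record shown in Question 1 -- swap in any other link you extract):

[source,bash]
----
# image_url taken from the example W.C. Fields record shown in Question 1
url="https://images.gr-assets.com/books/1310220028m/5333265.jpg"

# request only the headers (-I), follow redirects (-L), stay quiet (-s),
# then keep the last Content-Length value -- the image size in bytes
curl -ILs "$url" | grep -i "content-length" | tail -n 1 | tr -dc '0-9'
echo   # tr strips the trailing newline, so add one back
----

Your function for this question repeats this idea inside a loop over many URLs.
====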
- -In the last question, we got the size of a single image from our data. However, more often than not we will want to do this with many images in order to approximate how much space the images take on average, or to get any other sorts of summary statistics about the images in our data. - -[IMPORTANT] -==== -In the previous question we said it was okay to manually copy/paste the `image_url` -- this time, you _probably_ won't want to do that. You can use a `bash` tool called `jq` to extract the links automatically. I would recommend you don't use `jq` within the body of your function as this will greatly slow down runtime. Run `jq` once to extract the links to a text file, then use that text file in your function. - -The `--raw-output` option to `jq` _may be_ useful as well. - -`shuf` can return a random subset of _n_ lines from a file, which will help you get a random subset of images to average. -==== - -Write a function that takes an argument _n_ that indicates the number of images to download, and returns the average size of _n_ random images from our data. An example general outline of this function is included below to push you in the right direction. - -[source,bash] ----- -avg_img_size () { - # Record start time - start_time=$(date +%s.%N) - - # set helper variables - books_file="/anvil/projects/tdm/data/goodreads/goodreads_samples/goodreads_books.json" - - # get a subset of _n_ image links - - # initialize accumulating variables - let total_size=0 count=0 - - # get size of each image, then average all sizes - # THIS IS THE BIG SECTION YOU WILL NEED TO FILL IN - - # print output - echo "Total Size: $total_size bytes" - echo "Num Files: $count" - echo "Average Size (N=$1): $average_size bytes" - - # remove temp files - - # Record end time, calculate elapsed time - end_time=$(date +%s.%N) - elapsed_time=$(echo "$end_time - $start_time" | bc) - echo "$elapsed_time sec runtime" -} ----- - -Your function can be tested as follows: - -[source, bash] ----- -avg_img_size 25 -echo " " -avg_img_size 50 -echo " " -avg_img_size 100 -echo " " -avg_img_size 1000 ----- - -[TIP] -==== -While retrieving the size of each image, you may notice that addition of the sizes is not working. If this is the case, try running `od -c` on the size to see if there are any hidden characters that could be causing problems. If so, you can use `tr` to remove them. -==== - -[NOTE] -==== -The `start_time` and `end_time` code that you see will print the time it takes your function to run. This can be a helpful tool while trying to improve runtime, as inefficient functions will really start to slow down as we increase the number of images we want to average. -==== - -Run your function using the testing code provided above, which returns the average size of images given a subset of 25, 50, 100, and 1000 images. In a markdown cell, write 2-3 sentences explaining which average you received is theoretically closest to the 'true average' size of an image in the dataset, and why. - -[NOTE] -==== -1000 images is a pretty large amount, so don't expect your function to finish running instantly. My solution to this question took 86 seconds for 1000 images! If you are having a hard time getting your 1000 image test to run in a reasonable amount of time, go back to the 25 or 50 image tests and try and speed that one up first. -==== - -.Items to submit -==== -- Function to calculate the average of _n_ random images from our data. -- Results of running functions on subsets of 25, 50, 100, and 1000 images. 
-- 2-3+ sentences explaining which subset theoretically produces the most accurate average to the whole dataset, why, and why your results may differ from the theoretical expectation. -==== - -=== Question 3 (2 pts) -[upperalpha] -.. Create a directory called `goodreads_samples` somewhere in your `$HOME` directory. -.. Create ~100mb random subsets of our 4 main files in your `goodreads_samples` directory. -.. Double check that your subset files are an appropriate size with `du`. - -Okay, so _roughly_, in total we are looking at around 27 gb of data. With that size it will _definitely_ be useful for us to create a database. After all, answering questions like: - -- What is the average rating of Brandon Sandersons books? -- What are the titles of the 5 books with the most number of ratings? - -would be quite difficult with the data in its current form. - -Realistically, things are not very straightforward if we hand you this data and say "get that info please". _However_, if we had a nice `sqlite` database the same tasks would be trivial! In the rest of this project, we will set up a `sqlite` database, and populate it with the data from the `goodreads` dataset with the end goal of creating a small database that make it easy to answer questions like the ones above. - -First, before we do that, it would make sense to get a sample of each of the datasets. Working with samples just makes it a lot easier to load the data up and parse through it. - -Use `shuf` to get a random sample of the `goodreads_books.json` and `goodreads_reviews_dedup.json` datasets. Approximate how many rows you'd need in order to get the datasets down to around 100 mb each, and do so. Put the samples, and copies of `goodreads_book_authors.json` and `goodreads_book_series.json` in a directory called `goodreads_samples` anywhere inside your $HOME directory. - -[WARNING] -==== -Do **NOT** use the `goodreads_samples` directory in the `goodreads` directory. This data is out of date and your results will not match ours, almost certainly causing you to lose points. -==== - -[NOTE] -==== -It just needs to be approximately 100mb -- no need to fuss, as long as it is within a 50mb margin it should be fine. -==== - -.Items to submit -==== -- `goodreads_samples` directory containing our 4 subset files. -- Code to check the size of our 4 subset files. -==== - -=== Question 4 (1 pt) -[upperalpha] -.. Write out the keys in each of the json files (excluding `goodreads_reviews_dedup.json`), and list the appropriate storage class to use. - -Check out the 5 storage classes (which you can think of as 'data types' in languages like Python, C, and R) that `sqlite3` uses: https://www.sqlite.org/datatype3.html - -When we are looking into constructing a database, we need to think about what types of data we need to be storing. For example, if we wanted to store a bunch of values of a structure like `12349234`, an INTEGER would likely work well. However, if we are trying to store values like `0012349234`, storing as an INTEGER will lose us our 2 leading zeroes. In this case, TEXT may be more appropriate. These sort of small technicalities can make a big difference in how well our data is stored in our database, so be sure to look at the different values in each field before assigning them a type appropriate to store those values. - -In a markdown cell, write out each of the keys in each of the json files (excluding `goodreads_reviews_dedup.json`), and list the appropriate storage class to use. 
For example, I've provided an example solution for `goodreads_reviews_dedup.json`. - -- user_id: TEXT -- book_id: INTEGER -- review_id: TEXT -- rating: INTEGER -- review_text: TEXT -- date_added: TEXT -- date_updated: TEXT -- read_at: TEXT -- started_at: TEXT -- n_votes: INTEGER -- n_comments: INTEGER - -[NOTE] -==== -You don't need to copy/paste the solution for `goodreads_reviews_dedup.json` since we provided it for you. -==== - -[IMPORTANT] -==== -You do not need to assign a type to the following keys in `goodreads_books.json`: `series`, `popular_shelves`, `similar_books`, and `authors`. -==== - -[TIP] -==== -- Assume `isbn`, `asin`, `kindle_asin`, `isbn13` columns _could_ start with a leading 0. -- Assume any column ending in `_id` could _not_ start with a leading 0. -==== - -.Items to submit -==== -- List of the keys in the json files and the appropriate storage class to use. -==== - -=== Question 5 (2 pts) -[upperalpha] -.. Write a `CREATE TABLE` statement for each of the 4 tables we planned in the previous question. -.. Run the provided code snippets below to verify that your tables were created correctly. - -We have done a lot of setup in the previous questions, and now we are finally ready to create our mini-database! We will do this using `CREATE TABLE` statements in `sqlite3`, while sourcing data from the `goodreads_sample` directory we created in the previous question. While you will have to run your `CREATE TABLE` statements in a terminal with `sqlite3` launched in order for them to work, please paste them into your notebook as well in order to recieve points for this question. - -Let's start by launching `sqlite3` like so: -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load sqlite/3.39.2 - -sqlite3 my.db # this will create an empty database ----- - -This will put us inside our `sqlite3` session, and we can start running our commands, `sqlite3`-specific dot functions, and SQL queries. While you work on the next portion of this question, feel free to use the following code to validate that your `CREATE TABLE` statements are working as expected. Additionally, https://www.sqlitetutorial.net[this website] has more information about sqlite for you to read up on in your extra time. - -[source,sql] ----- -.tables ----- - -[TIP] -==== -Running `.help` once you start your `sqlite3` session will give you a list of all the `sqlite3`-specific dot functions you can use. For example, `.tables` will list all the tables in your database. -==== - -For now, we will only worry about creating tables with the columns that we identified in the previous question. This means that we will be leaving out the `series`, `popular_shelves`, `similar_books`, and `authors` columns in the `goodreads_books.json` file. - -Now that we have our `sqlite3` session launched, let's create our first table. In Question 4, we essentially created an outline of what each table's columns would contain. Translate each of the lists of keys and storage classes you made into a `CREATE TABLE` statement, and run it in your `bash` shell. **Don't** forget to paste your statements into your Jupyter notebook so we can see them! As an example, I have provided the `CREATE TABLE` statement for the `goodreads_reviews_dedup.json` file below. You should still run this statement as you want to create this table as well. 
- -[source,sql] ----- -CREATE TABLE reviews ( - user_id TEXT, - book_id INTEGER, - review_id TEXT, - rating INTEGER, - review_text TEXT, - date_added TEXT, - date_updated TEXT, - read_at TEXT, - started_at TEXT, - n_votes INTEGER, - n_comments INTEGER -); ----- - -[NOTE] -==== -While concepts like primary and foreign keys are extremely important and useful, we will not be covering them in this project. For now, just focus on building the four tables we outlined in Question 4, ensuring you are using the correct types. We will also cover restrictions like `UNIQUE` or `NOT NULL` in future projects, so feel free to just make the basic table for now. -==== - -Finally, run all of the below statements in your Jupyter notebook to verify (and show us) that your tables were created correctly. - -[source,ipython] ----- -%sql sqlite:////home/x-jaxmattfair/my.db # change x-jaxmattfair to your username ----- - -[source,ipython] ----- -%%sql - -SELECT sql FROM sqlite_master WHERE name='reviews'; ----- - -[source,ipython] ----- -%%sql - -SELECT sql FROM sqlite_master WHERE name='books'; ----- - -[source,ipython] ----- -%%sql - -SELECT sql FROM sqlite_master WHERE name='series'; ----- - -[source,ipython] ----- -%%sql - -SELECT sql FROM sqlite_master WHERE name='authors'; ----- - -.Items to submit -==== -- SQL `CREATE TABLE` statements for each of the 4 tables, and to create your database. -- Code snippets above and the results of running those code snippets. -==== - -=== Submitting your Work - -Well done, you've finished your second project for this class and created your first database in `sqlite3`! Make sure to save your work and submit it to Gradescope in the correct format. - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this. -==== - -.Items to submit -==== -- `firstname-lastname-project02.ipynb`. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - -Here is the Zoom recording of the 4:30 PM discussion with students from 28 August 2023: - -++++ - -++++ diff --git a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project03.adoc b/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project03.adoc deleted file mode 100644 index cc09385e1..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project03.adoc +++ /dev/null @@ -1,300 +0,0 @@ -= TDM 40100: Project 3 -- 2023 - -**Motivation:** The ability to use SQL to query data in a relational database is an extremely useful skill. 
What is even more useful is the ability to build a `sqlite3` database, design a schema, insert data, create indexes, etc. This series of projects is focused around SQL, `sqlite3`, with the opportunity to use other skills you've built throughout the previous years. - -**Context:** In TDM 20100 you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with `sqlite3` -- a powerful database engine. In this project (and following projects), we will branch into SQL and `sqlite3`-specific topics and techniques that you haven't yet had exposure to in The Data Mine. - -**Scope:** `sqlite3`, lmod, SQL - -.Learning Objectives -**** -- Create your own `sqlite3` database file. -- Analyze a large dataset and formulate `CREATE TABLE` statements designed to store the data. -- Run one or more queries to test out the end result. -- Demonstrate the ability to normalize a series of database tables. -- Wrangle and insert data into database. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_authors.json` -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_series.json` -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json` -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_reviews_dedup.json` - -== Questions - -If for whatever reason you didn't do the previous project, please run the following code in a Jupyter notebook to create a `sqlite3` database called `my.db` in your `$HOME` directory (this is the code you wrote in the previous project, so you can skip this if you already did it). Feel free to move it into a subdirectory if you would like. 
- -[source,ipython] ----- -%%bash - -rm $HOME/my.db -sqlite3 $HOME/my.db "CREATE TABLE reviews ( - user_id TEXT, - book_id INTEGER, - review_id TEXT, - rating INTEGER, - review_text TEXT, - date_added TEXT, - date_updated TEXT, - read_at TEXT, - started_at TEXT, - n_votes INTEGER, - n_comments INTEGER -);" - -sqlite3 $HOME/my.db "CREATE TABLE books ( - isbn TEXT, - text_reviews_count INTEGER, - country_code TEXT, - language_code TEXT, - asin TEXT, - is_ebook INTEGER, - average_rating REAL, - kindle_asin TEXT, - description TEXT, - format TEXT, - link TEXT, - publisher TEXT, - num_pages INTEGER, - publication_day INTEGER, - isbn13 TEXT, - publication_month INTEGER, - edition_information TEXT, - publication_year INTEGER, - url TEXT, - image_url TEXT, - book_id TEXT, - ratings_count INTEGER, - work_id TEXT, - title TEXT, - title_without_series TEXT -);" - -sqlite3 $HOME/my.db "CREATE TABLE authors ( - average_rating REAL, - author_id INTEGER, - text_reviews_count INTEGER, - name TEXT, - ratings_count INTEGER -);" - -sqlite3 $HOME/my.db "CREATE TABLE series ( - numbered INTEGER, - note TEXT, - description TEXT, - title TEXT, - series_works_count INTEGER, - series_id INTEGER, - primary_work_count INTEGER -);" ----- - -[source,ipython] ----- -%sql sqlite:////home/x-myalias/my.db ----- - -[source,ipython] ----- -%%sql - -SELECT * FROM reviews limit 5; ----- - -[source,ipython] ----- -%%sql - -SELECT * FROM books limit 5; ----- - -[source,ipython] ----- -%%sql - -SELECT * FROM authors limit 5; ----- - -[source,ipython] ----- -%%sql - -SELECT * FROM series limit 5; ----- - -[source,ipython] ----- -%%bash - -rm -rf $HOME/goodreads_samples -mkdir $HOME/goodreads_samples -cp /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_authors.json $HOME/goodreads_samples/ -cp /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_series.json $HOME/goodreads_samples/ -shuf -n 25600 /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json > $HOME/goodreads_samples/goodreads_books.json -shuf -n 94200 /anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_reviews_dedup.json > $HOME/goodreads_samples/goodreads_reviews_dedup.json ----- - -=== Question 1 (1 pt) -[upperalpha] -.. Rename the `image_url` column from the `books` table to `book_cover`. -.. Show via a Before/After query that the column was renamed successfully. - -In the last project, we created a whole bunch of tables and columns to store our data. As of yet, however, they are still quite empty. Before we start populating them with data, we should always double check our field names and make sure they are as concise and descriptive as possible. As you will likely see in the future, tables with data in them are much more tricky to modify than empty tables. - -Before we start, let's get a quick look at `books` by running the below command to show all the current column names in our table. - -[source,ipython] ----- -%%sql -SELECT * FROM books LIMIT 0; ----- - -[TIP] -==== -If running the above code gives you an error relating to 'DATABASE_URL variable not set', make sure you are running the line of code that establishes a connection to your database first. This snippet is provided above, and is succinctly equivalent to `%sql sqlite:////filepath/to/database/my.db`. -==== - -First, let's rename the `image_url` column from the `books` table. This name doesn't tell us what the image even has to do with. 
Instead, let's name the column `book_cover`, which tells us that this column contains all the image URLs for the covers of the books. - -Remember, there is a wealth of online resources related to SQL that you can use to help you solve this problem. However, if you are having trouble figuring out where to start, the SQL `RENAME` command will be a good direction to start moving in. - -After renaming our column, let's verify our change by again querying all the columns in `books`. - -.Items to submit -==== -- Code to rename the column `image_url` to `book_cover`. -- Before and After query to show successful rename. -==== - -=== Question 2 (1 pt) -[upperalpha] -.. For each table, make the listed changes below through your Jupyter notebook as necessary to normalize the database. - -Check out a line of the `goodreads_books.json` data: - -[source,ipython] ----- -%%bash - -head -n 1 $HOME/goodreads_samples/goodreads_books.json ----- - -[TIP] -==== -Don't have a `goodreads_samples` directory? Run the last snippet of code above Question 1 (it starts with `rm -rf`) to create it. This directory is covered in more detail in the previous project, so it would be a good idea (although not strictly necessary) to reread that project before continuing. -==== - -Recall that in the previous project, we just ignored the following fields from the `books` table: `series`, `similar_books`, `popular_shelves`, and `authors`. We did this because those fields are more complicated to deal with. - -Read https://docs.microsoft.com/en-us/office/troubleshoot/access/database-normalization-description[this] article on database normalization from Microsoft. We are going to do our best to _normalize_ our tables with these previously ignored fields taken into consideration. Write 2-3 sentences in a markdown cell on the differences between 1st, 2nd, and 3rd normal forms, and the importance of normalizing our database. - -To elaborate on the provided reading material, let's briefly discuss primary and foreign keys. - -A 'primary key' can be thought of as a unique piece of information that all of the data in that row is tied to. For example, if I have an `employees` table with salary information, names, emails, phone numbers, and employee ids, the primary key would likely be the `employee_id` as it is unique to each employee. - -A 'foreign key' is a piece of information that is a primary key in another table. For example, if I have a `departments` table with department names and department ids, the primary key for that table, `department_id`, could be used as a foreign key in the `employees` table in order to indicate what department an employee is in. - -Let's begin getting into 1st form by setting some practical naming conventions. Note that these are not critical by any stretch, but can help remove some guesswork when navigating a database with many tables and ids. - -Remember, we created 4 tables: + -- `reviews` + -- `books` + -- `authors` + -- `series` + - -Go through each of these tables and make the following changes: + -[numeric] -. Every table's primary key should be named `id`, unless it is a composite key (more on these later). For example, instead of `book_id` in the `books` table, it would make sense to call that column `id` because `book` is implied from the table name. + -. Every table's foreign key should reference the `id` column of the foreign table and be named "foreign_table_name_id". 
For example, if we had a foreign key in the `books` table that referenced an author in the `authors` table, we should name that column `author_id`. + -. Other than the primary and foreign keys for a table, do not include `id` in a column name wherever possible. An example of this would be `work_id` in `books`, which could be renamed to `work_num`, for example. + -. Keep table names plural, when possible -- for example, not the `book` table, but the `books` table. + -. Keep column names singular, when appropriate -- for example, not the `authors` column, but the `author` column. + - -[TIP] -==== -You should change the following number of columns in each table: + -- `books`: 2 + -- `reviews`: 2 + -- `authors`: 1 + -- `series`: 1 + -==== - -.Items to submit -==== -- All code to modify tables to normalize our naming conventions. -==== - -=== Question 3 (1 pt) -[upperalpha] -.. Create a junction table between `authors` and `books` called `books_authors`. -.. Write an SQL query to find every book by author with id `12345`. - -Things so far have been pretty simple. Renaming columns and deciding types is a good starting point for normalizing our database, but now we need to start working on the more challenging bits. - -First off, take a look at https://stackoverflow.com/questions/70609439/what-is-the-point-of-a-junction-table[this stack overflow post] to get a basic idea for why junction tables are useful, and start thinking on your own about how they might fit into our database organization. - -For a more concrete example in our data, consider this: + -A book can have many authors, and an author can have many books. This is an example of a many-to-many relationship. While we could have `author1`, `author2`, `author3` columns, this would be very bad practice. How do we delegate authorship consistently between the columns? What if we have a book with 4 authors? 5? 10? 100? 1000? You can see how this approach quickly becomes unmanageable. - -Instead, we should create a junction table! Junction tables contain their own primary key, along with foreign keys to the tables they connect. In this case, we would have a `books_authors` table with a primary key, a `book_id` foreign key, and an `author_id` foreign key. This way, we can have as many authors as we want for a book, and as many books as we want for an author! - -Now that you (hopefully) have a good understanding of _junction tables_, create a _junction_ table (using a single `CREATE TABLE` statement) that effectively _normalizes_ the `authors` field in the `books` table. Call this new table `books_authors` (see point 4 in Question 2 -- this is the naming convention we want). - -Make sure to include your `CREATE TABLE` statement in your notebook after you create the table (remember, you will need to create the table in an sqlite3 session in a terminal). - -[TIP] -==== -There should be 3 columns in the `authors_books` table, a primary key field and two foreign key fields. -==== - -[IMPORTANT] -==== -Make sure to properly apply the https://www.sqlitetutorial.net/sqlite-primary-key/[primary key] and https://www.sqlitetutorial.net/sqlite-foreign-key/[foreign key] keywords. -==== - -Write an SQL query to find every book by author with id 12345. While you won't have any way to test results yet (as our database is still empty!), but you should still write out a query that at the least _looks_ roughly correct and returns a join of the 3 tables (which will return no data because all three tables are empty). 
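[TIP]
====
If the shape of that query is hard to picture, here is a sketch of the general junction-table join pattern, written against hypothetical `students`, `courses`, and `students_courses` tables (deliberately not this dataset, so it is not a copy-and-paste answer):

[source,sql]
----
-- every course taken by the student whose id is 42, going through the junction table
SELECT courses.*
FROM students
JOIN students_courses ON students_courses.student_id = students.id
JOIN courses ON courses.id = students_courses.course_id
WHERE students.id = 42;
----

Your query for author 12345 follows the same three-table shape, just with the tables you created in this project.
====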
- -[TIP] -==== -You will need to use _joins_ and our junction table to perform this query. -==== - -Make sure that all the work you did for this question is copied into your Jupyter notebook, and run when appropriate. - -.Items to submit -==== -- Your `CREATE TABLE` statement for the junction table. -- A draft SQL query to find every book by author with id 12345. -==== - - - -=== Submitting your Work - -Good work, you've made it to the end of your third project for TDM 401. In the fourth project, we will finish preparing your database for data entry! As always, ensure that all your work is visible as you expect in your submission to ensure you get the full points you deserve. - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this. -==== - -.Items to submit -==== -- `firstname-lastname-project03.ipynb`. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project04.adoc b/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project04.adoc deleted file mode 100644 index f7e3f6d49..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project04.adoc +++ /dev/null @@ -1,129 +0,0 @@ -= TDM 40100: Project 4 -- 2023 - - -**Motivation:** The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a `sqlite3` database, design a schema, insert data, create indexes, etc. This series of projects is focused around SQL, `sqlite3`, with the opportunity to use other skills you've built throughout the previous years. - -**Context:** In TDM 20100 you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with `sqlite3` -- a powerful database engine. In this project (and following projects), we will branch into SQL and `sqlite3`-specific topics and techniques that you haven't yet had exposure to in The Data Mine. - -**Scope:** `sqlite3`, lmod, SQL - -.Learning Objectives -**** -- Create your own `sqlite3` database file. -- Analyze a large dataset and formulate `CREATE TABLE` statements designed to store the data. -- Run one or more queries to test out the end result. -- Demonstrate the ability to normalize a series of database tables. -- Wrangle and insert data into database. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_authors.json` -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_book_series.json` -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_books.json` -- `/anvil/projects/tdm/data/goodreads/datadownloadsAugust2023/goodreads_reviews_dedup.json` - -== Questions - - - -=== Question 1 (1 pt) -[upperalpha] -.. Create a junction table between `books` and `series` called `books_series`. -.. Write a list of each column in the `books_series` table, their data type, and a brief description of what they contain. -.. Write an SQL query to find every book in a series with `series_id` 12345. - -Assume that a series can have many books and a book can be a part of many series. Following a similar process as in the previous project, create another junction table called `books_series`. - -In a markdown cell, list the columns in the `books_series` table, their data type, and a brief description of what they represent. - -Finally, write an SQL query to find every book in a series with `series id` 12345. - -.Items to submit -==== -- `CREATE TABLE` statement for `books_series`. -- List of columns in `books_series` with descriptions. -- SQL query to find every book in a series with `series id` 12345. -==== - -=== Question 2 (2 pts) -[upperalpha] -.. Create a new database called `my_goodreads.db` that is the same as our `my.db` database (after normalization), but with primary and foreign key constraints set. - -As you may have noticed, we have determined the primary and foreign keys for our tables, but haven't actually set them in the database yet. Unfortunately, sqlite3 has no way of setting these constraints after creating the able, so we will be rewriting our `CREATE TABLE` statements inside a new database labeled `my_goodreads.db`. - -Going through your Jupyter notebooks from the previous project (or the beginning of this project) recreate the existing database we have made, but this time with primary and foreign key constraints set. - -While doing this, remember to stick to the naming conventions and normalization we established in this project, and don't forget to create your junction tables. It is particularly important to set primary and foreign key constraints in junction tables, as this gives us an added layer of data integrity when we are trying to modify data in our database. - -[TIP] -==== -If you are struggling or need a sanity check on your work, below is the number of constraints you will need to set for each table. - -- `books` : 1 primary key -- `reviews`: 1 primary key -- `authors`: 1 primary key -- `series` : 1 primary key -- `authors_books` : 1 primary key, 2 foreign keys -- `books_series` : 1 primary key, 2 foreign keys - -You can check the constraints on a table by running `.schema table_name` in a sqlite3 session. -==== - -.Items to submit -==== -- Modified `CREATE TABLE` statements for our new database, with proper constraints set. -==== - -=== Question 3 (2 pts) -[upperalpha] -.. Create a new junction table to handle `similar_books`. -.. Create a new table and accompanying junction table to handle `popular_shelves`. -.. Write 2+ sentences detailing an alternative approach to handling `popular_shelves`, and any benefits your approach has over the one provided. -.. Write two queries that use all of the new tables created during this question. 
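[NOTE]
====
If you need a syntax refresher while writing the `CREATE TABLE` statements for this question (and for Question 2), the sketch below creates a generic junction table with a primary key and two foreign key constraints from Python. The table and column names (`books_authors`, `book_id`, `author_id`) follow the conventions from the previous project, but treat this purely as an illustration of the syntax, not as the exact schema you are asked to submit.

[source,python]
----
import sqlite3

conn = sqlite3.connect("my_goodreads.db")  # assumed file name from Question 2
cursor = conn.cursor()

# a junction table: its own primary key, plus one foreign key per linked table
cursor.execute("""
CREATE TABLE IF NOT EXISTS books_authors (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    book_id INTEGER NOT NULL,
    author_id INTEGER NOT NULL,
    FOREIGN KEY (book_id) REFERENCES books (id),
    FOREIGN KEY (author_id) REFERENCES authors (id)
);
""")

conn.commit()
conn.close()
----

Keep in mind that `sqlite3` only enforces foreign key constraints on a connection after `PRAGMA foreign_keys = ON;` has been executed.
====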
- -The remaining two fields that need to be dealt with are `similar_books` and `popular_shelves`. These fields are a bit more complicated than the others, so you will need to think a bit more before implementing a solution. Remember to keep normalization in mind. - -Firstly, for **similar_books**, I would recommend another junction table. As there is nothing to stop you from linking a table to itself, you can simply create a junction table that links books to each other. Write (and run) a `CREATE TABLE` statement that creates a junction table between `books` and `books` called `similar_books`. Paste your `CREATE TABLE` statement into your Jupyter notebook for reference. - -Next, for **popular_shelves**, let's create two more tables. First, create a `shelves` table with `id` and `name` columns. Choose data types appropriately. Next, create a junction table between `shelves` and `books` called `books_shelves` with the appropriate column names and types. Be sure to normalize everything, as we have been doing throughout this project. - -Now that we've created some rather straightforward approaches to handling these two data fields, do some more thinking on your own. Write at least two sentences in a markdown cell detaining a different approach to handling **popular_shelves**, and any benefits your approach has over this one. (Hint: _composite keys_!) - -Finally, write two queries that use all of the new tables we created (one for `similar_books` and one for `popular_shelves`). - -Ensure that all of your work is visible in your Jupyter notebook prior to submitting! - -.Items to submit -==== -- 3 `CREATE TABLE` statements for the new tables and junction tables. -- 2+ sentences in a markdown cell detailing your alternative approach and its benefits over the provided approach. -- 2 queries that use all of the new tables created during this question. -==== - -=== Submitting your Work - -Good work, you've made it to the end of your fourth project for TDM 401 and finished preparing your database for data entry! In the next project, we will begin populating our database with data! As always, ensure that all your work is visible as you expect in your submission to ensure you get the full points you deserve. - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in Gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== - -.Items to submit -==== -- `firstname-lastname-project04.ipynb`. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project05.adoc b/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project05.adoc deleted file mode 100644 index 37376b38e..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project05.adoc +++ /dev/null @@ -1,266 +0,0 @@ -= TDM 40100: Project 5 -- 2023 - -**Motivation:** The ability to use SQL to query data in a relational database is an extremely useful skill. What is even more useful is the ability to build a `sqlite3` database, design a schema, insert data, create indexes, etc. This series of projects is focused around SQL, `sqlite3`, with the opportunity to use other skills you've built throughout the previous years. - -**Context:** In TDM 20100, you had the opportunity to learn some basics of SQL, and likely worked (at least partially) with `sqlite3` -- a powerful database engine. In this project (and following projects), we will branch into SQL and `sqlite3`-specific topics and techniques that you haven't yet had exposure to in The Data Mine. - -**Scope:** `sqlite3`, lmod, SQL - -.Learning Objectives -**** -* [x] Create your own sqlite3 database file. -* [x] Analyze a large dataset and formulate CREATE TABLE statements designed to store the data. -* [ ] Run one or more queries to test out the end result. -* [x] Demonstrate the ability to normalize a series of database tables. -* [ ] Wrangle and insert data into database. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/goodreads/goodreads_book_authors.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_book_series.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_books.json` -- `/anvil/projects/tdm/data/goodreads/goodreads_reviews_dedup.json` - -== Questions - -As you can see from the "Learning Objectives" section above, we are already almost done with our learning objectives for this portion of the course! We are going to be focusing on our last two uncovered learning objectives in this project, where we will first populate our database and then run some queries on it to test it out. - -The way that we actually go about our insertion is a bit open-ended. While we do all have a common goal of filling our database with out dataset sample, there are many ways to approach this in Python. In the second half of this project, we will be talking briefly about time complexity, which is an extremely important concept in computer science that will help us optimize our code before we move into the next project and work with 'parallel processing'. - -In the next project, we will run some more experiments that will time insertion and then project the time it would take to insert all of the data in order to gauge the effectiveness of our data ingestion methods. Finally, we will adjust some database settings to create a final product with polish that we can feel good about. - -=== Question 1 (1 pt) -[upperalpha] -.. Write a function, `scrape_image_from_url`, that takes an image URL and returns a bytes object of the image. -.. Run the code snippet provided to test your function, and recieve the "Correct Output" message. 
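[TIP]
====
If you are stuck on where to begin with `scrape_image_from_url`, here is a minimal sketch of the general idea: perform an HTTP GET for the image URL and return the raw response body as bytes. It assumes the `requests` package (which the test snippet further down also imports); feel free to adapt it or add more robust error handling.

[source,python]
----
import requests

def scrape_image_from_url(url: str) -> bytes:
    # fetch the image and return the raw bytes of the response body
    response = requests.get(url)
    response.raise_for_status()  # raise an error on a bad HTTP status code
    return response.content
----
====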
- -[NOTE] -==== -Before we start, make sure you have the 'Goodreads_samples' directory that we created in Project 2 (and created again in Project 3) in your Project 4 directory. You can just copy it over from the previous project using `cp`. If you were unable to do the previous two projects for any reason, the code used to generate the Goodreads Samples is included in the introduction section of Project 3 (the code snippet that starts with `rm -rf`). Make sure you understand what the code is doing before you run it (this is always good practice). -==== - -Let's start by copying over our database file from the previous project. If you were following our instructions about naming, it should be called `project03.db`. You can use some `bash` like below in order to do so, and you can rerun this code as many times as you need to in order to get a fresh start. - -[source,ipython] ----- -%%bash - -rm $HOME/project04.db # removes the file if it exists -cp /anvil/projects/tdm/data/goodreads/project03.db $HOME/project04.db # copies our project 3 database to our project 4 directory ----- - -Let's get started with our data ingestion/insertion. We will split this over a few different questions where we test our ability to ingest different types of data, and then wrap it all into one big ingestion/insertion function later in the project. This is good practice whenever developing, and we at The Data Mine strongly recommend this sort of iterative testing as you continue to grow and develop in your career. It will save you lots of time! - -Firstly, we should be able to fully recover all the `book_cover` images from our database alone. This means we'll need to handle scraping the image from the `image_url` in our JSON file and converting the image to `bytes` before inserting into the database. Take a look at https://the-examples-book.com/projects/project-archive/30100-2022-project04#question-2[this question] from TDM 30100, and write a function that, given an image url, returns the image as a bytes object. - -Verify that your function works by running the following code snippet: - -[source,ipython] ----- -import shutil -import requests -import os -import uuid -import hashlib - -url = 'https://images.gr-assets.com/books/1310220028m/5333265.jpg' -my_bytes = scrape_image_from_url(url) -m = hashlib.sha256() -m.update(my_bytes) -out = m.hexdigest() -correct = 'ca2d4506088796d401f0ba0a72dda441bf63ca6cc1370d0d2d1d2ab949b00d02' - -if m.hexdigest() == correct: - print("Correct Output") -else: - print("Incorrect Output:\n") - print(f"Expected: {correct}") - print(f"Recieved: {out}") ----- - -.Items to submit -==== -- Function `scrape_image_from_url` that returns bytes object of given image URL. -- Output of running the testing code snippet (should be "Correct Output"). -==== - -=== Question 2 (2 pts) -[upperalpha] -.. 4 functions to insert a book, author, series, or review into our database from the appropriate JSON file. -.. Print the head of your four 'main' tables to validate your functions. - -Okay, now that we've handled the main 'difficult' data we will be ingesting, we can start writing some subfunctions for each file. - -We will start with the simpler functions. For this question, write 4 functions, one for each file. They should be as follows: -**** -* `insert_books` -- takes all rows from `goodreads_books.json` and inserts it into the database. -* `insert_authors` -- takes all rows from `goodreads_book_authors.json` and inserts it into the database. 
-* `insert_seriess` -- takes all rows from `goodreads_book_series.json` and inserts it into the database. -* `insert_reviews` -- takes all rows from `goodreads_reviews_dedup.json` and inserts it into the database. -**** - -[NOTE] -==== -Do not worry about handling the _weird_ columns of our data yet (i.e. `popular_shelves`, `similar_books`, etc.). We will handle those in a later question. -==== - -If you are struggling on where to start with this question, slow down and consider things in very small steps. Our function outline should be something akin to: -**** -. Open the file. -. Iterate over each line in the file. -. Parse the line into a dictionary of values to insert. -. Insert the values into the database. -**** - -The small code snippet below should give you a small idea of how to start doing this, and https://www.sqlitetutorial.net/sqlite-python/insert/[this article] can provide more insight into how to insert the data into your database. - -[source,python] ----- -import json - -with open("/anvil/projects/tdm/data/goodreads/goodreads_books.json") as f: - for line in f: - print(line) - parsed = json.loads(line) - print(f"{parsed['isbn']=}") - print(f"{parsed['num_pages']=}") - break ----- - -You might be wondering why we want your functions to work line-by-line. This is because if we want to break out dataset into chunks and _parallelize_ our ingestion, this approach makes it much easier to do. We will not be covering paralel processing in this project, but the next project will have a huge focus on it, so take the time to get this right this week. - -Finally, print the head of your `books`, `authors`, `series`, and `reviews` tables to make sure that your functions are working as expected. (After running a function on the first line of a file, you should see a single row in each table.) - -.Items to submit -==== -- 4 functions as described above. -- The head of your `books`, `authors`, `series`, and `reviews` tables (with at least 1 row of data in them). -==== - -=== Question 3 (2 pts) -[upperalpha] -.. Modify your `insert_book` function to insert `popular_shelves` and `similar_books` into their respective tables in our database. -.. Modify any functions necessary to update junction tables when inserting a book, author, series, or review. -.. Print the first 3 rows of each of your tables to validate your work. - -This process should be very similar to what you did in the last question, with the big exception being that now you will have to worry about inserting data into multiple tables and updating junction tables, along with iterating through lists of data to insert multiple rows into the database from one line in the file (as in the case of `similar books`). I would recommend drawing out your tables and how they connect to one another prior to trying to write code. This is a great way to visualize the problem, and is so common that most people in the industry have designated programs to create these diagrams for them (called "database viewers"). The actual diagram itself is called a "database schema diagram" or just a "schema" for short. - -Remember to post on Piazza, show up/call in to seminar or office hours, or email Dr. Ward if you are struggling with this question. We are here to help! - -.Items to submit -==== -- Modified functions to insert `popular_shelves` and `similar_books` into their respective tables, and to update junction tables when inserting a book, author, series, or review. 
-- The head of your `books`, `authors`, `series`, and `reviews` tables (with at least 3 rows of data in them). -==== - -=== Question 4 (1 pt) -[upperalpha] -.. Fully recover a `book_cover` and display it in your notebook. - -Demonstrate your database works by doing the following. - -. Fully recover a `book_cover` and display it in your notebook. -+ -[NOTE] -==== -[source,ipython] ----- -%%bash - -rm $HOME/test.db || true -sqlite3 $HOME/test.db "CREATE TABLE test ( - id INTEGER PRIMARY KEY AUTOINCREMENT, - my_blob BLOB -);" ----- - -[source,python] ----- -import shutil -import requests -import os -import uuid -import sqlite3 - -url = 'https://images.gr-assets.com/books/1310220028m/5333265.jpg' -my_bytes = scrape_image_from_url(url) - -# insert -conn = sqlite3.connect('/home/x-jaxmattfair/test.db') -cursor = conn.cursor() -query = f"INSERT INTO test (my_blob) VALUES (?);" -dat = (my_bytes,) -cursor.execute(query, dat) -conn.commit() -cursor.close() - -# retrieve -conn = sqlite3.connect('/home/x-jaxmattfair/test.db') -cursor = conn.cursor() - -query = f"SELECT * from test where id = ?;" -cursor.execute(query, (1,)) -record = cursor.fetchall() -img = record[0][1] -tmp_filename = str(uuid.uuid4()) -with open(f"{tmp_filename}.jpg", 'wb') as file: - file.write(img) - -from IPython import display -display.Image(f"{tmp_filename}.jpg") ----- -==== -+ -. Run a simple query to `SELECT` the first 5 rows of each table. -+ -[NOTE] -==== -[source,ipython] ----- -%sql sqlite:////home/my-username/my.db ----- - -[source,ipython] ----- -%%sql - -SELECT * FROM tablename LIMIT 5; ----- -==== -+ -[IMPORTANT] -==== -Make sure to replace "my-username" with your Anvil username, for example, x-jaxmattfair is mine. -==== - -.Items to submit -==== -- The printed, recovered image, and the code you used to do so, in your Jupyter notebook. -==== - -=== Submitting your Work -Nicely done, you've made it to the end of Project 4! This project was quite intensive, and we hope you learned a lot. If you have any questions or would like to learn more in-depth about topics covered in this project, please come to seminar. Dr. Ward and the TA team love talking to students, and we find that everyone learns from our shared conversations. As always, double, triple, and maybe even **quadruple** check that all your work is visible in your submission to ensure you get the full points you deserve. - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this. -==== - -.Items to submit -==== -- `firstname-lastname-project05.ipynb`. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. 
- -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project06.adoc b/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project06.adoc deleted file mode 100644 index 1c1e818ec..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project06.adoc +++ /dev/null @@ -1,188 +0,0 @@ -= TDM 40100: Project 6 -- 2023 - -**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this next series of projects, particularly the processes. - -**Context:** We are about to dive straight into a series of projects that emphasize working with images (with other fun things mixed in). We will start out with a straightforward task, that will involve lots of visual, manual analyses of images after you modify them to be easier to analyze. Then, in future projects, we will start to use computer vision to do this analysis for of us. - -**Scope:** Python, images, openCV, skimage - -.Learning Objectives -**** -- Use `numpy`, `skimage`, and `openCV` to process images. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/images/ballpit.jpg` - -== Questions - -=== Question 1 (2 pts) -[upperalpha] -.. Write code to read in the image and display it in your notebook. -.. Write code to find the shape of the image. - -Let's ease into things by first taking a look at the image we are going to analyze for this project. First, read up on https://www.geeksforgeeks.org/matplotlib-pyplot-imshow-in-python/[this] matplotlib documentation on image processing, and then write a small snippet of Python code in order to read in our image and display it in our notebook. - -[TIP] -==== -Don't forget to run the below code in order to import the proper libraries for this project. -==== - -[source,python] ----- -import cv2 -import matplotlib.pyplot as plt -import matplotlib.image as mpimg -import numpy as np ----- - -[TIP] -==== -The functions `imread` and `imshow`, from matplotlib.image and matplotlib.pyplot respectively, will be useful for this question. -==== - -If you take a look at `img`, you will find it is simply a multidimensional `numpy.ndarray`, with a somewhat strange shape. We will discuss this shape more in Question 2, but for now you can note that the first two dimensions given are the height and width of the image, in pixels. Keep the third one in mind for now, we will discuss it later. - -For the last part of this question, write some code to print the shape of the image. What are the dimensions of the image? How many pixels wide and tall is it? -.Items to submit -==== -- Code to read in and display our image. -- Code used to print shape, and height and width of our image. -==== - - -=== Question 2 (2 pts) -[upperalpha] -.. Using openCV with two different methods, grayscale the image. -.. Find the shape of the grayscale image. -.. Write one to two sentences explaining any differences between this image shape and the shape you identified in the previous question. 
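[TIP]
====
If you are unsure how to interpret "two different methods," here is a minimal sketch of two common ways to produce a grayscale image with openCV -- one reads the file as grayscale directly, the other converts a color image after reading it. Treat this as a starting point; the path and display code assume the same `ballpit.jpg` image and `matplotlib` setup used in Question 1.

[source,python]
----
import cv2
import matplotlib.pyplot as plt

IMAGE = '/anvil/projects/tdm/data/images/ballpit.jpg'

# method 1: ask imread for a grayscale image directly
gray1 = cv2.imread(IMAGE, cv2.IMREAD_GRAYSCALE)

# method 2: read the image in color (BGR order), then convert it
bgr = cv2.imread(IMAGE)
gray2 = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)

# both results have only two dimensions: (height, width)
print(gray1.shape, gray2.shape)

plt.imshow(gray1, cmap='gray')
plt.title('Grayscale (method 1)')
plt.show()
----
====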
- -Now that we are familiar with the image we are working with, let's get started modifying it with the end goal of eventually making it easy to manually count how many balls of each color are in our image. - -First off, let's convert our image to grayscale. This is a good first step when analyzing an image, as it can give you an idea of the 'black-white contrast' for an image, which is often very useful in something referred to as 'contour-edge detection'. We will learn more about contour-edge detection, how to perform it, and what it is useful for later on in this course. - -Read through https://www.geeksforgeeks.org/python-grayscaling-of-images-using-opencv/[this short article] on how to do grayscaling of images with openCV. Then, using two different methods, convert the image to grayscale. Note that both of the methods of question are contained in the article provided. - -[TIP] -==== -The functions `imread` and `cvtColor` from openCV will be useful for this question, with the latter in conjunction with the `cv2.COLOR_BGR2GRAY` constant. -==== - -Once you've done this, print the image along with the shape of the image. How does this shape differ from the shape of the original image? What do you think the dimensions of the grayscale image represent? - - -.Items to submit -==== -- Code to grayscale the image, two different ways -- Printed image, grayscaled, -- Shape of grayscaled image, and explanation of what the dimensions represent. -==== - -=== Question 3 (2 pts) -[upperalpha] -.. Code to split the image into red, green, and blue color channels. -.. Code to display each channel (should be grayscale). -.. 1-2 sentences about our grayscale images and their usefulness in determining colors. - -While we are on the topic of color, let's take a look at the color channels of our image and how we can best analyze them individually. After all, outside of edge detection, you will likely want to talk about the different colors present in images you are analyzing. - -Read through https://www.geeksforgeeks.org/python-splitting-color-channels-opencv/[this short article] on how to split an image into its color channels with openCV. Then, write some code to split our image into its red, green, and blue color channels. Then, display each of the channels individually. You should see three grayscale images, each with slight but clearly noticeable differences from the others. - -From these images, do you think that it would be possible to determine which color ball is most common? Write a sentence or two discuss why or why not. - -.Items to submit -==== -- Code to split the image into its RGB color channels, and display each channel. -- 1-2 sentences about our grayscale images and their usefulness in determining colors. -==== - -=== Question 4 (2 pts) -[upperalpha] -.. Code to recolor our images into their respective colors. -.. Code to display each channel (should be colored). -.. Result of running provided code snippet to create red mask. - -[NOTE] -==== -You may notice when you first attempt this question that the colors are not matching up with what you expect. This is due to a difference in formatting between openCV and matplotlib, where openCV uses BGR instead of RGB. You can fix this by using the `cv2.cvtColor` function with the `cv2.COLOR_BGR2RGB` constant, similar to how you used it to grayscale images in question 2. -==== - -Next, write some code to recolor each of the channels with its respective color, and display the colored images. 
You should see three images, each with a different color tint. Note that the colors may not be exactly what you expect, but they should be close. This can be done by creating another channel (a simple numpy array) of all zeroes, and then copying your channel into the proper dimension of the numpy array before displaying it with `imshow` as usual. - -Here is an example of how to do this with the red channel, if you're getting stuck: - -[source,python] ----- -blank = 255 * (r_c.copy() * 0) - -# r_c represents the red channel from the last question -red_image = cv2.merge([blank, blank, r_c]) -plt.imshow(plt.imshow(cv2.cvtColor(red_image, cv2.COLOR_BGR2RGB)), plt.title('Red Channel')) ----- - -Finally, run the following code after you have shown your color images. This will create something called a `color mask`, which you will find is much more useful in determing the most common color of ball in our image. - -[source,python] ----- -# Define lower and upper bounds for red color in BGR format -lower_red = np.array([100, 0, 0]) # Lower bound -upper_red = np.array([255, 100, 100]) # Upper bound - -# Create a mask for red pixels -red_mask = cv2.inRange(img, lower_red, upper_red) - -# Apply the red mask to the original image -red_pixels = cv2.bitwise_and(img, img, mask=red_mask) - -plt.figure(figsize=(12, 4)) # Create a larger figure for better visualization -plt.subplot(131), plt.imshow(red_pixels), plt.title('Red Masked') -plt.subplot(132), plt.imshow(img), plt.title('Original Image') ----- - -.Items to submit -==== -- Code to recolor each channel, and display each channel. -- 1-2 sentences about our colored images, their usefulness/shortcomings in analyzing color, and how they could be improved upon. -==== - - -=== Question 5 (2 pts) -[upperalpha] -.. Code to create a color mask for green and blue color masks. -.. Code to display each color mask. - -As a wrap-up for this project, create your own color masks for green and blue. In this project, we won't worry about creating a mask for the less 'basic' colors in our image, as this will be more of a focus in the next project, but please feel free to experiment on your own. Additionally, note that the color limits you choose for your masks may not be perfect, but you should be able to get a good grasp for the relative presence of each color of ball in our image. Do your best to create realistic bounds, and be sure to print your final masks in your Jupyter notebook. - -.Items to submit -==== -- Code to create a color mask for green and blue masks. -- Code to display each color mask. -==== - - -=== Submitting your Work -Nicely done, you've made it to the end of Project 6! This is likely a very new topic for many of you, so please take the time to get things right now and learn all of the core concepts before we move on to more advanced topics in the next project. Unlike for most of your other projects, it is actually okay if you get the 'File to large to display' error in Gradescope. We will be excusing it for this project due to the nature of wanting to display a lot of images in our notebook. Just make sure that if you redownload your .ipynb file from Gradescope, it contains everything you expect it to. - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. 
See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this. -==== - -.Items to submit -==== -- `firstname-lastname-project06.ipynb`. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project07.adoc b/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project07.adoc deleted file mode 100644 index a8633b84b..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project07.adoc +++ /dev/null @@ -1,116 +0,0 @@ -= TDM 40100: Project 7 -- 2023 -:page-mathjax: true - -**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this series of projects. - -**Context:** In the previous project, we learned to manipulate image's basic factors by functions from the openCV `cv2` module. In this project, we will understand key image features, detect color dominance, and perform enhancing the image's visual quality by histogram equalization technique - -**Scope:** Python, images, openCV, Histogram equalization - -.Learning Objectives -**** -- Process images using `numpy`, `matplotlib`, and `openCV`. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/images/ballpit.jpg` - - -== Questions - -=== Question 1 (2 pts) - -[loweralpha] - -.. Let's work with our `ballpit.jpg` again. In project 06, we split the image into its color channels (red, green and blue). With outputs for its color channels, please find out the average values of intensity for each channel -.. Display the average values for each channel with a bar chart. Briefly explain what is your finding from the bar chart - -[NOTE] -==== -* The average pixel values for the 3 channels can show the whole brightness of the image, reveal which color is dominant in the image, as well as image temperature - warm (reddish), cool(blueish) -* The average values of intensity of an image is calculated by summing up the intensity values of all pixels and dividing by the total number of pixels. Intensity is the value of a pixel. For a grayscaled image, the intensity has value from Black to white; for a color image in RGB, and each pixel has 3 intensity values, for R,G and B respectively. -==== -[TIP] -==== -* The average value can be calculated using numpy `mean()` -==== - -=== Question 2 (2 pts) - -.. In project 06, you created a red mask for red pixels and applied the red mask to the original image. Please create another 2 masks for green and blue channels. -.. 
Please identify how many pixels in the image are red, green and blue (respectively), and visualize the number of pixels for the 3 channels using a combined Histogram. Briefly explain what you found from the diagrams. - -[NOTE] -==== -A combined histogram here means a chart with 3 bars for the 3 channels respectively. The x-axis is the 3 channels, and the y-axis is the number of pixels for each channel. -==== - -[NOTE] -==== -* The summaries for each channel state the number of pixels for each color. So if `blue` has largest number, we can say blue is the dominant color of the image. -* To define lower and upper bounds for 3 colors depends on your personal judgement. You may need to adjust those thresholds value according to the different images and different purpose of your task. -==== -[TIP] -==== -`numpy sum()` can be used to summarize pixels -==== - -=== Question 3 (2 pts) - -[loweralpha] -.. Write a function called `equal_histogram_gray` using the histogram equalized technique: -... The function will accomplish a way to enhance image area that is too dark or light by adjusting the intensity values; it will only consider intensity but not any color information. -... The input argument to the function is an image -... The function returns a tuple of two images: one is the grayscaled image, and the other is a histogram-equalized grayscaled image - -.. Run the function with "ballpit.jpg" as input. Visualize the 2 output images aligning with the original "ballpit.jpg" using a Histogram chart - -[NOTE] -==== -`Histogram equalization` is a technique in `digital image processing`. It is a process where the intensity values of an image are adjusted to create a higher overall contrast. -`Digital Image Processing` is a significant aspect of data science. It is used to enhance and modify images so that their attributes are more easily understand. - -You may refer to more information about `Histogram Equalization` from the following website -https://www.educative.io/answers/what-is-histogram-equalization-in-python - -==== -[TIP] -==== -* The following 2 ways can be used to convert the image "ballpit.jpg" to grayscaled image -[source,python] -import cv2 -import matplotlib.image as mpimg -IMAGE = '/anvil/projects/tdm/data/images/ballpit.jpg' -img = mpimg.imread(IMAGE) -gray_img1 = cv2.imread(IMAGE, 0) -gray_img2= cv2.cvtColor(img.copy(), cv2.COLOR_BGR2GRAY) - -* The `cv2.equalizeHist()` function will be useful to solve the question. -==== - -=== Question 4 (2 pts) - -[loweralpha] -.. Process one of your favorite photos with the function `equal_histogram_gray`. Write 1-2 sentences about your input and output. Make sure to show the result of the images. - -Feel free to use `/anvil/projects/tdm/data/images/coke.jpg` -- the results are pretty neat! - - -Project 07 Assignment Checklist -==== -* Jupyter Lab notebook with your codes, comments and outputs for the assignment - ** `firstname-lastname-project07.ipynb`. - -* Submit files through Gradescope -==== -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project08.adoc b/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project08.adoc deleted file mode 100644 index fdd8e2f1c..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project08.adoc +++ /dev/null @@ -1,111 +0,0 @@ -= TDM 40100: Project 8 -- 2023 - -**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this series of projects. - -**Context:** In the previous projects, we worked with images and implemented image Histogram Equalization, with some pretty cool results! In this project, we will continue to work with images key features, introduce YCbCr color space, and perform enhancing the image's visual quality by histogram equalization technique with colors - -**Scope:** Python, images, openCV, Histogram equalization, YCbCr, image digital fingerprint - -.Learning Objectives -**** -- - Process images using `numpy`, `matplotlib`, and `openCV`, -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/images/ballpit.jpg` - -[NOTE] -==== -As in our previous projects, by default, a image is read in as a `RGB` image, where each pixel is represented as a value between 0 and 255, R represents "red", G represents "green", and B represents "blue". While it is natural for display with RGB, image with `YCbCr` format has advantages in many image processing, compression situations etc. - -`YCbCr` is a color space used in image processing. Y stands for "Luminance", Cb stands for "Chrominance Blue", Cr stands for "Chrominance red". You may get more information for `YCbCr` from https://en.wikipedia.org/wiki/YCbCr[YCbCr] - -`YCbCr` can be derived from the RGB color space. There are several Python libraries can be used to do the conversion, in this project we will use cv2 from OpenCV -[source, python] -import cv2 -rgb_img=cv2.imread(('/anvil/projects/tdm/data/images/ballpit.jpg')) -ycbcr_img = cv2.cvtColor(img,cv2.COLOR_BGR2YCrCb) -==== - -== Questions - -=== Question 1 (2 pts) - -[loweralpha] -.. Please split `/anvil/projects/tdm/data/images/ballpit.jpg` into its `YCbCr`components and display them - - -[TIP] -==== -To display the YCbCr Y component, you will need to set the Cb and Cr components to 127. To display the Cb component, you will need to set the Cr and Y components to 127, etc. -==== - -[NOTE] -==== -The human eye is more sensitive to luminance than to color. As you can tell from the previous question, the Y component captures the luminance, and contains the majority of the image detail that is so important to our vision. The other Cb and Cr components are essentially just color components, and our eyes aren't as sensitive to changes in those components. -Luminance shows the brightness of an image. An RGB image can be converted to a YCbCr image. The histogram equalization then can apply to the luminance without impacting the color channels (Cb and Cr channels), which, if histogram equalization directly applies to an RGB image, it may cause image artifacts issues. "Artifacts issues" refers to unwanted distortion in an image. -Let's process some images in the following questions to makes this explicitly clear -==== - -=== Question 2 (2 pts) - -[loweralpha] -.. 
Please write a function named `equal_hist_rgb` to do Histogram Equalization directly to an image with RGB format. The parameter will be an image. The returns will be a Histogram Equalized colored image. Run the function with input `ballpit.jpg`. Show the output Histogram Equalized colored image. - - -=== Question 3 (2 pts) -[loweralpha] - -.. Please write a function named `equal_hist_YCrCb` that applies Histogram Equalization to an image, so that first the image will be converted from RGB format to YCrCb format, then apply Histogram Equalization. The parameter will be an image. The returns will be a Histogram Equalized colored image. Run the function with image `ballpit.jpg`. Show the output Histogram Equalized colored image. - -[TIP] -==== -We can read a 3-chanel RGB image by both `openCV cv2` and `matplotlib.image`. However, please do notice the output for cv2 is in BGR order but for matplotlib.image is in RGB order. - -`cv2.split()` will be useful to split the image to 3 channels -`cv2.equalizeHist()` will be useful to do histogram equalization. -`cv2.merge()` will be useful to combine all channels back to an equalized image - -==== -=== Question 4 (1 pt) - -[loweralpha] -.. Please plot the original image of `ballpit.jpg`, output images of it from question 2 and question 3 as a combined chart. What is your conclusion? - -=== Question 5 (1 pt) - -[loweralpha] -.. Please choose one of your favorite image as input to the two functions, display the original image and 2 output images in a combined histogram chart and state your finding -.. Please provide the digital fingerprints for all 3 images original one, two output images from two functions using `hashlib` library - -[TIP] -==== -Just like human has unique fingerprint. Every image has a unique -SHA-256 hash value. Even a tiny change of a pixel can cause a totally different SHA-256 hash value for the image. You may use a SHA-256 Hash value as the digital fingerprint for a image, for example - -[source,python] -import hashlib -with open("img_name","rb") as f: - img_bytes = f.read() -hash_o=hashlib.sha256() -fingerP = hash_o.update(img_bytes) -==== - -Project 08 Assignment Checklist -==== -* Jupyter Lab notebook with your codes, comments and outputs for the assignment - ** `firstname-lastname-project08.ipynb`. - -* Submit files through Gradescope -==== -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project09.adoc b/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project09.adoc deleted file mode 100644 index e5457d5b9..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project09.adoc +++ /dev/null @@ -1,159 +0,0 @@ -= TDM 40100: Project 9 -- 2023 - -**Motivation:** Images are everywhere, and images are data! We will take some time to dig more into working with images as data in this project. 
- -**Context:** In the previous project, we were able to isolate and display the Y, Cb and Cr channels of our `ballpit.jpg` image, and we applied an image histogram equalization technique to Y and then merged 3 components, to an equalized image. We understood the structure of an image and how the image's luminance (Y) and chrominance (Cb and Cr) contributed to the whole image. The human eye is more sensitive to the Y Channel than color channels Cb & Cr. In this project, we will continue to work with 'YCbCr` images as we delve into some image compression techniques, we will implement a variation of jpeg image compression! - -**Scope:** Python, images, openCV, YCbCr, downsampling, discrete cosine transform, quantization - -.Learning Objectives -**** -- Be able to process images compression utilizing using `openCV` -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/images/ballpit.jpg` - -== Questions - -[NOTE] -==== -Some helpful links that are really useful. - -- https://en.wikipedia.org/wiki/JPEG -- https://en.wikipedia.org/wiki/Quantization_(image_processing) -- https://home.cse.ust.hk/faculty/golin/COMP271Sp03/Notes/MyL17.pdf (if you are interested in Huffman coding) - -JPEG is a _lossy_ compression format and an example of transform based compression. Lossy compression means that you can't retrieve the information that was lost during the compression process. In a nutshell, these methods use statistics to identify and discard redundant data. -==== - -[NOTE] -==== -Since the human eye is more sensitive to the Y Channel than color channels, we can reduce the resolution of the color components to achieve image compression. -we will first need to import some libraries -[source,python] -import cv2 -import numpy as np -import matplotlib.pyplot as plt - -To read the image, we will use openCV `cv2` -[source, python] -ballpit_bgr= cv2.imread('/anvil/projects/tdm/data/images/ballpit.jpg') - -Then convert the image from default rgb format to YCrCb format -[source,python] -ballpit_ycrcb = cv2.cvtColor(ballpit_bgr,cv2.COLOR_BGR2YCrCb) -==== -=== Question 1 (2 pts) -[loweralpha] - -First we will use a downsample technique, to compress an image by reducing the resolution of the color channels. It will return a YCrCb image with lower resolution. - -The following statement downsamples the image Cr channel to half (0.5) by using `cv2.resize` -[source,python] -ballpit_reduce = cv2.resize(ballpit_ycrcb[:,:,1],(0,0),fx=0.5,fy=0.5) - -Then we will need to use `cv2.resize()` to upsample the resolution reduced image to the original size by using the original image size's tuple -[source, python] -cv2.resize(ballpit_reduce,(ballpit_ycrcb.shape[1],ballpit_ycrcb.shape[0])) - -.. Please write a function named compress_downsample, it will take 3 arguments, a `jpg` file, a float number (fx) for the width downsampling factor; a float number (fy) for the Height downsampling factor. The returns will be a compressed ( downsampled ) image -.. Visualize the compressed image aligned with original image -.. 
Calculate the compression ratio - -[TIP] -You may use `cv2.imwrite` to save the compressed image to a file, get the size of it and divide by size of original image file - -=== Question 2 (1 pt) - -Second let's look into the discrete cosine transform technique -[NOTE] -Per https://www.mathworks.com/help/images/discrete-cosine-transform.html[MathWorks], the discrete cosine transform has the property that visually significant information about an image is concentrated in just a few coefficients of the resulting signal data. Meaning, if we are able to capture the majority of the visually-important data from just a few coefficients, there is a lot of opportunity to _reduce_ the amount of data we need to keep. So DCT is a technique allow the important parts of an image separated from the unimportant ones. - -E.g. -We will need to split the previous created `ballpit_ycrcb` into 3 Channels -[source,python] -y_c, cr_c,cb_c = cv2.split(ballpit_ycrcb) - -Next, apply 2D DCT to each channel by `cv2.dct` -[source,python] -y_c_dct = cv2.dct(y_c.astype(np.float32)) -cr_c_dct = cv2.dct(cr_c.astype(np.float32)) -cb_c_dct = cv2.dct(cb_c.astype(np.float32)) - -.. Please find the dimension for the output DCT blocks -.. Please print a 8*8 DCT blocks for each channel separately - -[TIP] -==== -* `.astype` is a method to convert numpy array to a certain data type. -* `np.flfoat32` is a data type of 32-bit floating point numbers array -* `shape` will be useful for the block dimensions -==== - -=== Question 3 (2 pts) - -Now let us try to visualize the output of DCT compression. One common way to do it will be to set value zero to some of the DCT coefficients, such as high-frequency ones at right or downward in the DCT output matrix, for example if we only want to keep top-left of 50*50 block of coefficients. We can set the value to zero to all other areas. For example, for the Y channel, -[source, python] -cut_v = 50 -y_c_dct[cut_v:,:]=0 -y_c_dct[:,cut_v:]=0 - -After updating the DCT coefficients, we can do inverse DCT on each channel to change back to its pixel intensities from its frequency representation, for example for Y channel -[source, python] -y_rec = cv2.idct(y_c_dct.astype(np.float32)) - -.. Please create a function named `compress_DCT` to implement image compression with DCT. The arguments are a jpg image, and a number for the coefficient area you would like to keep (we only need to consider same size for horizontal and vertical directions) -.. Visualize the DCT compressed image for ballpit.jpg align with the original one -.. Calculate the compression ratio - -=== Question 4 (2 pts) - -Next, let us try a quantization technique. Quantization reduces the precision of the DCT coefficients based on human perceptual characteristics. This introduces data loss, but reduces image size greatly. You can read more about quantization https://en.wikipedia.org/wiki/Quantization_(image_processing)[here]. Apparently, the human brain is not very good at distinguishing changes in high frequency parts of our data, but good at distinguishing low frequency changes. - -We can use a quantization matrix to filter out the higher frequency data and maintain the lower frequency data. One of the more common quantization matrix is the following. 
- -[source,python] ----- -q1 = np.array([[16,11,10,16,24,40,51,61], - [12,12,14,19,26,28,60,55], - [14,13,16,24,40,57,69,56], - [14,17,22,29,51,87,80,62], - [18,22,37,56,68,109,103,77], - [24,35,55,64,81,104,113,92], - [49,64,78,87,103,121,120,101], - [72,92,95,98,112,100,103,99]]) - ----- -We can quantize the DCT coefficients by dividing the value from quantization matrix and rounding to integer. For example for Y channel - -[source,python] -np.round(y_c_dct/q1) - -.. Please create a function called `compress_quant` that will use the function from question 3, select a 8*8 block and quantize the DCT coefficients before we do DCT inversion -.. Run the function with image ballpit.jpg, visualize the output compressed image align with original one -.. Calculate the compression ratio - -=== Question 5 (1 pt) - -.. Choose one of your favorite image as input and put all the steps together -.. Visualize the output aligned with the original image, and describe your findings briefly - -Project 09 Assignment Checklist -==== -* Jupyter Lab notebook with your codes, comments and outputs for the assignment - ** `firstname-lastname-project09.ipynb`. - -* Submit files through Gradescope -==== -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project10.adoc deleted file mode 100644 index 30be81fbf..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project10.adoc +++ /dev/null @@ -1,174 +0,0 @@ -= TDM 40100: Project 10 -- 2023 - -**Motivation:** In general, scraping data from websites has always been a popular topic in The Data Mine. In addition, it was one of the requested topics. For the remaining projects, we will be doing some scraping of housing data, and potentially: `sqlite3`, containerization, and analysis work as well. - -**Context:** This is the first in a series of web scraping projects with a focus on web scraping that incorporates of variety of skills we've touched on in previous Data Mine courses. For this first project, we will start slow with a `selenium` review with a small scraping challenge. - -**Scope:** selenium, Python, web scraping - -.Learning Objectives -**** -- Use selenium to interact with a web page prior to scraping. -- Use selenium and xpath expressions to efficiently scrape targeted data. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 (2 pts) -[loweralpha] -The following code provides you with both a template for configuring a Firefox browser selenium driver that will work on Anvil, as well as a straightforward example that demonstrates how to search web pages and elements using xpath expressions, and simulate mouse clicks. Take a moment, run the code, and refresh your understanding. 
- -[source,python] ---- -import time -from selenium import webdriver -from selenium.webdriver.firefox.options import Options -from selenium.webdriver.common.desired_capabilities import DesiredCapabilities -from selenium.webdriver.common.keys import Keys ---- - -[source,python] ---- -firefox_options = Options() -firefox_options.add_argument("--window-size=810,1080") -# Headless mode means no GUI -firefox_options.add_argument("--headless") -firefox_options.add_argument("--disable-extensions") -firefox_options.add_argument("--no-sandbox") -firefox_options.add_argument("--disable-dev-shm-usage") - -driver = webdriver.Firefox(options=firefox_options) ---- - -[source,python] ---- -# navigate to the webpage -driver.get("https://books.toscrape.com") - -# full page source -print(driver.page_source) - -# get html element -e = driver.find_element("xpath", "//html") - -# print html element -print(e.get_attribute("outerHTML")) - -# find the 'Music' link on the homepage -link = e.find_element("xpath", "//a[contains(text(),'Music')]") -# click the link -link.click() -# We can delay the program to allow the page to load -time.sleep(5) -# get new root HTML element -e = driver.find_element("xpath",".//html") -# print html element -print(e.get_attribute("outerHTML")) ---- - -.. Please use `selenium` to get and display the first book's title and price on the Music books page -.. On the same page, find the book titled "How Music Works", `click` the book's link, and then scrape and print the book's information: product description, UPC, and availability - -Take a look at the page source -- do you think clicking the book link was needed in order to scrape that data? Why or why not? - -[NOTE] -==== -You may get more information about `xpath` here: https://www.w3schools.com/xml/xpath_intro.asp[xpath] -==== - - -=== Question 2 (4 pts) - -Now, let us look into a popular housing market website. https://zillow.com has extremely rich data on homes for sale, for rent, and lots of land. - -Click around and explore the website a little bit. Note the following. - -. Homes are typically listed on the right-hand side of the web page in a 21x2 set of "cards", for a total of 40 homes. -+ -[NOTE] -==== -At least in my experimentation -- the last row only held 1 card and there was 1 advertisement card, which I consider spam. -==== -. If you want to search for homes for sale, you can use the following link: `https://www.zillow.com/homes/for_sale/{search_term}_rb/`, where `search_term` could be any hyphen separated set of phrases. For example, to search Lafayette, IN, you could use: `https://www.zillow.com/homes/for_sale/lafayette-in_rb` -. If you want to search for homes for rent, you can use the following link: `https://www.zillow.com/homes/for_rent/{search_term}_rb/`, where `search_term` could be any hyphen separated set of phrases. For example, to search Lafayette, IN, you could use: `https://www.zillow.com/homes/for_rent/lafayette-in_rb` -. If you load, for example, https://www.zillow.com/homes/for_rent/lafayette-in_rb and rapidly scroll down the right side of the screen where the "cards" are shown, it will take a fraction of a second for some of the cards to load. In fact, unless you scroll, those cards will not load, and if you were to parse the page contents, you would not find all 40 cards loaded. This general strategy of loading content as the user scrolls is called lazy loading (see the short sketch below).
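To make the lazy-loading behavior concrete, here is a minimal sketch of the scroll-until-nothing-new-loads idea. The `card_xpath` default below is a hypothetical placeholder (your real card selector will differ), and `driver` is assumed to be a configured selenium driver like the one from question 1.

[source,python]
----
import time

def load_all_cards(driver, card_xpath="//article"):
    # card_xpath is a hypothetical placeholder -- replace it with the real card selector
    cards = driver.find_elements("xpath", card_xpath)
    while cards:
        # scroll the last card currently in the DOM into view so more cards can lazy load
        driver.execute_script("arguments[0].scrollIntoView();", cards[-1])
        time.sleep(5)
        new_cards = driver.find_elements("xpath", card_xpath)
        if len(new_cards) == len(cards):
            break  # nothing new loaded, so we have everything
        cards = new_cards
    return cards
----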
- -.. Write a function called `get_properties_info` that, given a `search_term` (zipcode), will return a list of property information including zpid, price, number of bedrooms, number of bathrooms, and square footage (sqft). The function should not only get all of the cards on a page, but also cycle through all of the pages of homes for the query. - -[TIP] -==== -The following was a good query that had only 2 pages of results. - -[source,python] ---- -properties_info = get_properties_info("47933") ---- -==== - -[TIP] -==== -You _may_ want to include an internal helper function called `_load_cards` that accepts the driver and scrolls through the page slowly in order to load all of the cards. - -https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python[This] link will help! Conceptually, here is what we did. - -. Get the initial set of cards using xpath expressions. -. Use `driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1])` to scroll to the last card that was found in the DOM. -. Find the cards again (now that more may have loaded after scrolling). -. If no more cards were loaded, exit. -. Update the number of cards we've loaded and repeat. -==== - -[TIP] -==== -Sleep 5 seconds using `time.sleep(5)` between every scroll or link click. -==== - -[TIP] -==== -After getting the information for each page, use `driver.delete_all_cookies()` to clear off cookies and help avoid captchas. -==== - -[TIP] -==== -Instead of using the link from the "next page" button to load the next page, use `next_page.click()` to click on the link. Otherwise, you may get a captcha. -==== - -[TIP] -==== -Use something like: - -[source,python] ---- -with driver as d: - d.get(blah) ---- - -This way, after exiting the `with` scope, the driver will be properly closed and quit, which will decrease the likelihood of you getting captchas. -==== - -[TIP] -==== -For our solution, we had a `while True:` loop in the `_load_cards` function and in the `get_properties_info` function, and used the `break` command in an if statement to exit. -==== - -=== Question 3 (2 pts) - -.. Please create a visualization based on the data from the previous question. Select any data points you find compelling and choose an appropriate chart type for representation. Provide a brief explanation of your choices. - -Project 10 Assignment Checklist -==== -* Jupyter Lab notebook with your codes, comments and outputs for the assignment - ** `firstname-lastname-project10.ipynb`. - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project11.adoc deleted file mode 100644 index e7d8ad307..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project11.adoc +++ /dev/null @@ -1,164 +0,0 @@ -= TDM 40100: Project 11 -- 2023 - -**Motivation:** In general, scraping data from websites has always been a popular topic in The Data Mine. In addition, it was one of the requested topics.
We will continue to use "books.toscrape.com" to practice scraping skills. - -**Context:** This is the second project focusing on web scraping, combined with the BeautifulSoup library. - -**Scope:** Python, web scraping, selenium, BeautifulSoup - -.Learning Objectives -**** -- Use Selenium and XPath expressions to efficiently scrape targeted data. -- Use BeautifulSoup to scrape data from web pages. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - - -In the previous project, you learned how to get the 'Music' category link on the "books.toscrape.com" webpage, and how to use `Selenium` to scrape books' information. The following is sample code for the solution to question 1 of the previous project. - -[source,python] ---- -import time -from selenium import webdriver -from selenium.webdriver.firefox.options import Options -from selenium.webdriver.common.by import By - -firefox_options = Options() -firefox_options.add_argument("--window-size=810,1080") -# Headless mode means no GUI -firefox_options.add_argument("--headless") -firefox_options.add_argument("--disable-extensions") -firefox_options.add_argument("--no-sandbox") -firefox_options.add_argument("--disable-dev-shm-usage") - -driver = webdriver.Firefox(options=firefox_options) - -driver.get("https://books.toscrape.com") -e_t = driver.find_element("xpath",'//article[@class="product_pod"]/h3/a') -e_p = driver.find_element("xpath",'//p[@class="price_color"]') -fst_b_t = e_t.text -fst_b_p = e_p.text - -# find the book entitled "How Music Works" -book_link = driver.find_element(By.LINK_TEXT, "How Music Works") -book_link.click() -time.sleep(5) - -# scrape and print book information: product description, UPC and availability -product_desc = driver.find_element(By.CSS_SELECTOR,'meta[name="description"]').get_attribute('content') -product_desc -table = driver.find_element(By.XPATH, "//table[@class='table table-striped']") -upc = table.find_element(By.XPATH, ".//th[text()='UPC']/following-sibling::td[1]") -upc_value = upc.text -upc_value - -availability = table.find_element(By.XPATH, ".//th[text()='Availability']/following-sibling::td[1]") -availability_value = availability.text -availability_value -driver.quit() ---- -[NOTE] -In this project, we will include BeautifulSoup in our web scraping journey. BeautifulSoup is a Python library that you can use to extract data from HTML or XML files. You may find more information about BeautifulSoup here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ - -== Questions - -=== Question 1 (2 pts) - -.. Please create a function called "get_category" to extract all of the category names on the website. The function does not need any arguments. The function returns a list of category names. (A sketch combining the tips is shown after them.) - -[TIP] -==== -* Use BeautifulSoup for this question -[source,python] -from bs4 import BeautifulSoup -==== -[TIP] -==== -* You can parse the page with BeautifulSoup -[source,python] -bs = BeautifulSoup(driver.page_source,'html.parser') -==== -[TIP] -==== -* Review the page source of the website's homepage, including the categories located in the sidebar. The BeautifulSoup "select" method is useful to get the names, like this: - -[source,python] -categories = [c.text.strip() for c in bs.select('.nav-list li a')] -====
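Putting the three tips above together, one possible shape for `get_category` is sketched below. This is only one way to do it, and it assumes the Firefox `driver` has already been created as in the sample code at the top of this project.

[source,python]
----
from bs4 import BeautifulSoup

def get_category():
    # load the homepage and parse the rendered page source with BeautifulSoup
    driver.get("https://books.toscrape.com")
    bs = BeautifulSoup(driver.page_source, "html.parser")
    # the sidebar category links live inside the ".nav-list" element
    return [c.text.strip() for c in bs.select(".nav-list li a")]
----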
- -=== Question 2 (2 pts) - -.. Please create a function called "get_all_books" to get the first page of books for a given category name from question 1. Use "Music_14" to test the function. The argument is a category name. The function returns a list of book objects with book title, book price, and book availability from the first webpage. - -[TIP] -==== -* Review the page source; you may find that one "article" tag holds one book's information. You may use find_all to find all "article" tags, like - -[source, python] -articles=bs.find_all("article",class_="product_pod") -==== - -[TIP] -==== -* You may create an object to hold the book information, like: -[source,python] -book = { - "title":title, - "price":price, - "availability":availability -} -==== - -[TIP] -==== -* You may use a loop to go through the books, like -[source,python] -for article in articles: - title = article.h3.a.attrs['title'] - price = article.find('p',class_='price_color').text - availability = article.find('p',class_='instock availability').text -# create a book object with the extracted information - .... -==== -[TIP] -==== -* You may need a list to hold all book objects, and add all books to it, like -[source,python] -all_books=[] -... -all_books.append(book) -==== -[NOTE] -==== -* You may solve the question in different ways, for example by using the function "map", etc. -==== - -=== Question 3 (2 pts) - -You may have noticed that some categories like "fantasy_19" have more than one page of books. - -.. Please update the function "get_all_books" from question 2 so that the function can be used to get all books, even if there are multiple pages for the category. - -[TIP] -==== -* Look for the pagination link "next" -==== - -=== Question 4 (2 pts) - -.. Look through the website "books.toscrape.com", pick anything that interests you, and scrape and display that data. - -Project 11 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project11.ipynb` -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project12.adoc deleted file mode 100644 index 4b3a363b6..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project12.adoc +++ /dev/null @@ -1,184 +0,0 @@ -= TDM 40100: Project 12 -- 2023 - -**Motivation:** In general, scraping data from websites has always been a popular topic in The Data Mine. In addition, it was one of the requested topics. We will continue to use https://books.toscrape.com to practice scraping skills, visualize scraped data, and use `sqlite3` to save scraped data to a database. - -**Context:** This is the third project focusing on web scraping, combined with sqlite3. - -**Scope:** Python, web scraping, selenium, BeautifulSoup, sqlite3 - -.Learning Objectives -**** -- Visualize scraped data. -- Create tables for scraped data. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
- -== Questions - -=== Question 1 (2 pts) - -In the previous project, you were able to scrape data from https://books.toscrape.com - -Now let's visualize your scraped data. - -.. Please visualize the prices of the books in the Music category with a bar plot. Split the prices into three price ranges: below 20, 20-30, and above 30 - -[TIP] -==== -You may need to change the price to a float, like -[source, python] -prices = [float(book['price'].replace('£','')) for book in books] - -books is the book list from the previous project's function "get_all_books", like this: - -books = get_all_books("Music_14") -==== -[TIP] -==== -You may use sum to group the prices, like -[source,python] -price_less_20 = sum(1 for price in prices if price<20) -price_20_30 = sum(1 for price in prices if 20<=price<30) -... -==== -[TIP] -==== -You may use a bar chart, like -[source,python] -price_counts = [price_less_20, price_20_30,price_above_30] -labels = ["below 20","20-30","above 30"] -plt.bar(labels,price_counts,color=['purple','orange','green']) -# More plt settings and display statements -==== - -=== Question 2 (2 pts) - -.. Write `CREATE TABLE` statements to create 2 tables, namely, a `categories` table and a `books` table. - -[TIP] -==== -Check the website for category information. The categories table may contain the following fields -- 'id', a unique identifier for each category, auto increment -- 'category', like 'poetry_23' - -==== -[TIP] -==== -Check the website for book information. The "books" table may contain the following fields -- 'id', a unique identifier for each book, auto increment -- 'title', like 'A Light in the Attic' -- 'category', like 'poetry_23' -- 'price', like 51.77 -- 'availability', like 'in stock(22 available)' - -==== - -[TIP] -==== -Use `sqlite3` to create the tables in a database called `$HOME/onlinebooks.db`. You can do all of this from within Jupyter Lab. - -[source,python] ---- -%sql sqlite:///$HOME/onlinebooks.db ---- - -[source,python] ---- -%%sql - -CREATE TABLE ... ---- - -Run the following queries to confirm and show your table schemas. - -[source, sql] ---- -PRAGMA table_info(categories); ---- - -[source, sql] ---- -PRAGMA table_info(books); ---- -==== - - -=== Question 3 (2 pts) - -.. Update the function "get_category" from project 11. After you get the information about the categories from the website, populate the "categories" table with that data. -.. Run a couple of queries that demonstrate that the data was successfully inserted into the database. - -[TIP] -==== -Here is partial code to assist. - -[source,python] ---- -import sqlite3 -# connect to database -conn = sqlite3.connect('onlinebooks.db') -cur = conn.cursor() -for category in categories: - cur.execute('INSERT INTO CATEGORIES (CATEGORY) VALUES (?)',(category,)) -conn.commit() -conn.close() ---- -==== - -=== Question 4 (2 pts) - -.. Update the function "get_all_books" from project 11. After you get the information about the books from the website, populate the "books" table with that data. You may need to scrape additional data for a new "category" field indicating the category that each book belongs to. -.. Run a couple of queries that demonstrate that the data was successfully inserted into the database (see the example queries below).
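For the verification step, a couple of simple checks along the following lines are usually enough. This is just a sketch, assuming the `categories` and `books` tables from question 2 already exist in the database.

[source,python]
----
import sqlite3

conn = sqlite3.connect('onlinebooks.db')
cur = conn.cursor()
# how many rows made it into each table?
print(cur.execute("SELECT COUNT(*) FROM categories").fetchone())
print(cur.execute("SELECT COUNT(*) FROM books").fetchone())
# spot check a few rows, including the new category field
print(cur.execute("SELECT title, category, price, availability FROM books LIMIT 5").fetchall())
conn.close()
----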
- -[TIP] -==== -In project 11, we used a dictionary (associative array) to hold the book information, like this: - -[source,python] -book = { - "title":title, - "price":price, - "category":category_name, - "availability":availability -} - -We may need to use a different data structure, like a tuple, to hold the book information since we need to insert it into the books table, like this: -[source,python] -book_info = (title,price,category_name,availability) -==== -[TIP] -==== -Here is partial code to assist. - -[source,python] ---- -import sqlite3 - -... -# code to get the book information for each article -book_info = (title,price,category_name,availability) -# connect to database -conn = sqlite3.connect('onlinebooks.db') -cur = conn.cursor() -for article in articles: - cur.execute('INSERT INTO BOOKS (title,price,category,availability) VALUES (?,?,?,?)',book_info) -conn.commit() -conn.close() ---- -==== - -Project 12 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project12.ipynb` -* Submit files through Gradescope -==== -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project13.adoc deleted file mode 100644 index f9eab0849..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project13.adoc +++ /dev/null @@ -1,216 +0,0 @@ -= TDM 40100: Project 13 -- 2023 - -**Motivation:** Containers are everywhere and a very popular method of packaging an application with all of the requisite dependencies. In this project we will learn some basics of containerization in a virtual environment using Alpine Linux. We will first start a virtual machine on Anvil, then create a simple container in the virtual machine. You may find more information about containers, and the relationship between virtual machines and containers, here: https://www.redhat.com/en/topics/containers/whats-a-linux-container - -**Context:** This project provides very foundational knowledge about containers and virtualization, focusing on theoretical understanding and basic system interactions. - -**Scope:** Python, containers, UNIX - -.Learning Objectives -**** -- Improve your mental model of what a container is and why it is useful. -- Use UNIX tools to effectively create a container. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 (1 pt) - -[loweralpha] - -.. Log on to Anvil and use a bash command to find an available port that you may use later. You only need to list 1 available port number. - -[TIP] -==== -- You may use the following code to check whether a port in the range 1025 to 65535 (1025 in the example) is available. -- You may use a loop around the following code, to find an available port (instead of manually trying one by one), or you can find an available port in a different way, if you prefer.
-[source, bash] ---- -if timeout 2 bash -c ">&/dev/tcp/127.0.0.1/1025" 2>/dev/null; then - echo "Port used" -else - echo "Port available" -fi ---- -==== - -=== Question 2 (2 pts) - -.. Launch a virtual machine (VM) on Anvil. (Note that Docker is already pre-installed on Anvil.) Submit the output showing the job id and process id after you start a virtual machine; it should look like this, for example: - -[source,bash] ---- -.output -[1] 3152048 ---- - -[NOTE] -==== -The most popular containerization tool at the time of writing is likely Docker. We will launch a virtual machine on Anvil (which already has Docker pre-installed). - -Open up a terminal on Anvil. You may do it from within Jupyter Lab. Run the following code to ensure that the SLURM environment variables don't alter or affect our SLURM job. - -[source,bash] ---- -for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done; ---- - -Next, let's make a copy of a pre-made operating system image. This image has Alpine Linux and a few basic tools installed, including: nano, vim, emacs, and Docker. - -[source,bash] ---- -cp /anvil/projects/tdm/apps/qemu/images/builder.qcow2 $SCRATCH ---- - -Next, we want to acquire enough resources (CPU and memory) so that we do not have to worry about something not working. To do this, we will use SLURM to launch a job with 4 cores and about 8GB of memory. - -[source,bash] ---- -salloc -A cis220051 -p shared -n 4 -c 1 -t 04:00:00 ---- - -Next, we need to make `qemu` available to our shell. Open a terminal and run the following code. - -[source,bash] ---- -module load qemu -# check that the module loaded -module list ---- - -Next, let's launch our virtual machine with about 8GB of memory and 4 cores. Replace "[port]" with the port number that you got from question 1. - -[source,bash] ---- -qemu-system-x86_64 -vnc none,ipv4 -hda $SCRATCH/builder.qcow2 -m 8G -smp 4 -enable-kvm -net nic -net user,hostfwd=tcp::[port]-:22 & ---- - -[IMPORTANT] -==== -- [port] needs to be replaced with your port number -==== - -Next, it is time to connect to our virtual machine. We will use `ssh` to do this. - -[source,bash] ---- -ssh -p [port] tdm@localhost -o StrictHostKeyChecking=no ---- - -If the command fails, try waiting a minute and rerunning the command -- it may take a minute for the virtual machine to boot up. - -When prompted for a password, enter `purdue`. Your username is `tdm` and password is `purdue`. - -Finally, now that you have a shell in your virtual machine, you can do anything you want! You have superuser permissions within your virtual machine! -For this question, submit a screenshot showing the output of `hostname` from within your virtual machine! - -==== - - -=== Question 3 (1 pt) - -.. Explore the virtual machine file system. Navigate the Alpine Linux file system and list the contents of the root directory. -.. List all running processes in the system. -.. Display the network configuration and test network connectivity. -[TIP] -==== -- You may refer to the following sample code or create your own. - -[source, bash] ---- -ls / # list the contents of the root directory ---- - -[source, bash] ---- -ps aux # system running processes ---- - -[source,bash] ---- -ifconfig # network interface configuration ---- - -==== - -=== Question 4 (2 pts) -.. Write and execute a simple shell script to display a message, like - -[TIP] -==== -[source, bash] ---- -echo "echo 'Hello Your name, You are the Best!!!'"
> hello.sh ---- - -- Run the shell script: - -[source, bash] ---- -chmod +x hello.sh -./hello.sh ---- -==== - - -=== Question 5 (2 pts) - -After you complete the previous questions, you can see that you can use the virtual machine just like your own computer. Now let us follow the steps below to use Docker within the virtual machine to create and manage a container. Run all the commands in your terminal, and copy the output into your Jupyter notebook cells. - -.. List the Docker version inside the virtual machine -[source, bash] ---- -docker --version ---- - -.. Pull the "hello-world" image from Docker Hub - -[source, bash] ---- -docker pull hello-world ---- - -.. Run a container based on the "hello-world" image - -[source, bash] ---- -docker run hello-world ---- - -[NOTE] -==== -When the command runs, Docker will create a container from the 'hello-world' image and run it. The container will display a message confirming that everything worked, and then it will exit. -==== - -.. List the container(s) with the following command. It will show you all the containers that are currently running or that have already exited. -[source, bash] ---- -docker ps -a ---- - -.. After you confirm that the container ran successfully, you may use the following command to remove it - -[source, bash] ---- -docker rm [Container_id] ---- - -[TIP] -==== -Replace [Container_id] with the id that you got from the previous step. -==== - -Project 13 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project13.ipynb` -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project14.adoc b/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project14.adoc deleted file mode 100644 index b7cccbcb7..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-project14.adoc +++ /dev/null @@ -1,53 +0,0 @@ -= TDM 40100: Project 14 -- Fall 2023 - -**Motivation:** We covered a _lot_ this year! When dealing with data-driven projects, it is crucial to thoroughly explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance. As this is our final project for the semester, its primary purpose is survey based. You will answer a few questions, mostly by revisiting the projects you have completed. - -**Context:** We are on the last project, where we will revisit our previous work to consolidate our learning and insights. This reflection also helps us to set our expectations for the upcoming semester. - -**Scope:** Unix, SQLite, R, Python, Jupyter Lab, Anvil - - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - - -=== Question 1 (1 pt) - -.. Reflecting on your experience working with different datasets, which one did you find most enjoyable, and why?
Discuss how this dataset's features influenced your analysis and visualization strategies. Illustrate your explanation with an example from one question that you worked on, using the dataset. - -=== Question 2 (1 pt) - -.. Reflecting on your experience working with different commands, functions, modules, and packages, which one is your favorite, and why do you enjoy learning about it? Please provide an example from one question that you worked on, using this command, function, module, or package. - - -=== Question 3 (1 pt) - -.. Reflecting on data visualization questions that you have done, which one do you consider most appealing? Which specific package did you use to create it? Please provide an example from one question that you completed. You may refer to the question, and screenshot your graph. - -=== Question 4 (2 pts) - -.. While working on the projects, including statistics and testing, what steps did you take to ensure that the results were right? Please illustrate your approach using an example from one problem that you addressed this semester. - -=== Question 5 (1 pt) - -.. Reflecting on the projects that you completed, which question(s) did you feel were most confusing, and how could they be made clearer? Please use a specific question to illustrate your points. - -=== Question 6 (2 pts) - -.. Please identify 3 skills or topics related to data science that you want to learn. For each, please provide an example that illustrates your interests, and the reason that you think they would be beneficial. - - -Project 14 Assignment Checklist -==== -* Jupyter Lab notebook with your answers and examples. You may just use markdown format for all questions. - ** `firstname-lastname-project14.ipynb` -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-projects.adoc deleted file mode 100644 index c7e7d8aa8..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/40100/40100-2023-projects.adoc +++ /dev/null @@ -1,45 +0,0 @@ -= TDM 40100 - -xref:fall2023/logistics/office_hours_401.adoc[[.custom_button]#TDM 401 Office Hours#] -xref:fall2023/logistics/401_TAs.adoc[[.custom_button]#TDM 401 TAs#] -xref:fall2023/logistics/syllabus.adoc[[.custom_button]#Syllabus#] - -== Project links - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$40100-2023-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date.
If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -=== Sign up - -https://piazza.com/purdue/fall2023/tdm40100[Sign Up] - -=== Link - -https://piazza.com/purdue/fall2023/tdm40100/home[Homepage] - -== Syllabus - -See xref:fall2023/logistics/syllabus.adoc[here]. diff --git a/projects-appendix/modules/ROOT/pages/fall2023/logistics/101_TAs.adoc b/projects-appendix/modules/ROOT/pages/fall2023/logistics/101_TAs.adoc deleted file mode 100644 index 1cac279cf..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/logistics/101_TAs.adoc +++ /dev/null @@ -1,57 +0,0 @@ -= TDM 101 T.A.s - Fall 2023 - -*Pramey Kabra*, Head TA: kabrap@purdue.edu, https://purdue-edu.zoom.us/j/8828914260 - -[NOTE] -==== -Use the link below to give your favorite seminar TAs a shout-out, and tell us how they helped you learn at The Data Mine! - -https://forms.office.com/r/mzM3ACwWqP -==== - - -== Student-facing T.A.s: - -[NOTE] -==== -You can find the office hours schedule for the student-facing TAs on the xref:fall2023/logistics/office_hours.adoc[*office hours*] page. -==== - -[%header,format=csv] -|=== -Name,Zoom Meeting Link -Adarsh Rao,https://purdue-edu.zoom.us/j/8107523485 -Ashwin Prasad,https://purdue-edu.zoom.us/j/3875699463 -Bharath Sadagopan,https://purdue-edu.zoom.us/j/3771770608 -Brennan Frank,https://purdue-edu.zoom.us/j/9248396753 -Crystal Mathew,https://purdue-edu.zoom.us/j/4328502936 -Chaewon Oh,https://purdue-edu.zoom.us/j/2057551901 -Daniel Lee,https://purdue-edu.zoom.us/j/5520266708 -Derek Sun,https://purdue-edu.zoom.us/j/9929033218 -Hpung San Aung,https://purdue-edu.zoom.us/j/2137482636 -Jackson Fair,https://purdue-edu.zoom.us/j/2596138255 -Minsoo Oh,https://purdue-edu.zoom.us/j/5636658112 -Nihar Atri,https://purdue-edu.zoom.us/my/niharatri -Rhea Pahuja,https://purdue-edu.zoom.us/my/rheapahuja -Sabharinath Saravanan,https://purdue-edu.zoom.us/j/3822379850 -Samhitha Mupharaphu,https://purdue-edu.zoom.us/j/2125099079 -Sanjhee Gupta,https://purdue-edu.zoom.us/j/99124158496 -Sharan Sivakumar,https://purdue-edu.zoom.us/j/9714731980 -Shree Krishna Tulasi Bavana,https://purdue-edu.zoom.us/j/3910698373 -Shreya Ippili,https://purdue-edu.zoom.us/s/7538548710 -Shrinivas Venkatesan,https://purdue-edu.zoom.us/j/7671793454 -Vivek Chudasama,https://purdue-edu.zoom.us/j/5629339720 -Yifei Jin,https://purdue-edu.zoom.us/j/6469143597 - -|=== - -== Graders: - -- Tushar Singh -- Dheeraj Namargomala -- Mridhula Srinivasan -- Matthew Pak -- Shriya Gupta -- Connor Barnsley -- David Martin Calalang -- Gaurav Singh \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/logistics/201_TAs.adoc b/projects-appendix/modules/ROOT/pages/fall2023/logistics/201_TAs.adoc deleted file mode 100644 index fc4c1183d..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/logistics/201_TAs.adoc +++ /dev/null @@ -1,35 +0,0 @@ -= TDM 201 T.A.s - Fall 2023 - -*Head TA*: Pramey Kabra - kabrap@purdue.edu - -[NOTE] -==== -Use the link below to give your 
favorite seminar TAs a shout-out, and tell us how they helped you learn at The Data Mine! - -https://forms.office.com/r/mzM3ACwWqP -==== - - -== Student-facing T.A.s: - -[NOTE] -==== -You can find the office hours schedule for the student-facing TAs on the xref:fall2023/logistics/office_hours.adoc[*office hours*] page. -==== - -[%header,format=csv] -|=== -Name,Zoom Meeting Link -Ananya Goel,https://purdue-edu.zoom.us/j/2256232958 -Dhruv Shah,https://purdue-edu.zoom.us/j/7247613499 -Jackson Fair,https://purdue-edu.zoom.us/j/2596138255 -Joseph Lee,https://purdue-edu.zoom.us/my/joseolee -Nikhil Saxena,https://purdue-edu.zoom.us/j/5711628431 - -|=== - -== Graders: - -- Jack Secor -- Tong En Sim (Nicole) -- Tori Donoho diff --git a/projects-appendix/modules/ROOT/pages/fall2023/logistics/301_TAs.adoc b/projects-appendix/modules/ROOT/pages/fall2023/logistics/301_TAs.adoc deleted file mode 100644 index 491f70638..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/logistics/301_TAs.adoc +++ /dev/null @@ -1,28 +0,0 @@ -= TDM 301 T.A.s - Fall 2023 - -*Head TA*: Pramey Kabra - kabrap@purdue.edu - -[NOTE] -==== -Use the link below to give your favorite seminar TAs a shout-out, and tell us how they helped you learn at The Data Mine! - -https://forms.office.com/r/mzM3ACwWqP -==== - - -== Grader + Student-facing T.A.s: - -[NOTE] -==== -You can find the office hours schedule for the student-facing TAs on the xref:fall2023/logistics/office_hours.adoc[*office hours*] page. -==== - -[%header,format=csv] -|=== -Name,Zoom Meeting Link -Aditya Bhoota,https://purdue-edu.zoom.us/j/2299552087 -Ankush Maheshwari,https://purdue-edu.zoom.us/j/2189997502 -Brian Fernando,https://purdue-edu.zoom.us/j/3580310418 -Jackson Fair,https://purdue-edu.zoom.us/j/2596138255 - -|=== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/logistics/401_TAs.adoc b/projects-appendix/modules/ROOT/pages/fall2023/logistics/401_TAs.adoc deleted file mode 100644 index 869381d75..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/logistics/401_TAs.adoc +++ /dev/null @@ -1,24 +0,0 @@ -= TDM 401 T.A.s - Fall 2023 - -*Head TA*: Pramey Kabra - kabrap@purdue.edu - -[NOTE] -==== -Use the link below to give your favorite seminar TAs a shout-out, and tell us how they helped you learn at The Data Mine! - -https://forms.office.com/r/mzM3ACwWqP -==== - -== Grader + Student-facing T.A.s: - -[NOTE] -==== -You can find the office hours schedule for the student-facing TAs on the xref:fall2023/logistics/office_hours.adoc[*office hours*] page. -==== - -[%header,format=csv] -|=== -Name,Zoom Meeting Link -Jackson Fair,https://purdue-edu.zoom.us/j/2596138255 - -|=== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/logistics/office_hours.adoc b/projects-appendix/modules/ROOT/pages/fall2023/logistics/office_hours.adoc deleted file mode 100644 index 36258efa6..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/logistics/office_hours.adoc +++ /dev/null @@ -1,11 +0,0 @@ -= TA Office Hours - -Please select the level you are in to see the office hours schedule for Fall 2023. 
- -xref:fall2023/logistics/office_hours_101.adoc[[.custom_button]#TDM 101#] - -xref:fall2023/logistics/office_hours_201.adoc[[.custom_button]#TDM 201#] - -xref:fall2023/logistics/office_hours_301.adoc[[.custom_button]#TDM 301#] - -xref:fall2023/logistics/office_hours_401.adoc[[.custom_button]#TDM 401#] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/logistics/office_hours_101.adoc b/projects-appendix/modules/ROOT/pages/fall2023/logistics/office_hours_101.adoc deleted file mode 100644 index 827f5f678..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/logistics/office_hours_101.adoc +++ /dev/null @@ -1,14 +0,0 @@ -= TA Office Hours - TDM 10100 - -Office hours locations: - -- **Office hours _before_ 5:00 PM EST:** Hillenbrand Hall Lobby C100 -- **Office hours _after_ 5:00 PM EST:** Online in Zoom + -- **Office hours on the _weekend_:** Online in Zoom - -[NOTE] -==== -You can find Zoom room links for the T.A.s on the xref:fall2023/logistics/ta_teams.adoc[*T.A. Teams*] page. -==== - -image::office_hours_101.png[TDM 101 Office Hours] diff --git a/projects-appendix/modules/ROOT/pages/fall2023/logistics/office_hours_201.adoc b/projects-appendix/modules/ROOT/pages/fall2023/logistics/office_hours_201.adoc deleted file mode 100644 index 902e2defe..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/logistics/office_hours_201.adoc +++ /dev/null @@ -1,14 +0,0 @@ -= TA Office Hours - TDM 20100 - -Office hour locations: - -- **Office hours _before_ 5:00 PM EST:** Hillenbrand Hall Lobby C100 -- **Office hours _after_ 5:00 PM EST:** Online in Zoom + -- **Office hours on the _weekend_:** Online in Zoom - -[NOTE] -==== -You can find Zoom room links for the T.A.s on the xref:fall2023/logistics/ta_teams.adoc[*T.A. Teams*] page. -==== - -image::office_hours_201.png[TDM 201 Office Hours] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/logistics/office_hours_301.adoc b/projects-appendix/modules/ROOT/pages/fall2023/logistics/office_hours_301.adoc deleted file mode 100644 index a495926a0..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/logistics/office_hours_301.adoc +++ /dev/null @@ -1,19 +0,0 @@ -= TA Office Hours - TDM 30100 - -Office hour locations: - -- **Office hours _before_ 5:00 PM EST:** Hillenbrand Hall Lobby C100 -- **Office hours _after_ 5:00 PM EST:** Online in Zoom + -- **Office hours on the _weekend_:** Online in Zoom - -[NOTE] -==== -You can find Zoom room links for the T.A.s on the xref:fall2023/logistics/ta_teams.adoc[*T.A. Teams*] page. -==== - -[NOTE] -==== -Jackson's office hours will be held online regularly, but can be made in-person by request only. Please send an email to fairj@purdue.edu to request an in-person meeting. 
-==== - -image::office_hours_301.png[TDM 301 Office Hours] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/logistics/office_hours_401.adoc b/projects-appendix/modules/ROOT/pages/fall2023/logistics/office_hours_401.adoc deleted file mode 100644 index ba98ddfea..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/logistics/office_hours_401.adoc +++ /dev/null @@ -1,19 +0,0 @@ -= TA Office Hours - TDM 40100 - -Office hour locations: - -- **Office hours _before_ 5:00 PM EST:** Hillenbrand Hall Lobby C100 -- **Office hours _after_ 5:00 PM EST:** Online in Zoom + -- **Office hours on the _weekend_:** Online in Zoom - -[NOTE] -==== -You can find Zoom room links for the T.A.s on the xref:fall2023/logistics/ta_teams.adoc[*T.A. Teams*] page. -==== - -[NOTE] -==== -Jackson's office hours will be held online regularly, but can be made in-person by request only. Please send an email to fairj@purdue.edu to request an in-person meeting. -==== - -image::office_hours_401.png[TDM 401 Office Hours] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/logistics/schedule.adoc b/projects-appendix/modules/ROOT/pages/fall2023/logistics/schedule.adoc deleted file mode 100644 index 6a4a5dc46..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/logistics/schedule.adoc +++ /dev/null @@ -1,28 +0,0 @@ -= Course Schedule - Fall 2023 - -Below is the course schedule and includes release and due dates for syllabus and academic integrity quizzes, weekly projects, and outside events. - -[%header,format=csv] -|=== -Project,Release date,Due date -Syllabus Quiz,Aug 20,Sep 1 -Academic Integrity Quiz,Aug 20,Sep 1 -Project 1,Aug 21,Sep 1 -Project 2,Aug 24,Sep 1 -Project 3,Aug 31,Sep 8 -Project 4,Sep 7,Sep 15 -Outside Event 1,Aug 21,Sep 15 -Project 5,Sep 14,Sep 22 -Project 6,Sep 21,Sep 29 -Project 7,Sep 28,Oct 6 -Outside Event 2,Aug 21,Oct 6 -Project 8,Oct 5,Oct 20 -Project 9,Oct 19,Oct 27 -Project 10,Oct 26,Nov 3 -Project 11,Nov 2,Nov 10 -Outside Event 3,Aug 21,Nov 10 -Project 12,Nov 9,Nov 17 -Project 13,Nov 16,Dec 1 -Project 14,Nov 30,Dec 8 - -|=== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/logistics/syllabus.adoc b/projects-appendix/modules/ROOT/pages/fall2023/logistics/syllabus.adoc deleted file mode 100644 index 077b86467..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/logistics/syllabus.adoc +++ /dev/null @@ -1,293 +0,0 @@ -= Fall 2023 Syllabus - The Data Mine Seminar - -== Course Information - - -[%header,format=csv,stripes=even] -|=== -Course Number and Title, CRN -TDM 10100 - The Data Mine I, possible CRNs are 12067 or 12072 or 12073 or 12071 -TDM 20100 - The Data Mine III, possible CRNs are 12117 or 12106 or 12113 or 12118 -TDM 30100 - The Data Mine V, possible CRNs are 12104 or 12112 or 12115 or 12120 -TDM 40100 - The Data Mine VII, possible CRNs are 12103 or 12111 or 12114 or 12119 -TDM 50100 - The Data Mine Seminar, possible CRNs are 15644 or 30617 or 30618 or 30619 -|=== - -*Course credit hours:* 1 Credit hour, so you should expect to spend about 3 hours per week doing work for the class - -*Prerequisites:* -None for TDM 10100. All students, regardless of background are welcome. Typically, students new to The Data Mine sign up for TDM 10100, students in their second, third, or fourth years of The Data Mine sign up for TDM 20100, TDM 30100, and TDM 40100, respectively. TDM 50100 is geared toward graduate students. 
However, during the first week of the semester only, if a student new to The Data Mine has several years of data science experience and would prefer to switch from TDM 10100 to TDM 20100, we can make adjustments on an individual basis. - -=== Course Web Pages - -- link:https://the-examples-book.com/[*The Examples Book*] - All information will be posted within these pages! -- link:https://www.gradescope.com/[*Gradescope*] - All projects and outside events will be submitted on Gradescope -- link:https://purdue.brightspace.com/[*Brightspace*] - Grades will be posted in Brightspace -- link:https://piazza.com[*Piazza*] - Online Q/A Forum -- link:https://datamine.purdue.edu[*The Data Mine's website*] - Helpful resource -- link:https://ondemand.anvil.rcac.purdue.edu/[*Jupyter Lab via the On Demand Gateway on Anvil*] - -=== Meeting Times -There are officially 4 Monday class times: 8:30 am, 9:30 am, 10:30 am (all in the Hillenbrand Dining Court atrium--no meal swipe required), and 4:30 pm (link:https://purdue-edu.zoom.us/my/mdward[synchronous online], recorded and posted later). Attendance is not required. - -For National Data Mine Network and Indiana Data Mine Network students, we have a dedicated (link:https://purdue-edu.zoom.us/my/mdward[online]) meeting time for you with Dr Ward Mondays at 4:00pm ET - -All the information you need to work on the projects each week will be provided online on the *Thursday* of the previous week, and we encourage you to get a head start on the projects before class time. Dr. Ward does not lecture during the class meetings, but this is a good time to ask questions and get help from Dr. Ward, the T.A.s, and your classmates. The T.A.s will have many daytime and evening office hours throughout the week. - -=== Course Description - -The Data Mine is a supportive environment for students in any major and from any background who want to learn data science skills. Students will have hands-on experience with computational tools for representing, extracting, manipulating, interpreting, transforming, and visualizing data, especially big data sets, and in effectively communicating insights about data. Topics include: the R environment, Python, visualizing data, UNIX, bash, regular expressions, SQL, XML and scraping data from the internet, as well as selected advanced topics, as time permits. - -=== Learning Outcomes - -By the end of the course, you will be able to: - -1. Discover data science and professional development opportunities in order to prepare for a career. -2. Explain the difference between research computing and basic personal computing data science capabilities in order to know which system is appropriate for a data science project. -3. Design efficient search strategies in order to acquire new data science skills. -4. Devise the most appropriate data science strategy in order to answer a research question. -5. Apply data science techniques in order to answer a research question about a big data set. - -=== Required Materials - -* A laptop so that you can easily work with others. Having audio/video capabilities is useful. -* Access to Brightspace, Gradescope, and Piazza course pages. -* Access to Jupyter Lab at the On Demand Gateway on Anvil: -https://ondemand.anvil.rcac.purdue.edu/ -* "The Examples Book": https://the-examples-book.com -* Good internet connection. - -=== Attendance Policy - -Attendance is not required. 
- -When conflicts or absences can be anticipated, such as for many University-sponsored activities and religious observations, the student should inform the instructor of the situation as far in advance as possible. - -For unanticipated or emergency absences when advance notification to the instructor is not possible, the student should contact the instructor as soon as possible by email or phone. When the student is unable to make direct contact with the instructor and is unable to leave word with the instructor’s department because of circumstances beyond the student’s control, and in cases falling under excused absence regulations, the student or the student’s representative should contact or go to the Office of the Dean of Students website to complete appropriate forms for instructor notification. Under academic regulations, excused absences may be granted for cases of grief/bereavement, military service, jury duty, parenting leave, and medical excuse. For details, see the link:https://catalog.purdue.edu/content.php?catoid=13&navoid=15965#a-attendance[Academic Regulations & Student Conduct section] of the University Catalog website. - -== How to succeed in this course - -If you would like to be a successful Data Mine student: - -* Start on the weekly projects on or before Mondays so that you have plenty of time to get help from your classmates, TAs, and Data Mine staff. Don't wait until the due date to start! -* Be excited to challenge yourself and learn impressive new skills. Don't get discouraged if something is difficult--you're here because you want to learn, not because you already know everything! -* Remember that Data Mine staff and TAs are excited to work with you! Take advantage of us as resources. -* Network! Get to know your classmates, even if you don't see them in an actual classroom. You are all part of The Data Mine because you share interests and goals. You have over 800 potential new friends! -* Use "The Examples Book" with lots of explanations and examples to get you started. Google, Stack Overflow, etc. are all great, but "The Examples Book" has been carefully put together to be the most useful to you. https://the-examples-book.com -* Expect to spend approximately 3 hours per week on the projects. Some might take less time, and occasionally some might take more. -* Don't forget about the syllabus quiz, academic integrity quiz, and outside event reflections. They all contribute to your grade and are part of the course for a reason. -* If you get behind or feel overwhelmed about this course or anything else, please talk to us! -* Stay on top of deadlines. Announcements will also be sent out every Monday morning, but you -should keep a copy of the course schedule where you see it easily. -* Read your emails! 
- -== Information about the Instructors - -=== The Data Mine Staff - -[%header,format=csv] -|=== -Name, Title -Shared email we all read, datamine-help@purdue.edu -Kevin Amstutz, Senior Data Scientist -Donald Barnes, Guest Relations Administrator -Maggie Betz, Managing Director of Corporate Partnerships -Kimmie Casale, ASL Tutor -Cai Chen, Corporate Partners Technical Specialist -Doug Crabill, Senior Data Scientist -Lauren Dalder, Corporate Partners Advisor -Stacey Dunderman, Program Administration Specialist -David Glass, Managing Director of Data Science -Betsy Hillery, Business Development Administrator -Emily Hoeing, Corporate Partners Advisor -Jessica Jud, Senior Manager of Expansion Operations -Kali Lacy, Associate Research Engineer -Gloria Lenfestey, Research Development Administrator -Nicholas Lenfestey, Corporate Partners Technical Specialist -Naomi Mersinger, ASL Interpreter / Strategic Initiatives Coordinator -Kim Rechkemmer, Senior Program Administration Specialist -Nick Rosenorn, Corporate Partners Technical Specialist -Katie Sanders, Operations Manager -Betsy Satchell, Senior Administrative Assistant -Dr. Rebecca Sharples, Managing Director of Academic Programs and Outreach -Dr. Mark Daniel Ward, Director -Josh Winchester, Data Science Technical Specialist -Cindy Zhou, Senior Data Science Instructional Specialist - -|=== - -The Data Mine Team uses a shared email which functions as a ticketing system. Using a shared email helps the team manage the influx of questions, better distribute questions across the team, and send out faster responses. - -*For the purposes of getting help with this seminar class, your most important people are:* - -* *T.A.s*: Check who your T.A.s are on the xref:fall2023/logistics/ta_teams.adoc[T.A. Teams] page. Visit their xref:fall2023/logistics/office_hours.adoc[office hours] or use the link:https://piazza.com/[Piazza forum] to get in touch. -* *Dr. Mark Daniel Ward*, Director: Dr. Ward responds to questions on Piazza faster than by email - - -=== Communication Guidance - -* *For questions about how to do the homework, use Piazza or visit office hours*. You will receive the fastest response by using Piazza versus emailing us. -* For general Data Mine questions, email datamine-help@purdue.edu -* For regrade requests, use Gradescope's regrade feature within Brightspace. Regrades should be -requested within 1 week of the grade being posted. - - -=== Office Hours - -The xref:fall2023/logistics/office_hours.adoc[office hours schedule is posted here.] - -Office hours are held in person in Hillenbrand lobby and on Zoom. Check the schedule to see the available schedule. - -=== Piazza - -Piazza is an online discussion board where students can post questions at any time, and Data Mine staff or T.A.s will respond. Piazza is available through Brightspace. There are private and public postings. Last year we had over 11,000 interactions on Piazza, and the typical response time was around 5-10 minutes! - -[NOTE] -==== -Use the link below to give your favorite seminar T.A.s a shout-out, and tell us how they helped you learn at The Data Mine! 
- -https://forms.office.com/r/mzM3ACwWqP -==== - - -== Assignments and Grades - -=== Course Schedule & Due Dates - -xref:fall2023/logistics/schedule.adoc[Click here to view the Fall 2023 Course Schedule] - -See the schedule and later parts of the syllabus for more details, but here is an overview of how the course works: - -In the first week of the beginning of the semester, you will have some "housekeeping" tasks to do, which include taking the Syllabus quiz and Academic Integrity quiz. - -Generally, every week from the very beginning of the semester, you will have your new projects released on a Thursday, and they are due 8 days later on the following Friday at 11:59 pm Purdue West Lafayette (Eastern) time. This semester, there are 14 weekly projects, but we only count your best 10. This means you could miss up to 4 projects due to illness or other reasons, and it won't hurt your grade. - -We suggest trying to do as many projects as possible so that you can keep up with the material. The projects are much less stressful if they aren't done at the last minute, and it is possible that our systems will be stressed if you wait until Friday night causing unexpected behavior and long wait times. *Try to start your projects on or before Monday each week to leave yourself time to ask questions.* - -Outside of projects, you will also complete 3 Outside Event reflections. More information about these is in the "Outside Event Reflections" section below. - -The Data Mine does not conduct or collect an assessment during the final exam period. Therefore, TDM Courses are not required to follow the Quiet Period in the link:https://catalog.purdue.edu/content.php?catoid=15&navoid=18634#academic-calendar[Academic Calendar]. - -=== Projects - -* The projects will help you achieve Learning Outcomes #2-5. -* Each weekly programming project is worth 10 points. -* There will be 14 projects available over the semester, and your best 10 will count. -* The 4 project grades that are dropped could be from illnesses, absences, travel, family -emergencies, or simply low scores. No excuses necessary. -* No late work will be accepted, even if you are having technical difficulties, so do not work at the -last minute. -* There are many opportunities to get help throughout the week, either through Piazza or office -hours. We're waiting for you! Ask questions! -* Follow the instructions for how to submit your projects properly through Gradescope in -Brightspace. -* It is ok to get help from others or online, although it is important to document this help in the -comment sections of your project submission. You need to say who helped you and how they -helped you. -* Each week, the project will be posted on the Thursday before the seminar, the project will be -the topic of the seminar and any office hours that week, and then the project will be due by -11:55 pm Eastern time on the following Friday. See the schedule for specific dates. -* If you need to request a regrade on any part of your project, use the regrade request feature -inside Gradescope. The regrade request needs to be submitted within one week of the grade being posted (we send an announcement about this). - - -=== Outside Event Reflections - -* The Outside Event reflections will help you achieve Learning Outcome #1. They are an opportunity for you to learn more about data science applications, career development, and diversity. -* You are required to attend 3 of these over the semester, with 1 due each month. See the schedule for specific due dates. 
Feel free to complete them early.
-** Outside Event Reflections *must* be submitted within 1 week of attending the event or watching the recording.
-** At least one of these events should be on the topic of Professional Development (designated by "PD" on the schedule).
-* Outside events are posted on The Data Mine's website (https://datamine.purdue.edu/events/) and updated frequently. Let us know about any good events you hear about.
-* Format of Outside Events:
-** Often in person so you can interact with the presenter!
-** Occasionally online and possibly recorded
-* Follow the instructions in Gradescope for writing and submitting these reflections. Each reflection should include:
-*** The name of the event and speaker
-*** The time and date of the event
-*** What was discussed at the event
-*** What you learned from it
-*** What new ideas you would like to explore as a result of what you learned at the event
-*** AND what question(s) you would like to ask the presenter if you met them at an after-presentation reception.
-* We read every single reflection! We care about what you write! We have used these connections to provide new opportunities for you, to thank our speakers, and to learn more about what interests you.
-
-=== Late Work Policy
-
-We generally do NOT accept late work. For the projects, we count only your best 10 out of 14, so that gives you a lot of flexibility. We need to be able to post answer keys for the rest of the class in a timely manner, and we can't do this if we are waiting for other students to turn their work in.
-
-=== Grade Distribution
-
-[cols="4,1"]
-|===
-
-|Projects (best 10 out of Projects #1-14) |86%
-|Outside event reflections (3 total) |12%
-|Academic Integrity Quiz |1%
-|Syllabus Quiz |1%
-|*Total* |*100%*
-
-|===
-
-
-=== Grading Scale
-In this class, grades reflect your achievement throughout the semester in the various course components listed above. Your grades will be maintained in Brightspace. This course will follow the 90-80-70-60 grading scale for A, B, C, D cut-offs. If you earn a 90.000 in the class, for example, that is a solid A. +/- grades will be given at the instructor's discretion below these cut-offs. If you earn an 89.11 in the class, for example, this may be an A- or a B+.
-
-* A: 100.000% - 90.000%
-* B: 89.999% - 80.000%
-* C: 79.999% - 70.000%
-* D: 69.999% - 60.000%
-* F: 59.999% - 0.000%
-
-=== Academic Integrity
-
-Academic integrity is one of the highest values that Purdue University holds. Individuals are encouraged to alert university officials to potential breaches of this value by link:mailto:integrity@purdue.edu[emailing] or by calling 765-494-8778. While information may be submitted anonymously, the more information you provide, the greater the opportunity the university has to investigate the concern.
-
-In TDM 10100/20100/30100/40100/50100, we encourage students to work together. However, there is a difference between good collaboration and academic misconduct. We expect you to read over this list, and you will be held responsible for violating these rules. We are serious about protecting the hard-working students in this course. We want a grade for The Data Mine seminar to have value for everyone and to represent what you truly know. We may punish both the student who cheats and the student who allows or enables another student to cheat. Punishment could include receiving a 0 on a project, receiving an F for the course, and having the incident of academic misconduct reported to the Office of the Dean of Students.
-
-*Good Collaboration:*
-
-* First try the project yourself, on your own.
-* After trying the project yourself, get together with a small group of other students who
-have also tried the project themselves to discuss ideas for how to do the more difficult problems. Document in the comments section any suggestions you took from your classmates or your TA.
-* Finish the project on your own so that what you turn in truly represents your own understanding of the material.
-* Look up potential solutions for how to do part of the project online, but document in the comments section where you found the information.
-* If the assignment involves writing a long, worded explanation, you may proofread somebody's completed written work and allow them to proofread your work. Do this only after you have both completed your own assignments, though.
-
-*Academic Misconduct:*
-
-* Dividing up the problems among a group. ("You do #1, I'll do #2, and he'll do #3; then we'll share our work to get the assignment done more quickly.")
-* Attending a group work session without having first worked all of the problems yourself.
-* Allowing your partners to do all of the work while you copy answers down, or allowing an
-unprepared partner to copy your answers.
-* Letting another student copy your work or doing the work for them.
-* Sharing files or typing on somebody else's computer or in their computing account.
-* Getting help from a classmate or a TA without documenting that help in the comments section.
-* Looking up a potential solution online without documenting that help in the comments section.
-* Reading someone else's answers before you have completed your work.
-* Having a tutor or TA work through all (or some) of your problems for you.
-* Uploading, downloading, or using old course materials from Course Hero, Chegg, or similar sites.
-* Using the same outside event reflection (or parts of it) more than once, or using an outside event reflection from a previous semester.
-* Using somebody else's outside event reflection rather than attending the event yourself.
-
-The link:https://www.purdue.edu/odos/osrr/honor-pledge/about.html[Purdue Honor Pledge] reads: "As a Boilermaker pursuing academic excellence, I pledge to be honest and true in all that I do. Accountable together - we are Purdue."
-
-Please refer to the link:https://www.purdue.edu/odos/osrr/academic-integrity/index.html[student guide for academic integrity] for more details.
-
-=== xref:fall2023/logistics/syllabus_purdue_policies.adoc[Purdue Policies & Resources]
-
-=== Disclaimer
-This syllabus is subject to small changes. All questions and feedback are always welcome!
- -// Includes: - -// * xref:fall2023/syllabus_purdue_policies.adoc#Class Behavior[Class Behavior] -// * xref:fall2023/syllabus_purdue_policies.adoc#Nondiscrimination Statement[Nondiscrimination Statement] -// * xref:fall2023/syllabus_purdue_policies.adoc#Students with Disabilities[Students with Disabilities] -// * xref:fall2023/syllabus_purdue_policies.adoc#Mental Health Resources[Mental Health Resources] -// * xref:fall2023/syllabus_purdue_policies.adoc#Violent Behavior Policy[Violent Behavior Policy] -// * xref:fall2023/syllabus_purdue_policies.adoc#Diversity and Inclusion Statement[Diversity and Inclusion Statement] -// * xref:fall2023/syllabus_purdue_policies.adoc#Basic Needs Security Resources[Basic Needs Security Resources] -// * xref:fall2023/syllabus_purdue_policies.adoc#Course Evaluation[Course Evaluation] -// * xref:fall2023/syllabus_purdue_policies.adoc#General Classroom Guidance Regarding Protect Purdue[General Classroom Guidance Regarding Protect Purdue] -// * xref:fall2023/syllabus_purdue_policies.adoc#Campus Emergencies[Campus Emergencies] -// * xref:fall2023/syllabus_purdue_policies.adoc#Illness and other student emergencies[Illness and other student emergencies] -// * xref:fall2023/syllabus_purdue_policies.adoc#Disclaimer[Disclaimer] diff --git a/projects-appendix/modules/ROOT/pages/fall2023/logistics/syllabus_purdue_policies.adoc b/projects-appendix/modules/ROOT/pages/fall2023/logistics/syllabus_purdue_policies.adoc deleted file mode 100644 index 2ed603182..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/logistics/syllabus_purdue_policies.adoc +++ /dev/null @@ -1,88 +0,0 @@ -== Purdue Policies & Resources - -=== Class Behavior - -You are expected to behave in a way that promotes a welcoming, inclusive, productive learning environment. You need to be prepared for your individual and group work each week, and you need to include everybody in your group in any discussions. Respond promptly to all communications and show up for any appointments that are scheduled. If your group is having trouble working well together, try hard to talk through the difficulties--this is an important skill to have for future professional experiences. If you are still having difficulties, ask The Data Mine staff to meet with your group. - - -*Purdue's Copyrighted Materials Policy:* - -Among the materials that may be protected by copyright law are the lectures, notes, and other material presented in class or as part of the course. Always assume the materials presented by an instructor are protected by copyright unless the instructor has stated otherwise. Students enrolled in, and authorized visitors to, Purdue University courses are permitted to take notes, which they may use for individual/group study or for other non-commercial purposes reasonably arising from enrollment in the course or the University generally. -Notes taken in class are, however, generally considered to be "derivative works" of the instructor's presentations and materials, and they are thus subject to the instructor's copyright in such presentations and materials. No individual is permitted to sell or otherwise barter notes, either to other students or to any commercial concern, for a course without the express written permission of the course instructor. To obtain permission to sell or barter notes, the individual wishing to sell or barter the notes must be registered in the course or must be an approved visitor to the class. 
Course instructors may choose to grant or not grant such permission at their own discretion, and may require a review of the notes prior to their being sold or bartered. If they do grant such permission, they may revoke it at any time, if they so choose. - -=== Nondiscrimination Statement -Purdue University is committed to maintaining a community which recognizes and values the inherent worth and dignity of every person; fosters tolerance, sensitivity, understanding, and mutual respect among its members; and encourages each individual to strive to reach his or her own potential. In pursuit of its goal of academic excellence, the University seeks to develop and nurture diversity. The University believes that diversity among its many members strengthens the institution, stimulates creativity, promotes the exchange of ideas, and enriches campus life. link:https://www.purdue.edu/purdue/ea_eou_statement.php[Link to Purdue's nondiscrimination policy statement.] - -=== Students with Disabilities -Purdue University strives to make learning experiences as accessible as possible. If you anticipate or experience physical or academic barriers based on disability, you are welcome to let me know so that we can discuss options. You are also encouraged to contact the Disability Resource Center at: link:mailto:drc@purdue.edu[drc@purdue.edu] or by phone: 765-494-1247. - -If you have been certified by the Office of the Dean of Students as someone needing a course adaptation or accommodation because of a disability OR if you need special arrangements in case the building must be evacuated, please contact The Data Mine staff during the first week of classes. We are happy to help you. - -=== Mental Health Resources - -* *If you find yourself beginning to feel some stress, anxiety and/or feeling slightly overwhelmed,* try link:https://purdue.welltrack.com/[WellTrack]. Sign in and find information and tools at your fingertips, available to you at any time. -* *If you need support and information about options and resources*, please contact or see the link:https://www.purdue.edu/odos/[Office of the Dean of Students]. Call 765-494-1747. Hours of operation are M-F, 8 am- 5 pm. -* *If you find yourself struggling to find a healthy balance between academics, social life, stress*, etc. sign up for free one-on-one virtual or in-person sessions with a link:https://www.purdue.edu/recwell/fitness-wellness/wellness/one-on-one-coaching/wellness-coaching.php[Purdue Wellness Coach at RecWell]. Student coaches can help you navigate through barriers and challenges toward your goals throughout the semester. Sign up is completely free and can be done on BoilerConnect. If you have any questions, please contact Purdue Wellness at evans240@purdue.edu. -* *If you're struggling and need mental health services:* Purdue University is committed to advancing the mental health and well-being of its students. If you or someone you know is feeling overwhelmed, depressed, and/or in need of mental health support, services are available. For help, such individuals should contact link:https://www.purdue.edu/caps/[Counseling and Psychological Services (CAPS)] at 765-494-6995 during and after hours, on weekends and holidays, or by going to the CAPS office of the second floor of the Purdue University Student Health Center (PUSH) during business hours. - -=== Violent Behavior Policy - -Purdue University is committed to providing a safe and secure campus environment for members of the university community. 
Purdue strives to create an educational environment for students and a work environment for employees that promote educational and career goals. Violent Behavior impedes such goals. Therefore, Violent Behavior is prohibited in or on any University Facility or while participating in any university activity. See the link:https://www.purdue.edu/policies/facilities-safety/iva3.html[University's full violent behavior policy] for more detail. - -=== Diversity and Inclusion Statement - -In our discussions, structured and unstructured, we will explore a variety of challenging issues, which can help us enhance our understanding of different experiences and perspectives. This can be challenging, but in overcoming these challenges we find the greatest rewards. While we will design guidelines as a group, everyone should remember the following points: - -* We are all in the process of learning about others and their experiences. Please speak with me, anonymously if needed, if something has made you uncomfortable. -* Intention and impact are not always aligned, and we should respect the impact something may have on someone even if it was not the speaker's intention. -* We all come to the class with a variety of experiences and a range of expertise, we should respect these in others while critically examining them in ourselves. - -=== Basic Needs Security Resources - -Any student who faces challenges securing their food or housing and believes this may affect their performance in the course is urged to contact the Dean of Students for support. There is no appointment needed and Student Support Services is available to serve students from 8:00 - 5:00, Monday through Friday. The link:https://www.purdue.edu/vpsl/leadership/About/ACE_Campus_Pantry.html[ACE Campus Food Pantry] is open to the entire Purdue community). - -Considering the significant disruptions caused by the current global crisis as it related to COVID-19, students may submit requests for emergency assistance from the link:https://www.purdue.edu/odos/resources/critical-need-fund.html[Critical Needs Fund]. - -=== Course Evaluation - -During the last two weeks of the semester, you will be provided with an opportunity to give anonymous feedback on this course and your instructor. Purdue uses an online course evaluation system. You will receive an official email from evaluation administrators with a link to the online evaluation site. You will have up to 10 days to complete this evaluation. Your participation is an integral part of this course, and your feedback is vital to improving education at Purdue University. I strongly urge you to participate in the evaluation system. - -You may email feedback to us anytime at link:mailto:datamine-help@purdue.edu[datamine-help@purdue.edu]. We take feedback from our students seriously, as we want to create the best learning experience for you! - -=== General Classroom Guidance Regarding Protect Purdue - -Any student who has substantial reason to believe that another person is threatening the safety of others by not complying with Protect Purdue protocols is encouraged to report the behavior to and discuss the next steps with their instructor. Students also have the option of reporting the behavior to the link:https://purdue.edu/odos/osrr/[Office of the Student Rights and Responsibilities]. 
See also link:https://catalog.purdue.edu/content.php?catoid=7&navoid=2852#purdue-university-bill-of-student-rights[Purdue University Bill of Student Rights] and the Violent Behavior Policy under University Resources in Brightspace. - -=== Campus Emergencies - -In the event of a major campus emergency, course requirements, deadlines and grading percentages are subject to changes that may be necessitated by a revised semester calendar or other circumstances. Here are ways to get information about changes in this course: - -* Brightspace or by e-mail from Data Mine staff. -* General information about a campus emergency can be found on the Purdue website: xref:www.purdue.edu[]. - - -=== Illness and other student emergencies - -Students with *extended* illnesses should contact their instructor as soon as possible so that arrangements can be made for keeping up with the course. Extended absences/illnesses/emergencies should also go through the Office of the Dean of Students. - -*Official Purdue University links to Resources and Guidelines:* - -=== University Policies and Statements - -- link:https://www.purdue.edu/odos/osrr/academic-integrity/index.html[Academic Integrity] -- link:https://www.purdue.edu/purdue/ea_eou_statement.php[Nondiscrimination Policy Statement] -- link:https://www.purdue.edu/advocacy/students/absences.html[Class Absences] -- link:https://catalog.purdue.edu/content.php?catoid=15&navoid=18634#classes[Attendance] -- link:https://www.purdue.edu/policies/ethics/iiia1.html[Amourous Relationships] -- link:https://www.purdue.edu/ehps/emergency-preparedness/[Emergency Preparedness] -- link:https://www.purdue.edu/policies/facilities-safety/iva3.html[Violent Behavior] -- link:https://www.purdue.edu/policies/academic-research-affairs/ia3.html[Use of Copyrighted Materials] - -=== Student Support and Resources - -- link:https://www.purdue.edu/asc/resources/get-engaged.html[Engage In Your Learning] -- link:https://www.purdue.edu/policies/information-technology/s5.html[Purdue's Web Accessibility Policy] -- link:https://www.d2l.com/accessibility/standards/[Accessibility Standard in Brightspace] - -=== Disclaimer -This syllabus is subject to change. Changes will be made by an announcement in Brightspace and the corresponding course content will be updated. \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/logistics/ta_teams.adoc b/projects-appendix/modules/ROOT/pages/fall2023/logistics/ta_teams.adoc deleted file mode 100644 index 1876596a9..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/logistics/ta_teams.adoc +++ /dev/null @@ -1,20 +0,0 @@ -= T.A. Teams - -Please select the level you are in to see your T.A.s for Fall 2023. - -*Head TA*: Pramey Kabra - kabrap@purdue.edu - -[NOTE] -==== -Use the link below to give your favorite seminar TAs a shout-out, and tell us how they helped you learn at The Data Mine! 
- -https://forms.office.com/r/mzM3ACwWqP -==== - -xref:fall2023/logistics/101_TAs.adoc[[.custom_button]#TDM 101#] - -xref:fall2023/logistics/201_TAs.adoc[[.custom_button]#TDM 201#] - -xref:fall2023/logistics/301_TAs.adoc[[.custom_button]#TDM 301#] - -xref:fall2023/logistics/401_TAs.adoc[[.custom_button]#TDM 401#] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2023/logistics/timecomplexities.txt b/projects-appendix/modules/ROOT/pages/fall2023/logistics/timecomplexities.txt deleted file mode 100644 index d4842c9df..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2023/logistics/timecomplexities.txt +++ /dev/null @@ -1,57 +0,0 @@ -=== Question 5 (2 pts) -[upperalpha] -.. Write a list of the big O time complexities for each of the functions below. -.. Write 2-3 sentences explaining why it is difficult to determine the time complexity of a function without knowing the time complexities of the sub-functions inside it. - -Let's briefly discuss time complexity, a concept of huge importance in computer science. While an in-depth discussion of time complexity is outside the scope of this course, a working understanding of its core concepts is an invaluable skill to have in this modern age. - -Time complexity, at heart, is a measure of the efficiency of our program. This may be best described through an example, but if you feel the need for a more formal explanation, https://towardsdatascience.com/understanding-time-complexity-with-python-examples-2bda6e8158a7[here] is some reading that may appeal to you. - -Now for our example. Let's say we create a function to sort a list of numbers in ascending order, via some arbitrary algorithm (there are many that would work). When we give the function a list of 4 numbers to sort, it takes 8 seconds. Then, when we double the size of the list the amount of time doubles (8 numbers, 16 seconds). If we double the size again, the amount of time doubles again (16 numbers, 32 seconds). Because the amount of time the function takes to run is linearly correlated with the size of the list it is given, we say it runs in "linear time". If, however, doubling the size of the list instead quadrupled the amount of time it took to run, we would say it runs in "quadratic time", as the relationship between the size of the argument and the time it takes to run is quadratic. - -When people discuss time complexity, you will often hear them use terms like "Big/Little O", "Big/Little Theta", and "Big/Little Omega". These are all different ways of describing the time complexity of a function, and are all related to one another. For the purposes of this course, we will be using "Big O" exclusively, as it is the most common and easiest to understand. - -Big O time complexity is defined as the worst possible time complexity that the function can have, given _n_ arguments (This is an oversimplification of the rather complex mathematical definition of Big O). For example, if we have a function that takes a list of numbers and returns the largest number in the list, we would say that it has a time complexity of O(n), as the worst possible time complexity it could have is linear (if the largest number is the last number in the list, we have to go through the whole list before finding it). - -After reading the above example and article, feel free to do some more research on your own. Then, for each of the below functions, give the Big O time complexity of each function. 
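-
-As a quick illustration (this is just a worked example, not one of the graded functions below), consider a function that visits every element of its input exactly once: doubling the length of the list roughly doubles the running time, so its Big O time complexity is O(n).
-
-[source,ipython]
-----
-# worked example only -- not one of the graded functions below
-def sum_list(num_list):
-    total = 0
-    for num in num_list:   # one pass over the list, so the work grows linearly: O(n)
-        total = total + num
-    return total
-----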
If you are unsure, feel free to ask on Piazza, or do some more research on your own. If you are still unsure, make your best guess and explain your reasoning.
-
-[NOTE]
-====
-You may assume that any sub-functions used within a function have a constant time complexity (i.e. O(1)), and thus will not affect the overall time complexity of the function.
-====
-
-[source,ipython]
-----
-def find_first_three(num_list):
-    # indexing a list is a constant-time operation, regardless of list length
-    return num_list[0], num_list[1], num_list[2]
-----
-
-[source,ipython]
-----
-def find_biggest(num_list):
-    biggest = num_list[0]  # start from the first element rather than a magic sentinel value
-    for i in range(1, len(num_list)):
-        if num_list[i] > biggest:
-            biggest = num_list[i]
-    return biggest
-----
-
-[source,ipython]
-----
-def double_looping(num_list):
-    # nested loops: the inner loop runs once for every iteration of the outer loop
-    for i in range(0, len(num_list)):
-        for j in range(0, len(num_list)):
-            print(f"Loop {i}: {num_list[j]}")
-    return
-----
-
-For the last portion of this question, think about what happens when a function calls a sub-function and how this might affect time complexity. Sub-functions also have a time complexity. How does this factor into the overall complexity of the main function? In order to get credit for this portion, write 2-3 sentences explaining why it is difficult to determine the time complexity of a function without knowing the time complexities of the sub-functions inside it.
-
-Going forward, try to keep the time complexity of the functions you are writing in mind as you write them. While this is not _as_ important in data science, since many of the functions you write are designed to be run only a few times, it can still greatly improve the speed of your code, saving you time on your project (in addition to being a skill employers value!).
-
-.Items to submit
-====
-- The time complexities of the 3 functions above.
-- A 2+ sentence explanation about how the time complexity of a function is affected by the time complexities of the sub-functions it calls.
-====
-
diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project1.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project1.adoc
deleted file mode 100644
index 778907ffc..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project1.adoc
+++ /dev/null
@@ -1,267 +0,0 @@
-= TDM 10100: R Project 1 -- 2024
-
-**Motivation:** The goal of this project is to get you comfortable with the basics of operating in Jupyter notebooks as hosted on Anvil, our computing cluster. If you don't understand the code you're running/writing at this point in the course, that's okay! We are going to go into detail about how everything works in future projects.
-
-**Context:** You do not need any background to get started in The Data Mine. If you have never programmed before, that is totally OK! If you have never worked with data before, that is totally OK too! We will spend most of our time learning how to work with data, in a practical way.
-
-**Scope:** Anvil, Jupyter Lab, Jupyter Notebooks, R, bash
-
-.Learning Objectives:
-****
-- Learn to create Jupyter notebooks
-- Gain proficiency manipulating Jupyter notebook contents
-- Learn how to upload/download files to/from Anvil
-- Read data using R
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
- -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/airports.csv` -- `/anvil/projects/tdm/data/icecream/breyers/reviews.csv` -- `/anvil/projects/tdm/data/icecream/bj/products.csv` - -== Questions - -=== Question 1 (2 pts) - -First and foremost, welcome to The Data Mine! We hope that throughout your journey with us, you learn a lot, make new friends, and develop skills that will help you with your future career. Throughout your time with The Data Mine, you will have plenty of resources available should you need help. By coming to weekly seminar, posting on the class Piazza page, and joining Dr. Ward and the TA team's office hours, you can ensure that you always have the support you need to succeed in this course. - -The links to Piazza are: - -https://piazza.com/class/lwyrxitz6bj3gy[TDM 10100 Piazza link] - -https://piazza.com/class/lwys5syg79ywu[TDM 20100 Piazza link] - -https://piazza.com/class/lwys6tkokqq1in[TDM 30100 Piazza link] - -https://piazza.com/class/lwys7dwijm11um[TDM 40100 Piazza link] - -Dr Ward is also available on Monday mornings in the Hillenbrand dining court from 8:30 AM to 11:20 AM. He is also available on Monday afternoons during 4:30 PM to 5:20 PM on Zoom at https://purdue-edu.zoom.us/my/mdward/[https://purdue-edu.zoom.us/my/mdward/] - -[IMPORTANT] -==== -If you did not (yet) set up your 2-factor authentication credentials with Duo, you can set up the credentials here: https://the-examples-book.com/setup[https://the-examples-book.com/setup] If you are still having issues with your ACCESS ID, please send an email containing as much information as possible about your issue to datamine-help@purdue.edu -==== - -Let's start off by starting up our first Jupyter session on https://www.rcac.purdue.edu/compute/anvil[Anvil]! We always use the URL https://ondemand.anvil.rcac.purdue.edu[https://ondemand.anvil.rcac.purdue.edu] and the ACCESS username that you were assigned (when you setup your account) and the ACCESS password that you chose. These are *NOT* the same as your Purdue account! - -[IMPORTANT] -==== -These credentials are not the same as your Purdue account. -==== - -In the upper-middle part of your screen, you should see a dropdown button labeled `The Data Mine`. Click on it, then select `Jupyter Notebook` from the options that appear. If you followed all the previous steps correctly, you should now be on a screen that looks like this: - -image::f24-101-p1-1.png[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"] - -If your screen doesn't look like this, please try and select the correct dropdown option again or ask one of the TAs during seminar for more assistance. - -[NOTE] -==== -Jupyter Lab and Jupyter Notebook are _technically_ different things, but in the context of seminar we will refer to them interchangeably to represent the tools you'll be using to work on your projects. -==== - -There are a few key parts of this screen to note: - -- Allocation: this should always be cis220051 for The Data Mine -- Queue: again, this should stay on the default option `shared` unless otherwise noted. -- Time in Hours: The amount of time your Jupyter session will last. When this runs out, you'll need to start a new session. It may be tempting to set it to the maximum, but our computing cluster is a shared resource. This means every hour you use is an hour someone else can't use, so please *ONLY* reserve it for 1-2 hours at a time. 
-- CPU cores: CPU cores do the computation for the programs we run. It may be tempting to request a large number of CPU cores by setting this to the maximum, but our computing cluster is a shared resource. This means every computational core that you use is one that someone else can't use. We only have a limited number of cores assigned to our team, so please *ONLY* reserve 1 or 2 cores at a time, unless the project needs more cores.
-
-[IMPORTANT]
-====
-The Jupyter Lab environment will save your work at regular intervals, so that at the end of a session, your work should be automatically saved. Nonetheless, you can select File from the menu and Save Notebook any time that you want. (This is not strictly necessary, because the notebooks save automatically, but you can still save manually any time you want.)
-====
-
-With the key parts of this screen explained, go ahead and select 1 hour of time with 1 CPU core and click Launch! After a bit of waiting, you should see something like below. Click Connect to Jupyter and proceed to the next question!
-
-image::f24-101-p1-2.png[Launch Jupyter Lab, width=792, height=500, loading=lazy, title="Launch Jupyter Lab"]
-
-[IMPORTANT]
-====
-You likely noticed a short wait before your Jupyter session launched. This happens while Anvil finds and allocates space for you to work. The more students are working on Anvil, the longer this will take, so we suggest starting your projects early in the week to avoid any last-minute hiccups causing a missed deadline. *Please do not wait until Fridays to complete and submit your work!*
-====
-
-Download the project template, as described here: https://the-examples-book.com/projects/templates[https://the-examples-book.com/projects/templates]
-
-We give some information about kernels here: https://the-examples-book.com/projects/kernels
-
-For the first question in this project, let's try the first example from the kernel page: we will load the airports data set in Python and display the head of the data set. (Most of our focus will be on R this semester, but we wanted to demonstrate one example in Python.)
-
-[source, python]
-----
-import pandas as pd
-myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/airports.csv")
-myDF.head()
-----
-
-Just try this Python code using the `seminar` kernel (not the `seminar-r` kernel) and make sure that you can see the first five rows of the airports data frame.
-
-++++
-
-++++
-
-++++
-
-++++
-
-.Deliverables
-====
-- Use Python to show the output with the first five rows of the airports data frame.
-- Be sure to document your work from Question 1, using some comments and insights about your work.
-====
-
-=== Question 2 (2 pts)
-
-As you continue to get comfortable with Jupyter Lab, you might want to https://the-examples-book.com/starter-guides/tools-and-standards/jupyter[read more about Jupyter Lab] (this is optional). We want you to get comfortable with switching kernels in Jupyter Lab when needed. The different options that you see (like the `seminar` kernel and the `seminar-r` kernel) in the upper right-hand corner of the screen https://the-examples-book.com/projects/kernels[are called kernels] (please read the kernel documentation; this is the same as the documentation from Question 1).
-
-When you first open the template, you may get a pop-up asking you to select what kernel you'll be using. Select the `seminar` kernel (not the `seminar-r` kernel).
If you do not get this pop-up, you can also select a kernel by clicking on the upper right part of your screen that likely says something similar to `No Kernel`, and then selecting the kernel you want to use. - -Use the `seminar` kernel with R, and `%%R` cell magic, to (again) display the first six lines of the airports data frame, but this time in R: - -[source,R] ----- -%%R -myDF <- read.csv("/anvil/projects/tdm/data/flights/subset/airports.csv") -head(myDF) ----- - -Now do this again, using the `seminar-r` kernel with R, and notice that you do *NOT* need the `%%R` cell magic with the `seminar-r` kernel. You can do all of this in the same Jupyter Lab notebook, just by changing the kernel. - -[source,R] ----- -myDF <- read.csv("/anvil/projects/tdm/data/flights/subset/airports.csv") -head(myDF) ----- - -You have now loaded the first six lines of the airports data frame in three ways (once in Question 1, and now a second and a third time in Question 2). - -A Jupyter notebook is made up of `cells`, which you can edit and then `run`. There are two types of cells we'll work in for this class: - -- Markdown cells. These are where your writing, titles, sections, and paragraphs will go. Double clicking a markdown cell puts it in `edit` mode, and then clicking the play button near the top of the screen runs the cell, which puts it in its formatted form. More on this in a second. For now, just recognize that most markdown looks like regular text with extra characters like `#`, `*`, and `-` to specify bolding, indentation font, size, and more! -- Code cells. These are where you will write and run all your code! Clicking the play button will run the code in that cell, and the programming language is specified by the language or languages known by the kernel that you chose. - -*For each question in The Data Mine*, please always be sure to put some comments after your cells, which describe all of the work that you are doing in the cells, as well as your thinking and insights about the results. - -[NOTE] -==== -Some common Jupyter notebooks shortcuts: - -- Instead of clicking the `play button`, you can press ctrl+enter (or cmd+enter on Mac) to run a cell. -- If you want to run a cell and then move immediately to the next cell, you can use shift+enter. This is oftentimes more useful than ctrl+enter -- If you want to run the current cell and then immediately create a new code cell below it, you can press alt+enter (or option+enter on Mac) to do so. -- When a cell is selected (this means you clicked next to it, and it should show a blue bar to its left to signify this), pressing the `d` key twice will delete that cell. -- When a cell is selected, pressing the `a` key will create a new code cell `a`bove the currently selected cell. -- When a cell is selected, pressing the `b` key will create a new code cell `b`elow the selected cell -==== - -++++ - -++++ - -++++ - -++++ - -.Deliverables -==== -- Use R to show the output with the first six rows of the airports data frame, and do this two ways: once using R with the `seminar` kernel, and once using R with the `seminar-r` kernel. -- Be sure to document your work from Question 2, using some comments and insights about your work. -==== - -=== Question 3 (2 pts) - -Which state has the largest number of airports? How many airports are located in that state? We can refer to one column of a data set by using the dollar sign and the name of the column. 
For instance, in the airports data set, the state where the airport is located is found in the column `state`, which we can access as `myDF$state`. The `table` function and the `sort` function can be helpful for answering this question.
-
-We can get the counts of airports in each state:
-
-[source,R]
-----
-table(myDF$state)
-----
-
-and we can put the results into numerical order:
-
-[source,R]
-----
-sort(table(myDF$state))
-----
-
-++++
-
-++++
-
-
-.Deliverables
-====
-- Use R to show how many airports are found in each state, first in alphabetical order (which is the default), and then in sorted order. You are welcome to work in R and use the `seminar-r` kernel.
-- Be sure to document your work from Question 3, using some comments and insights about your work.
-====
-
-=== Question 4 (2 pts)
-
-In the ice cream products data set:
-
-`/anvil/projects/tdm/data/icecream/combined/products.csv`
-
-each product is represented just one time. How many times does each brand occur in the `products.csv` data set?
-
-In the ice cream reviews data set, on the other hand, there are thousands of reviews of each product:
-
-`/anvil/projects/tdm/data/icecream/combined/reviews.csv`
-
-How many times does each brand occur in the `reviews.csv` data set?
-
-Your work will be similar to the work from Question 3. Be sure to document each question with comments about your work.
-
-++++
-
-++++
-
-.Deliverables
-====
-- Use R to show how many times each brand of ice cream appears in each of the two files indicated above.
-- Be sure to document your work from Question 4, using some comments and insights about your work.
-====
-
-=== Question 5 (2 pts)
-
-Use the `plot` command to display the number of times that each brand occurs in the ice cream `products.csv` data set from Question 4.
-
-Then make a second `plot` that displays the number of reviews for each brand in the ice cream `reviews.csv` data set.
-
-As always, be sure to document your work.
-
-++++
-
-++++
-
-.Deliverables
-====
-- Use R to make two plots, illustrating how many times each brand of ice cream appears in the two (respective) data sets with ice cream data.
-- Be sure to document your work from Question 5, using some comments and insights about your work.
-====
-
-== Submitting your Work
-
-Congratulations! Assuming you've completed all the above questions, you've just finished your first project for TDM 10100! If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, or during office hours.
-
-Prior to submitting your work, you need to put your work xref:templates.adoc[into the project template], re-run all of the code in your Jupyter notebook, and make sure that the results of running that code are visible in your template. Please check the xref:submissions.adoc[detailed instructions on how to ensure that your submission is formatted correctly]. To download your completed project, you can right-click on the file in the file explorer and click 'download'.
-
-Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don't lose any points. At the bottom of each 101 project, you will find a comprehensive list of all the files that need to be submitted for that project. We hope your first project with us went well, and we look forward to continuing to learn with you on future projects!!
-
-.Items to submit
-====
-- firstname_lastname_project1.ipynb
-====
-
-[WARNING]
-====
-You _must_ double check your `.ipynb` after submitting it in gradescope.
A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission].
-
-You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.
-====
diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project10-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project10-teachinglearning-backup.adoc
deleted file mode 100644
index 7e8c0bcaa..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project10-teachinglearning-backup.adoc
+++ /dev/null
@@ -1,231 +0,0 @@
-= TDM 10100: R Project 10 -- 2024
-
-**Motivation:** In Project 9 we introduced the concept of outside libraries and packages in R, learned how to import a new library, and used some basic functions from libraries we imported. This project is a direct extension of that work, where we focus specifically on two of the most-used R libraries: `tidyr` and `dplyr` (pronounced dee-plier), the paradigms that come with each, and some commonly-used functions that they contain.
-
-**Context:** A good understanding of libraries, import statements, and R syntax will be important for this project.
-
-**Scope:** dplyr, tidyr, libraries, R
-
-.Learning Objectives:
-****
-- Learn what "tidy" data is and why it's useful
-- Use basic `tidyr` utilities to process and reformat data
-- Learn about `dplyr` mutations
-- Manipulate data and reimplement `apply()` processes using `dplyr`
-- Combine `dplyr` and `tidyr` utilities to "make a dataframe tidy"
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-This project will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/beer/beer.csv`
-
-== Questions
-
-=== Question 1 (2 pts)
-
-As you've likely observed throughout this semester, much of data science consists of creating new columns or reshaping data so that it is easier to analyze, which means a lot of time is sunk purely into data pre-processing. `tidyr` puts forward an answer to this issue: a standardized way to reformat data so that pre-processing and data analysis are as hands-off and easy as possible.
-
-While we could continue to provide our own perception of what "tidy data" is, who better to explain it to you than the authors of tidyr? Read through https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html[this fantastic article] and, in a markdown cell, answer the following questions in your own words:
-
-. What is tidy data and why is it important?
-. What is the difference between an observation and a variable?
-
-Additionally, read the data at `/anvil/projects/tdm/data/beer/beers.csv` into a new data.table called `beers_dat` (you may need to review the previous project for a reminder on how to use the data.table library). Print the head of that table.
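-
-A minimal sketch of this step (assuming the file path given in this question and the `data.table` package introduced in the previous project) might look like the following:
-
-[source, R]
-----
-library(data.table)
-
-# read the beer data into a data.table and peek at the first few rows
-beers_dat <- fread("/anvil/projects/tdm/data/beer/beers.csv")
-head(beers_dat)
-----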
- -Finally, add to your existing markdown cell a few sentences (at least 1-2) detailing what the variables are in the provided data, what the observations are in the provided data, and any suggestions you can think of to make the data more "tidy". Don't make any of these changes, but take a few minutes to think about how you might restructure the data on your own. - -.Deliverables -==== -- A few sentences on what makes data "tidy" and why that matters -- A few sentences on the difference between an observation and a variable -- A few sentences detailing the variables, observations and how you could tidy the data you read in. -- A new data.table, `beers_dat`, created from the data in the specified `beers.csv` file -==== - -=== Question 2 (2 pts) - -One of the most common uses for `tidyr` is the reshaping and combination of rows and columns in our dataframes/data.tables. Let's take a look at some useful functions provided by `tidyr` and then apply them to make our data a bit more concise. - -Firstly, let's take a look at the https://www.rdocumentation.org/packages/tidyr/versions/0.8.2/topics/unite[`unite()`] function. `unite()` takes two columns and "unites" them, often with a separator character between the two that makes it easy to distinguish where the values for one column begin and the others end. Here's a short example: - -[source, R] ----- -library(tidyr) -# if I have two columns in a data.table called images_dat, color and brightness, and want to combine them so that each row is of the form color:brightness, I could do the following. - -unite(images_dat, "color_brightness", c("color","brightness"), sep=":") ----- - -[NOTE] -==== -The opposite of `unite()` is https://tidyr.tidyverse.org/reference/separate.html[`separate()`]. While we won't use it in this question, you may make a mistake and want to separate your columns. The notation is generally of a form similar to `separate(dat, "newcol", c("col1", "col2"), sep=";")`. -==== - -To solve this question, we want you to accomplish two main tasks: - -. Firstly, use `unite()` to merge the state and country columns into a new column called `state_column`, using an underscore "_" as the separator character -. Next, print the `head()` of your modified `beers_dat` table - -.Deliverables -==== -- The results of merging the state and country columns -==== - -=== Question 3 (2 pts) - -Another oft-used `tidyr` function is https://tidyr.tidyverse.org/reference/pivot_longer.html[pivot_longer()], which can be used to efficiently collapse multiple columns down to easy-to-evaluate key-value pairs. Please briefly review the previously linked documentation, paying special attention to the example provided towards the bottom of the page in order to best grasp how `pivot_longer()` is used. - -[NOTE] -==== -The opposite of `pivot_longer()` is https://tidyr.tidyverse.org/reference/pivot_wider.html[pivot_wider()]. Again, we won't be using it in this question, and its generally not used on tidy data as it typically contributes to "untidiness", but it is still a useful function to be aware of in case you ever want to reformat or drastically restructure your data. -==== - -. Let's look at our data's structure. Most of this data is devoted to attributes of the different beers, but there are three columns that deal with more brewery-specific details: `brewery_id`, `availability` and `retired`. 
Use pivot_longer on the `availability` and `retired` columns, pivoting so that there is a new column called `status_category` that contains the names (either "availability" or "retired"), and a new column called `status` that contains the values (e.g. "Rotating", "Year-round", "f").
-. Using https://www.statmethods.net/management/subset.html[`subset()`], create a new dataframe called `bar_logistics` that contains the `id` and `brewery_id` columns as identifiers, along with the `status_category` and `status` columns that you just created. Then use `pivot_wider` to restore the `status_category` and `status` columns to their separate `availability` and `retired` columns.
-. Finally, drop `brewery_id`, `status_category`, and `status` from your original `beers_dat` dataframe, as we are now storing them in their own, separate table. (Hint: this can also be done using `subset()`)
-
-[IMPORTANT]
-====
-If this isn't your first foray into the `tidyverse` (the group of libraries built around `tidyr`, curated largely by Dr. Hadley Wickham), you may be familiar with the https://tidyr.tidyverse.org/reference/gather.html[`gather()`] function and think that it largely resembles `pivot_longer()`. This is because `pivot_longer()` is essentially a newer, better form of `gather()`, and since it is encouraged over `gather()`, it's what we'll focus on here.
-====
-
-Finally, print the head of your resulting data and observe the results. You should see something similar to the following:
-
-image::f24-101-p10-1.png[Pivoted Data Table, width=792, height=500, loading=lazy, title="Pivoted Data Table"]
-
-[NOTE]
-====
-Typically a pivot like this is made because we want to separate our data. As you now know, modularity is a key concept when it comes to tidy data: different concepts and groups of data should be kept in separate tables. The motivations behind this are complex and many: it simplifies visual analysis to limit columns, makes operators that work on an entire table at once easier to use, and can help us to separate confidential data from non-confidential data.
-====
-
-[IMPORTANT]
-====
-While this question is technically doable using different methods, only `pivot()` functions like `pivot_longer()` or `pivot_wider()` will be accepted for full credit.
-====
-
-.Deliverables
-====
-- A new table `bar_logistics` containing `status_category`, `status`, `id`, and `brewery_id`
-- Your original `beers_dat` table, with the brewery-specific items dropped out
-====
-
-=== Question 4 (2 pts)
-
-Now, with our data a little more separated and easier to handle, we can move into our second library of focus for this project: https://dplyr.tidyverse.org/[`dplyr`]! Arguably the two most important utilities provided by `dplyr` are `group_by()` and `mutate()`, although `filter()` and others are also commonly used. In the next two questions, we'll cover these three functions in detail.
-
-Before we dive into `group_by()`, let's quickly cover something you likely already saw in the documentation: `%>%` piping. Piping allows us to write cleaner code by taking the output of one function and then using that as the input to another function. Sounds simple enough, right? It can get pretty complicated, but as long as you break down each step, piping is a fantastic tool. Take a look at the below example:
-
-[NOTE]
-====
-One small thing to notice is that the `%>%` pipe is not actually part of base R; it is provided by the `magrittr` package and re-exported by `dplyr`.
Be sure to import `dplyr` using `library(dplyr)` before attempting to use pipes in your project.
-====
-
-[source, R]
-----
-library(dplyr)
-
-# prints hello world
-"Hello World!" %>% print()
-
-# generate a list of 20 random numbers between 1-10, then print the mean
-sample(1:10, 20, replace = TRUE) %>%
-    mean() %>%
-    cat("is the average of our list")
-----
-
-[IMPORTANT]
-====
-If you are piping input to a function that takes multiple arguments, the input you pipe in will be placed before the other arguments you pass to the function (see the above example).
-====
-
-`group_by()` is a rather on-the-nose name. It creates a "grouped table", which allows us to perform operations on each group separately. For example, if we had a dataset of different cars and their gas mileage, we could first group the cars by type of car (e.g. sedan, SUV, pickup) and then get the average gas mileage for each type of car. The possibilities here are dauntingly large, so we'll just cover some basic uses and give you more space to develop your own methodologies in future projects and questions.
-
-Below is an example where we take our `beers_dat` table, group by state-country pair, and then get the average abv for each pair (and sort highest to lowest).
-
-[source, R]
-----
-library(dplyr)
-
-avg_abv_by_statecountry <- beers_dat %>% # pass in our beer data
-    group_by(state_country) %>% # groups by state-country pair
-    summarise(avg_abv_by_statecountry = mean(abv, na.rm = TRUE)) %>% # gets average abv for each group
-    arrange(desc(avg_abv_by_statecountry)) # sort by highest to lowest average abv
-
-# prints the head of our data
-head(avg_abv_by_statecountry)
-----
-
-For this question, we'd like you to first group your `beers_dat` data by the `style` category and then get the average abv for each style. Your final answer should include both a table of the average values by style called `avg_abv_by_style`, and also a print statement of the top 5 styles by abv (this can just be a head of the `avg_abv_by_style` table, if you sorted it).
-
-If you're struggling with how to go about this problem, pay special attention to the above example. It is **extremely** similar to what you're being asked to do here.
-
-[NOTE]
-====
-In case you hadn't yet Googled it, _abv_ stands for _alcohol by volume_, and is a common measure of the strength of an alcoholic beverage.
-====
-
-.Deliverables
-====
-- A new table called `avg_abv_by_style`
-- The top 5 strongest styles of beer by ABV
-====
-
-=== Question 5 (2 pts)
-
-Your data-analysis skills have already increased tenfold since the beginning of this project, so we'll add just two more tools to your skillset to cap things off.
-
-Firstly, there is one of the most important utilities that `dplyr` provides: `mutate()`. `mutate()` makes it easy to apply a function over entire columns of data at once, similar to `apply()` but with slightly different (and, I'd argue, better) syntax and utility.
-
-Secondly, there is the `filter()` function, which filters a dataframe down to only the rows that meet specific conditions.
-
-Read through and run the code below on your own, paying specific attention to the intermediate results returned by `mutate()` and `filter()`, respectively.
- -[source, r] ----- -# NOTE: This code builds off of the results of the example code provided in the previous problem - -# Merge average abv back into original dataframe -beers_dat <- merge(beers_dat, avg_abv_by_statecountry, by = "style", suffixes = c("", "_style")) - -# Calculate the mean difference for each beer -beers_dat <- beers_dat %>% - mutate(mean_difference = abv - avg_abv_by_statecountry) - -# Filter for beers with an abv at least 10 above the average for that style -filtered_data <- beers_dat %>% - filter(mean_difference >= 10) ----- - -To complete this question, we want you to create a new column in your dataset called mean_difference that is the difference in abv between each beer and the average abv for that style of beer. You can think of this in three steps. First, use `merge()` to merge your `avg_abv_by_style` table from the last question into your `beers_dat` as a new column using the same name (Hint: we did this in the last project). Then, use `mutate()` to create your new column, `mean_difference`, that is the difference in abv between each beer and the average abv for that style of beer. Next, use `filter()` to filter only for beers with an abv at least 50 above the average for that style. Finally, print the resulting filtered data. - -[NOTE] -==== -To check your answers, the resulting filtered data should contain 16 entries (viewable using `count()`). -==== - -.Deliverables -==== -- The head of your data, filtered to have beers with an abv at least 50 above its style average -==== - -== Submitting your Work - -In finishing this project, you've successfully learned and applied some of the most prominently used functions in R for data analysis. While different libraries are largely used based on preference, common ones like `data.table`, `tidyr`, and `dplyr` are so useful that relatively any working R professional is familiar with them on even just a basic level. - -In the next project we'll slow down massively and introduce the `tidyverse`, a collection of common libraries including both of the ones we introduced in this project. We will spend the majority of the project focusing on date-time analysis, an important and difficult part of handling data in R, and some common tools that can help us ingest and analyze date-time data in various formats. - -.Items to submit -==== -- firstname_lastname_project10.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project10.adoc deleted file mode 100644 index 65b89164a..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project10.adoc +++ /dev/null @@ -1,380 +0,0 @@ -= TDM 10100: R Project 10 -- 2024 - -**Motivation:** Using R enables us to apply functions to many data sets in an efficient way. 
We can extract information in a straightforward way. - -**Context:** There are several types of `apply` functions in R. In this project, we learn about the `sapply` function, which is a "simplified" apply function. - -**Scope:** Applying functions to data. - -.Learning Objectives: -**** -- Learn how to apply functions to large data sets and extract information. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/*` (flights data) -- `/anvil/projects/tdm/data/election/itcont/*` (election data) -- `/anvil/projects/tdm/data/taxi/yellow/*` (yellow taxi cab data) - -We demonstrate the power of the apply family of functions. - -In this project, we walk students through these powerful techniques. - -== Questions - -=== Question 1 (2 pts) - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - - -We can calculate the number of flights starting from Indianapolis airport in 1990 as follows: - - -[source, r] ----- -library(data.table) -myDF <- fread("/anvil/projects/tdm/data/flights/subset/1990.csv") -myvalue <- table(myDF$Origin)['IND'] -myvalue -rm(myDF) ----- - -(We use the `rm` at the end, so that we do not keep this data frame in memory, during the remainder of the project.) - -Now we can replicate this work, using a function, as follows: - -[source, r] ----- -myindyflights <- function(myyear) { - myDF <- fread(paste0("/anvil/projects/tdm/data/flights/subset/", myyear, ".csv")) - myvalue <- table(myDF$Origin)['IND'] - names(myvalue) <- myyear - return(myvalue) -} ----- - -and we can test that we get the same results: - -[source, r] ----- -library(data.table) -myindyflights(1990) ----- - -Finally, we can use the `sapply` function to run this function on each year from 1987 to 2008. - -[source, r] ----- -library(data.table) -myresults <- sapply(1987:2008, myindyflights) ----- - -which yields the number of flights starting from Indianapolis airport in each year from 1987 to 2008. - -The total number of flights starting from Indianapolis altogether, during 1987 to 2008, is: - -[source, r] ----- -sum(myresults) ----- - -and the number of flights per year is: - -[source, r] ----- -plot(names(myresults), myresults) ----- - - -[NOTE] -==== -The data sets cover October 1987 through April 2008. So the data sets for 1987 and 2008 are smaller than you might expect, and that is OK. -==== - -.Deliverables -==== -- Plot the total number of flights starting from the Indianapolis airport during 1987 to 2008. -==== - - -=== Question 2 (2 pts) - -++++ - -++++ - -We replicate the work from Question 1, but this time, we keep track of all of the flights originating at every airport in the data set. - - -We make a function, very much like Question 1, but this time, we keep track of the full table of the counts of `Origin` airports, for all airports (not just for Indianapolis): - -[source, r] ----- -myflights <- function(myyear) { - myDF <- fread(paste0("/anvil/projects/tdm/data/flights/subset/", myyear, ".csv")) - myvalue <- table(myDF$Origin) - return(myvalue) -} ----- - -and we can test that this function works for the 1990 flights: - -[source, r] ----- -library(data.table) -myflights(1990) ----- - -Finally, we can use the `sapply` function to run this function on each year from 1987 to 2008. 
- -[source, r] ----- -library(data.table) -myresults <- sapply(1987:2008, myflights) ----- - -which yields the number of flights starting from each airport in each year, from 1987 to 2008. - -Now we can add up the number of flights across all of the years, as follows: - -[source, r] ----- -v <- unlist(myresults) -tapply(v, names(v), sum) ----- - -and the number of flights starting at each of the top 10 airports during the years 1987 to 2008 is: - -[source, r] ----- -dotchart(tail(sort(tapply(v, names(v), sum)), n=10)) ----- - - - -.Deliverables -==== -- Plot the total number of flights starting from each of the top 10 airports during 1987 to 2008. -==== - - -=== Question 3 (2 pts) - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -Now we follow the methodology of Question 1, but this time we obtain the total amount of the donations from Indiana during federal election campaigns. - -We can extract the total amount of the donations from Indiana during an election year as follows: - -[source, r] ----- -myindydonations <- function(myyear) { - myDF <- fread(paste0("/anvil/projects/tdm/data/election/itcont", myyear, ".txt"), quote="", select = c(10,15)) - names(myDF) <- c("state", "donation") - myvalue <- tapply(myDF$donation, myDF$state, sum)['IN'] - names(myvalue) <- myyear - return(myvalue) -} ----- - -and we can test this function by discovering how much money was donated from Indiana during the 1990 election cycle: - -[source, r] ----- -library(data.table) -myindydonations(1990) ----- - -Finally, we can use the `sapply` function to run this function on each election year (in other words, the even numbered years) from 1980 to 2018. - -[source, r] ----- -library(data.table) -myresults <- sapply( seq(1980,2018,by=2), myindydonations ) ----- - -which yields the total amount of money donated from Indiana during each election cycle from 1980 to 2018. - -The amount of money donated from Indiana per election cycle is: - -[source, r] ----- -plot(names(myresults), myresults) ----- - - - -.Deliverables -==== -- Plot amount of money donated from Indiana per election cycle from 1980 to 2018. -==== - -=== Question 4 (2 pts) - -++++ - -++++ - -Now we find the top 10 states according to the total amount of the donations from each state during the elections from 1980 to 2018. - -We can extract the total amount of all the donations from all of the states during an election year as follows: - -[source, r] ----- -mydonations <- function(myyear) { - myDF <- fread(paste0("/anvil/projects/tdm/data/election/itcont", myyear, ".txt"), quote="", select = c(10,15)) - names(myDF) <- c("state", "donation") - myvalue <- tapply(myDF$donation, myDF$state, sum) - return(myvalue) -} ----- - -and we can test this function by discovering how much money was donated from each state during the 1990 election cycle: - -[source, r] ----- -library(data.table) -mydonations(1990) ----- - -Finally, we can use the `sapply` function to run this function on each election year (in other words, the even numbered years) from 1980 to 2018. - -[source, r] ----- -library(data.table) -myresults <- sapply( seq(1980,2018,by=2), mydonations ) ----- - -which yields the total amount of money donated from each state during each election cycle from 1980 to 2018. 
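Before combining the years, it may help to glance at the structure of `myresults`. Because the set of states appearing in the donation files can differ from one election year to the next, `sapply` usually cannot simplify these per-year tables into a matrix, so `myresults` is normally a list with one named vector per election cycle; that is why we `unlist` it in the next step. A quick, optional sanity check (a sketch only, not required for the question):

[source, r]
----
# optional structure check: myresults is usually a list of named
# vectors, one per election year in seq(1980, 2018, by=2)
class(myresults)       # typically "list"
length(myresults)      # 20 election years: 1980, 1982, ..., 2018
head(myresults[[1]])   # state-by-state totals for the first year, 1980
----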
- -Now we can add up the amount of donations in each state, across all of the years, as follows: - -[source, r] ----- -v <- unlist(myresults) -tapply(v, names(v), sum) ----- - -and the total amount of donations from each of the top 10 states across all election years 1980 to 2018 is: - -[source, r] ----- -dotchart(tail(sort(tapply(v, names(v), sum)), n=10)) ----- - - -.Deliverables -==== -- Plot the amount of money donated from each of the top 10 states altogether during 1980 to 2018. -==== - -=== Question 5 (2 pts) - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -In this last question, we find the total amount of money spent on taxi cab rides in New York City on each day of 2018. - - -We first extract the total amount of the taxi cab rides per day of a given month as follows: - -[source, r] ----- -myfares <- function(mymonth) { - myDF <- fread(paste0("/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2018-", mymonth, ".csv"), select=c(2,17)) - mytable <- tapply(myDF$total_amount, as.Date(myDF$tpep_pickup_datetime), sum) - return(mytable) -} ----- - -and we can test this function by discovering how much money was spent on each day in January: - -[source, r] ----- -library(data.table) -myfares("01") ----- - -Finally, we can use the `sapply` function to run this function on each month from `"01"` to `"12"`. - -[source, r] ----- -library(data.table) -myresults <- sapply( sprintf("%02d", 1:12), myfares ) ----- - -which yields the total amount of money spent on taxi cab rides each day. - -Now we can add up the amounts spent per day (sometimes there is overlap from month to month), as follows: - -[source, r] ----- -names(myresults) <- NULL -v <- do.call(c, myresults) -mytotals <- tapply(v, names(v), sum) -betterdates <- mytotals[year(as.Date(names(mytotals))) == 2018] ----- - -and the total amount of money spent on taxi cab rides during each day in 2018 is can be plotted as: - -[source, r] ----- -plot( as.Date(names(betterdates)), betterdates ) ----- - - -.Deliverables -==== -- Plot the total amount of money spent on taxi cab rides during each day in 2018. -==== - -== Submitting your Work - -Now you are familiar with the method of merging data from multiple data frames. - - -.Items to submit -==== -- firstname_lastname_project10.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project11-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project11-teachinglearning-backup.adoc deleted file mode 100644 index 8441f667c..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project11-teachinglearning-backup.adoc +++ /dev/null @@ -1,195 +0,0 @@ -= TDM 10100: R Project 11 -- 2024 - -**Motivation:** Time is extremely important to humans **and** data analysis, and us humans have created a _lot_ of different ways to represent times and dates. In this project we'll look at `lubridate`, one of R's most popular packages for handling date-time data and a part of the aforementioned `tidyverse`. We'll combine it with utilities we've looked at in other libraries previously, along with some different datasets and data formats, in order to really polish our data ingestion skills. - -**Context:** Understanding of common `dplyr` and `tidyr` functions, along with knowledge of base R syntax, will be vital for this project. Prioritizing vectorized operations will also be important to ensure good performance in your code. - -**Scope:** `lubridate`, `dplyr`, `tidyr`, `tidyverse`, data ingestion, R - -.Learning Objectives: -**** -- Learn a few basic date and time storage formats commonly used in data -- Learn how to process, manipulate, and analyze date-time data -- Apply basic visual analyses to processed date-time data. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/1999.csv` - -== Questions - -=== Question 1 (2 pts) - -Let's begin our exploration of handling dates and times with the most simple and fundamental step: reading them in and understanding their structure. - -Start by reading the dataset for this project into a structure called `flights_dat` (for `/anvil/projects/tdm/data/flights/1999.csv`). - -[NOTE] -==== -At this point, we're going to give you freedom on a lot of these questions. For example, you can choose to either use `fread()` to store your data in a `data.table`, `read.csv()` to store your data into a data.frame, or even `read_csv()` to store your data into a `tibble` (even though we haven't learned about those yet!). Generally , if we don't explicitly limit or request you do things a certain way, you will have the freedom to design and build your own unique solutions to problems going forward. -==== - -Print the head of the dataset. Take a look at any information associated with dates and times in the data. In particular, notice that we have dates of the format "YYYY-MM-DD" along with individual day, month, and year data. - -As a callback to concepts we learned early in the semester, index into the `FlightDate` column and get just the first entry. Print the type of that data using `class()`. - -Next, create a close-to-identical _looking_ bit of data by combining the first entries in the `DayofMonth`, `Month`, and `Year` columns. Don't use `lubridate` just yet. Print the type of this combination using `class`. Is it the same as the date column? (Hint: `paste0()` may be useful here.) - -You can see that it is in fact different! The bit of data from our full date column is of the `date` type, while your combined version is likely something of `character` type. 
This is an important distinction: when we're working with dates using the `date` type, we have access to all the functions that are specific to dates. Because `character`-type data doesn't have access to these functions, it will be much harder to work with. - -Luckily for us, `lubridate` has a ton of fantastic functions to process dates of various formats from `character` type into their special `date` type that is much easier than regular character data to work with. - -To finish this question, give https://lubridate.tidyverse.org/reference/ymd.html[this short docs page] a quick read, paying special attention ot the `ymd()` function. After applying it to your `character` date, print the `class()`. If successful, it should now read `Date`. - -.Deliverables -==== -- `1999.csv` data in a data structure called `flights_dat` -- The head of `flights_dat` -- The data type of the first date entry in `flights_dat$FlightDate` -- The data type of your version of the first date by combining columns -- The data type of your date, after applying `ymd()` -==== - -=== Question 2 (3 pts) - -So, as you likely gleaned from the linked documentation page, lubridate provides us with a bunch of different functions for parsing character data into date objects. Super useful! As you saw in the last problem, we already have a column labeled FlightDate with our dates in "YYYY-MM-DD" format, so we won't worry about converting the data into `lubridate` type as it has already been done for us. One very important aspect of date data is being able to measure the time between two events occurring. - -First off, try subtracting the first date in our `FlightDate` column from the second. As you can see, `date` types can have arithmetic performed on them just like other types can! - -Great. Next, we're going to start grouping and sorting our data, with an end goal of being able to create a new column that measures the number of days between planes arriving at a given airport. - -Let's create a new dataframe called `BRO_flights` that is a subset of `flights_dat` where the `Dest` column is set to "BRO" (this is the 3 letter code for Brownsville airport in Brownsville, TX). Next, use the https://dplyr.tidyverse.org/reference/arrange.html[`arrange()`] function from `dplyr` to sort our `BRO_flights` data by the `FlightDate` column (Hint: You can do this with the following form: `arrange(dataframe, coltosortby)`). - -Finally, let's make a small aside to learn about the https://dplyr.tidyverse.org/reference/transmute.html[`transmute()`] function from `dplyr`. `transmute()` is essentially the same as the `mutate()` function we learned about in the last project. However, `transmute()` only returns the results of whatever calculation we ask it to make. Another useful function is `lead()`, which allows us to get the "next value" in our data (i.e. the next row). Take a look at the below examples for a breakdown on `transmute()` and `lead()`. Run them yourself and get a good understanding of the output. 
- -[source, R] ----- -# Use transmute to get all the years, multiplied by 1000 -BRO_flights %>% - transmute(Year * 1000) %>% - head() - -# Use transmute to get each DayOfWeek minus the following DayOfWeek -BRO_flights %>% - transmute(as.numeric(DayOfWeek - lead(DayOfWeek))) - -# Does the exact same as above, but adds clear variable names -BRO_flights %>% - transmute(initDay=DayOfWeek, endDay=lead(DayOfWeek), dayDiff=as.numeric(initDay - endDay)) ----- - -To finish this question, write code that takes your `BRO_flights` dataframe, arranges it by `FlightDate`, and then creates a new dataframe called `BRO_intervals` using `transmute()` that has 3 columns: `initDate` (the current date in the row), `endDate` (the date in the next row. Hint: `lead()`), and `deltaDate`, the difference between `endDate` and `initDate`. Print the head of `BRO_intervals`. Some starter code has been provided for you below: - -[source, r] ----- -BRO_intervals <- BRO_flights %>% - mutate(FlightDate = as.Date(FlightDate)) %>% # converts column from Idate to Date - arrange('''Fill this in''') %>% # arrange by FlightDate - transmute('''Fill this in''') # Create new dataframe ----- - -.Deliverables -==== -- 2nd entry in `FlightDate` minus 1st entry in `FlightDate` -- `BRO_flights` dataframe -- A new dataframe, `BRO_intervals`, as described, with its `head()` printed -==== - -=== Question 3 (2 pts) - -The code we've just written is super useful. We can now easily find the number of days between flight arrivals at BRO airport. However, it would be nice to have more generalized utilities for this. In this question, create two new functions. - -The first, `intervalDFMaker()`, should take as input a 3-letter airport code and return as output a dataframe of the same structure as `BRO_intervals`, but for whatever airport the user provided. - -The second, called `intervalTableMaker()`, takes as input the three letter code associated with an airport and returns as output a table of the number of days between flights arriving at the given airport. - -Both of your functions can assume that the `flights_dat` already exists and is accessible. - -[NOTE] -==== -This should be very similar to code you wrote in the previous question with a few _small_ additions to it. -==== - -Run the below code. Your dataframe's head should be a bunch of flights of Jan 01, 1999, and your table should show that 0 occurred 297915 times and 1 occurred 364 times, signifying that not a single time in 1999 was there a day that a plane didn't arrive at O'hare airport. It's a busy place! - -[source, r] ----- -# test intervalDFMaker() -head(intervalDFMaker("ORD")) - -# test intervalTableMaker() -intervalTableMaker("ORD") ----- - -.Deliverables -==== -- A new function, `intervalDFMaker()`, as described above -- A new function `intervalTableMaker()`, as described above -==== - -=== Question 4 (3 pts) - -Let's finish this project by building on the functions we just made and creating a helpful summary table to compare all our airports at once. - -[NOTE] -==== -As a small reminder, `unique()` can be used to get a list of the unique values for destination airports in your data. Also, consider using `sapply()` instead of `lapply()`, as it makes the sorting of the returned value quite easy. -==== - -First, create a new function called `intervalAverageGetter()` that, given an airport code, gets the average of the deltaDate column in the dataframe returned by `intervalDFMaker`. 
Apply this function to each `Dest` value present in `flights_dat`, and return a sorted list of the average number of days between flights arriving at an airport. Some starter code has been provided for you below: - -[source, r] ----- -intervalAverageGetter <- function(airportCode) { - return() -} - -# testing code -# USE.NAMES=TRUE will make sure the airport code stays attached to its average -test_df <- sapply('''list of airport codes''', '''function''', USE.NAMES=TRUE) -head(sort(test_df, decreasing=TRUE)) ----- - -[IMPORTANT] -==== -If you're getting "NA" as the result of taking the average, make sure you're using `na.omit()` on your table before taking the average so that there are no `NA` values ruining our average calculation. -==== - -If you did everything correctly, you should see the following as the airports with the largest number of days between arrivals: - -- LFT: 7 -- DRO: 3.27927927927928 -- LWB: 2.2875 -- GUC: 1.71698113207547 -- MTJ: 1.22147651006711 -- DLG: 1.14556962025316 - -.Deliverables -==== -- The top 5 airports with the highest average number of days between flight arrivals. -==== - -== Submitting your Work - -Managing dates is a _crucial_ part of data analysis. With this project complete, you've successfully performed some deep and extremely useful transformations to our flight data, along with creating some functions that are highly variable and can provide utility across a wide range of different data. - -Going forward, think of functions as largely valued by two things: 1) how often they can be reused and 2) how much code they save you from writing. When you write functions that can be applied across a wide range of data, you create tools that are super great. Why throw away a hammer after hitting one nail with it, right? - -In the next two projects, we'll use everything we've learned so far to create beautiful visualizations using ggplot2, one of R's most valuable plotting libraries. I hope you enjoyed this project and look forward to seeing you all next week! - -.Items to submit -==== -- firstname_lastname_project11.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project11.adoc deleted file mode 100644 index eac8a5f45..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project11.adoc +++ /dev/null @@ -1,157 +0,0 @@ -= TDM 10100: R Project 11 -- 2024 - -**Motivation:** We continue to learn how to extract information from several files in R. - -**Context:** The `apply` functions in R allow us to gather and analyze data from many sources in a unified way. - -**Scope:** Applying functions to data. - -.Learning Objectives: -**** -- Learn how to apply functions to large data sets and extract information. 
-**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/*` (flights data) -- `/anvil/projects/tdm/data/election/itcont/*` (election data) -- `/anvil/projects/tdm/data/icecream/talenti/*` (ice cream data) - -We demonstrate the power of the apply family of functions. - -== Questions - -=== Question 1 (2 pts) - -For this question, only use the data corresponding to flights with `Origin` at the Indianapolis `IND` airport. - -a. Write a function called `monthlydepdelays` that takes a year as the input and uses `tapply` to return a table of length 12 with the average `DepDelay` for flights starting at `IND` in each of the 12 months of that year. - -b. Test your function individually, one at a time, on the years 1990, 1998, and 2005. For instance, if you run `monthlydepdelays(1990)`, the output should be something like: `1: 7.28277205677707 2: 9.49702660406886 3: 6.92484111633048 4: 4.94985835694051 5: 5.47148703956344 6: 6.01083547191332 7: 4.30737704918033 8: 5.63978201634877 9: 4.45558583106267 10: 4.47372488408037 11: 3.4083044982699 12: 9.76410531972058` - -.Deliverables -==== -- Write a function called `monthlydepdelays` that takes a year as the input and returns a table of length 12 with the average `DepDelay` for flights starting at `IND` in each of the 12 months of that year. -- Show the output of `monthlydepdelays(1990)` and `monthlydepdelays(1998)` and `monthlydepdelays(2005)`. -==== - - -=== Question 2 (2 pts) - -First run this command: - -`par(mfrow=c(3,2))` - -which tells R that the next 6 plots should appear in 3 rows and 2 columns. - -Then the sapply function to plot the results of `monthlydepdelays` for the years 1988 through 1993. - -Note: JupyterLab might print `NULL` values if you just run your `sapply` function by itself, but if you run question 2 like this, things should turn out OK: - -[source,r] ----- -par(mfrow=c(3,2)) -myresults <- sapply(1988:1993, function(x) plot(monthlydepdelays(x))) ----- - -.Deliverables -==== -- Make a 3 by 2 frame of 6 plots, corresponding to the results of `monthlydepdelays` in the years 1988 through 1993. -==== - - -=== Question 3 (2 pts) - -For this question, only use the data corresponding to donations from the state of Indiana. - -a. Write a function called `myindycities` that takes a year as the input and uses `tapply` to make a table of length 10, containing the top 10 cities in Indiana according to the sum of the amount of donations (in dollars) given in each city. - -b. Test your function individually, one at a time, on the years 1980, 1986, and 1992. For instance, if you run `myindycities(1984)`, the output should be something like: `FT WAYNE: 44665 TERRE HAUTE: 52650 CARMEL: 53200 EVANSVILLE: 65250 SOUTH BEND: 68387 INDPLS: 76520 FORT WAYNE: 80882 ELKHART: 93171 MUNCIE: 104260 INDIANAPOLIS: 511935` - - -.Deliverables -==== -- Write a function called `myindycities` that takes a year as the input and uses `tapply` to make a table of length 10, containing the top 10 cities in Indiana according to the sum of the amount of donations (in dollars) given in each city. -- Show the output of `myindycities(1980)` and `myindycities(1986)` and `myindycities(1992)`. -==== - -=== Question 4 (2 pts) - -a. 
Use the list apply function (`lapply`) to run the function `myindycities` on each of the even-numbered election years 1984 to 1994 as follows: - -[source,r] ----- -myresults <- lapply( seq(1984,1994,by=2), myindycities ) -names(myresults) <- seq(1984,1994,by=2) -myresults ----- - -b. Now use `par(mfrow=c(3,2))` and the sapply function too, but this time, make a `dotchart` for each entry in `myresults`. - -[TIP] -==== -Do not worry about the pink warning that appears above the plots. -==== - - -.Deliverables -==== -- Use `lapply` to show the results of `myindycities` for each even-numbered year from 1984 to 1994. -- Make a dotchart for each of the 6 years in part a. -==== - -=== Question 5 (2 pts) - -a. Find the average number of stars in each of these four files: - -`/anvil/projects/tdm/data/icecream/bj/reviews.csv` - -`/anvil/projects/tdm/data/icecream/breyers/reviews.csv` - -`/anvil/projects/tdm/data/icecream/hd/reviews.csv` - -`/anvil/projects/tdm/data/icecream/talenti/reviews.csv` - -b. Write a function `myavgstars` that takes a company name (e.g., either "bj" or "breyers" or "hd" or "talenti") as input, and returns the average number of stars for that company. - -c. Define a vector of length 4, with all 4 of these company names: - -[source,r] ----- -mycompanies <- c("bj", "breyers", "hd", "talenti") ----- -and now use the `sapply` function to run the function from part b that re-computes the values from part a, all at once, like this: - -[source,r] ----- -sapply(mycompanies, myavgstars) ----- - -.Deliverables -==== -- Print the average number of stars for each of the 4 ice cream companies. -- Write a function `myavgstars` that takes a company name (e.g., either "bj" or "breyers" or "hd" or "talenti") as input, and returns the average number of stars for that company. -- Use `sapply` to run the function from part b on the vector `mycompanies`, which should give the same values as in part a. -==== - -== Submitting your Work - -This project further demonstrates how to use the powerful functions in R to perform data analysis. - - -.Items to submit -==== -- firstname_lastname_project11.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. 
-==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project12-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project12-teachinglearning-backup.adoc deleted file mode 100644 index 4905e717c..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project12-teachinglearning-backup.adoc +++ /dev/null @@ -1,243 +0,0 @@ -= TDM 10100: R Project 12 -- 2024 - -**Motivation:** Up to this point, we've been hyperfocused on numbers and letters; we've learned how to handle, transform, store, and reformat data to do whatever we need. However, as important as data is, most people don't enjoy staring at spreadsheets and attempting to parse out patterns and meaning. That is why **visualization** is such a key aspect of data analysis. This project will be focused on using `ggplot2` to create clear plots of data trends, and prepare us for the more complex topics that we'll broach in project 13. - -**Context:** A working understanding of piping, `dplyr`, and common `tidyverse` utilities will be helpful for this project. - -**Scope:** Data visualization, `ggplot2`, `dplyr`, `tidyverse`, R - -.Learning Objectives: -**** -- Learn the basic structure of a `ggplot()` -- Create your first plot using `ggplot()` -- Learn about boxplots in `ggplot()` -- Explore common techniques for reformatting figures (i.e. axis scaling, faceting, etc.) -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/youtube/*` - -== Questions - -=== Question 1 (2 pts) - -As a first step, let's talk about the general structure when using `ggplot2`. You can think of a plot created with `ggplot()` as a series of layers that we add that determine the structure, style, and type of plot. For an example structure, along with a more generic version, see below: - -[source, R] ----- -# working example using real data -library(tidyverse) -a <- read_csv("/anvil/projects/tdm/data/youtube/USvideos.csv") - -# specify data to plot, with color coming from comment_count -ggplot(a, aes(x=views, y=comment_count, col=comment_count)) + - # create scatterplot - geom_point() + - # add color scaling based on comment count - scale_color_gradient2(low="yellow", mid="red", high="purple") + - ggtitle("Demonstration Plot") - -# --------------------------------------------------------------- - -# generic example using placeholder names -library(tidyverse) -df <- read_csv("/path/to/data.csv") - -# specify data to plot -ggplot(df, aes(x=column1, y=column2)) + - # create scatterplot - geom_point() + - # set x, y axis labels and Plot Title - xlab("Column1 Data") + - ylab("Column2 Data") + - ggtitle("Fake Plot") ----- - -[NOTE] -==== -For a more detailed and official description of how ggplot calls should be structured, check out this official https://ggplot2.tidyverse.org/articles/ggplot2.html["Getting Started" guide]. -==== - -Try removing specific lines from the working example provided above, swapping around the order of certain lines, or adding your own lines after reading through the linked documentation and see what happens! As I've noted before, experimentation is a _huge_ part of working with code, so don't be afraid to experiment, break things, kill your kernel and more! It's all steps towards a well-made end product. 
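For example, here is one hedged variation on the working example above (it assumes the same `a` data frame from that example has already been read in): the color-scale layer is removed and a built-in theme layer plus axis labels are added instead, illustrating how individual layers can be dropped or swapped without disturbing the rest of the plot.

[source, r]
----
# the same scatterplot as the demonstration above, but with the color
# gradient removed and a theme plus axis labels layered on instead
ggplot(a, aes(x = views, y = comment_count)) +
    geom_point() +
    theme_minimal() +
    xlab("Views") +
    ylab("Comment Count") +
    ggtitle("Demonstration Plot, Restyled")
----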
- -For this question, first read the data from `/anvil/projects/tdm/data/youtube/USvideos.csv` into a data structure called `US_vids`, and then use `ggplot2` to create a scatterplot of `likes` versus `comment_count` for the US videos data. You don't have to worry about specifying a color of any type, but you should be sure that your plot has appropriate axes and plot titles (Hint: Look at the above example!). - -.Deliverables -==== -- A new data structure called `US_vids` -- A scatterplot of `likes` vs. `comment_count` for the `US_vids` data -==== - -=== Question 2 (2 pts) - -Scatter plots are a great starting point for visualizing data, but oftentimes we are interested in visuals that provide us with some more immediate and obvious statistical data. For this purpose we have the _box plot_! If you aren't already familiar with box plots, feel free to https://en.wikipedia.org/wiki/Box_plot[take a look here] to learn more! - -`ggplot2` provides us with a function specifically for creating box plots: `geom_boxplot()`. Using the same general structure as we used in the last problem, construct a boxplot of the `comment_count` data from `US_vids`. It may look a bit nonsensical, but don't worry as we'll handle that in a second. - -[NOTE] -==== -We aren't asking you to make any separations on the X-axis on this problem, so you only have to specify the _y_ variable in your call to `ggplot()`. -==== - -As you can now see, the boxplot we've created is not very useful. This is because, rather predictably, a combination of the large majority of our data being very close to 0 and our data having a large range obscures any information we can glean from the plot. - -[IMPORTANT] -==== -You will likely see a warning messgae about `transformation introduced infinite values` when you attempt axis scaling in this next part of the question. This is a result of `0` values in our data, as `log10(0)` is undefined. This means that `0` values will be discluded from our scaled plot; not a dealbreaker, but certainly something to keep in mind going forward. -==== - -How can we fix this? _Axis scaling_! By default our plot's axes are scaled _linearly_, but oftentimes linear views of data can be rather useless. Luckily, `ggplot2` provides us with plenty of functions to easily change the scaling of our axes. Take a look at https://ggplot2.tidyverse.org/reference/scale_continuous.html[this docs page] detailing how to scale axes with `ggplot2()` (You can scroll down to the bottom to view examples). Using either `scale_y_continuous()` or one of its variants (i.e. `scale_log10()`), create a new boxplot where its a bit easier to discern the median and quartiles of the data. - -[IMPORTANT] -==== -Going forward, we may always include reminders to title your plot and axes appropriately. Remember that any plots included in your project should always have appropriate titles and, when necessary, a well-formatted legend. Plot formatting is a skill, and practicing it throughout these projects will help you hone it well. -==== - -.Deliverables -==== -- A boxplot of `comment_count` -- A boxplot of `comment_count` using axis scaling -==== - -=== Question 3 (2 pts) - -Build on box-and-whisker of likes or comments by splitting based on category_id - -Already we have a much more useful version of our plot than we originally did, simply by introducing axis scaling. However, _comparison_ is always our friend in data science and we have yet to do any! 
- -In this question, we want you to build on the boxplot you created in the last problem, this time providing a data source for the x axis to use. In project 6, questions 1-2, you were asked to write some code to introduce a new column, `category`, into your data based on the existing `category_id` column. Copy/paste that code in so that you create an equivalent column in this project. Then create a new figure using `ggplot()` that consists of a series of boxplots separeted by `category`. - -[NOTE] -==== -If you're having trouble getting your X axis labels to be readable and non-overlapping, try taking a look at https://stackoverflow.com/questions/42599953/ggplot-with-overlapping-x-axis-label[this] helpful stackoverflow post. -==== - -Finally, in a markdown cell, identify which category had the highest interquartile range (the "largest box") and which category had the highest maximum value for comment count. - -.Deliverables -==== -- Boxplots of `comment_count` separated by `category` -- A sentence detailing which category had the highest interquartile range which category had the highest max comment count -==== - -=== Question 4 (2 pts) - -We've got a good compairison between categories for U.S. YouTube videos now, and some ideas about what types of videos people comment on most in the U.S.. Do you think that trend holds internationally? Luckily, its now within our power to find out! - -Start by reading the data from `/anvil/projects/tdm/data/youtube/DEvideos.csv` into a new data structure called `DE_vids`. Modify the `category_id` mapping code you referenced in the last question to again map category IDs to names in your new dataset. You may assume that the `category_id` to `category` mapping is the same for both US and DE (so you can re-use the same list). - -Next, run your boxplot-by-category code using the DE data instead of the US data. Are the distributions similar to those of the U.S. data? - -In a markdown cell, list which category in the DE data had the largest interquartile range and which had the highest maximum value. Also note if either of these categories is different from those in the US. - -Finally, add a few sentences describing any major differences you see between the two distributions. Note outliers, interesting patterns, or large deviations you find between the two of them. Well-written responses with at least 2 sentences will be accepted. - -.Deliverables -==== -- A 'DE' version of your plot from the last problem -- A sentence or two comparing the category with the highest IQR (interquartile range) and maximum in the DE data to those in the US data -- A few sentences describing any major differences or patterns between the two distributions. -==== - -=== Question 5 (2 pts) - -The comparison we just made was already a huge step forward in our data analysis, but it sure is inconvenient to have to scroll up and down between the two plots like that. Luckily, `faceting` exists to help us put our different plots all on the same figure! - -As a basic example, here is how one could create a faceted boxplot by category. Compare this to question 3, where we just stuffed the categories onto the same axis. There are a lot of benefits to this approach! Try running it yourself to see how it works. 
- -[source, r] ----- -# filter just for 3-4 categories, for easy comparison -categories <- c("News and Politics", "Education", "Comedy", "Gaming") -filtered_vids <- filter(US_vids, category %in% categories) - -# create faceted boxplot -ggplot(filtered_vids, aes(y=comment_count)) + - geom_boxplot() + - scale_y_continuous(trans = "log10") + - facet_wrap(~ category, scales = 'free') ----- - -For a more complex look, take a glance at the below example, where I've used faceting to compare side-by-side the "Comedy" categories for the US and DE. For more information, take a look at https://stackoverflow.com/questions/57457872/how-to-use-ggplot-faceting-with-data-from-different-dataframes[this stackoverflow post] or https://stackoverflow.com/questions/32747808/facets-and-multiple-datasets-in-ggplot2[this one]. - -[source, r] ----- -# add a country identifier column to each dataset -US_vids$country <- 'US' -DE_vids$country <- 'DE' - -# merge our US and DE datasets -US_DE_vids <- bind_rows(US_vids, DE_vids) - -# filter for just comedy videos -US_DE_comedy <- filter(US_DE_vids, category == "Comedy") - -# use faceting to plot, separating different countries -ggplot(US_DE_comedy, aes(x = category, y = comment_count)) + - geom_boxplot() + - scale_y_continuous(trans = "log10") + - facet_wrap(~ country, scales = 'free') + - labs(title = 'Faceted Box Plots', x = 'Category', y = 'Comment Count') + - theme_minimal() ----- - -[IMPORTANT] -==== -If you want the category column to show up correctly, you will need to re-run the previous code using `match()` and your list of category-ID mappings in order to add it to your data. -==== - -To complete this question, choose at least one country other than US or DE from the list below, choose any category of video, and create a faceted boxplot figure with one boxplot for each country. The actual boxplot data can be anything you want, whether that's `comment_count`, `views` or something else. - -[IMPORTANT] -==== -Countries to choose from include: - -- US -- DE -- CA -- FR -- GB -- IN -- JP -- KR -- MX -- RU -==== - -[NOTE] -==== -If you're struggling with this question, take a look at the provided example where the 'US' and 'DE' comedy categories are compared. Your work should follow a very similar logic to this and in fact you can almost entirely complete this question using copy and paste from the example alone! -==== - -Then, in a markdown cell, write 3-4 sentences about observations you can make from your faceted plot. Compare and contrast the distributions between the different countries, and feel free to suggest some potential driving factors for the differences between each country. - -.Deliverables -==== -- A faceted boxplot figure comparing at least 3 different countries in some way -- 3-4 sentences of analysis on the differences between countries' distributions -==== - -== Submitting your Work - -With that, you have now completed our first project on data visualization in R! Hopefully, you can now see the utility of `ggplot2`, and how the structure of a `ggplot()` call makes it very easy to swap out different components of a plot or adjust layout without breaking the whole thing. - -In the next project, we'll go further in depth to the different types of plots available to you in `ggplot2`, and give you some freedom to explore and experiment with all the tools you've used throughout the semester. - -You're almost done with the class, and it has been an absolute privilege to get to work with you all this year. 
Please reach out if you need anything, and I look forward to seeing you next week! - -.Items to submit -==== -- firstname_lastname_project12.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project12.adoc deleted file mode 100644 index 2b71375a4..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project12.adoc +++ /dev/null @@ -1,138 +0,0 @@ -= TDM 10100: R Project 12 -- 2024 - -**Motivation:** Some files are too big to read into R. We practice methods for extracting the portions of a file that we need. - -**Context:** The `fread` function allows us to read a portion of a file (either a small number of rows, and/or a small number of columns). - -**Scope:** We learn how to work with enormous files in R, if we only need a portion of the file in our analysis. - -.Learning Objectives: -**** -- Learn how to apply functions to large data sets and extract information. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv` (Iowa liquor sales) -- `/anvil/projects/tdm/data/election/itcont2020.txt` (2020 election data) -- `/anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv` (Internet Movie DataBase (IMDB)) - -== Questions - -[WARNING] -==== -Please use 2 cores in your Jupyter Lab session for this project. -==== - -=== Question 1 (2 pts) - -First load: `library(data.table)` so that you have the `fread` function available, and also `options(repr.matrix.max.cols=50)` so that you can see 50 columns. - -If you try to read the entire file `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales.csv` then the `seminar-r` kernel will crash. - -Instead, use `fread` with the option `nrows=1000` (in part a) so that you can see (only) the first 1000 rows of the data frame. - -a. Which columns contain the `Store Number`, `Store Name`, `Address`, `City`, and `Zip Code`? (Use the option `nrows=1000` for part a.) - -b. Use the `select` option in `fread` to read the entire data set, but *only* those 5 columns of the data. What is the dimension of the data set? (Hint: it should now be 5 columns and more than 27 million rows. Another hint: Do NOT use `nrows=1000` for part b. Instead, use the `select` option that we learned in earlier projects.) - -.Deliverables -==== -- Which columns contain the `Store Number`, `Store Name`, `Address`, `City`, and `Zip Code`? -- What is the dimension of the data set? 
-==== - - -=== Question 2 (2 pts) - -Now let's find the most popular 10 stores (from all 27 million rows, and only the 5 columns in question 1) in two different ways: - -a. Use the `table` command to find the 10 most popular values of the `"Store Number"` - -b. Use the `table` command to find the 10 most popular values of the `"Store Name"` - -c. Use the `table` command to find the 10 most popular values of these three columns pasted together: `Address`, `City`, `"Zip Code"` - -d. Do your answers to parts (a) and (b) seem to agree? Do your answers to parts (a) and (c) seem to agree? - - -.Deliverables -==== -- a. A table with the 10 most popular values of the `"Store Number"` and the number of occurrences of each. -- b. A table with the 10 most popular values of the `"Store Name"` and the number of occurrences of each. -- c. A table with the 10 most popular values of these three columns pasted together: `Address`, `City`, `"Zip Code"` and the number of occurrences of each. -- d. Do your answers to parts (a) and (b) seem to agree? Do your answers to parts (a) and (c) seem to agree? - -==== - - -=== Question 3 (2 pts) - -For this question, use `fread` to read all 27 million rows of the data set again, but this time, only read in the columns `"Zip Code"`, `"Category Name"`, `"Sale (Dollars)"`. - -a. Use the tapply function to sum the values of `"Sale (Dollars)"` according to the `"Zip Code"`. Find the 10 `"Zip Code"` values that have the largest sum of `"Sale (Dollars)"` altogether, and give those `"Zip Code"` values and each of their sums of `"Sale (Dollars)"`. - - -b. Now use the tapply function to sum the values of `"Sale (Dollars)"` according to the `"Category Name"`. Find the 10 `"Category Name"` values that have the largest sum of `"Sale (Dollars)"` altogether, and give those `"Category Name"` values and each of their sums of `"Sale (Dollars)"`. - - -.Deliverables -==== -- Find the 10 `"Zip Code"` values that have the largest sum of `"Sale (Dollars)"` altogether, and give those `"Zip Code"` values and each of their sums of `"Sale (Dollars)"`. -- Find the 10 `"Category Name"` values that have the largest sum of `"Sale (Dollars)"` altogether, and give those `"Category Name"` values and each of their sums of `"Sale (Dollars)"`. -==== - -=== Question 4 (2 pts) - -Use `fread` to read only the 10th and 15th fields of this huge file: `/anvil/projects/tdm/data/election/itcont2020.txt` (do not worry about the warning in a pink box that appears). - -Sum the amount of the donations (from the 15th field) according to the state (from the 10th field). List the top 10 states according to the sum of the donation amounts, and the sum of the donation amounts in each of these top 10 states. - - -.Deliverables -==== -- List the top 10 states according to the sum of the donation amounts, and the sum of the donation amounts in each of these top 10 states. -==== - -=== Question 5 (2 pts) - -Read the first 10 lines of `/anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv` and notice that the entries of the `genres` column are strings with several types of genres, separated by commas. On the `genres` column (for only these 10 lines), run the following: - -`myDF$genres` - -`strsplit(myDF$genres, ',')` - -`unlist(strsplit(myDF$genres, ','))` - -`table(unlist(strsplit(myDF$genres, ',')))` - -Notice that, in this way, using the `strsplit` function, we can find out how many times each of the individual `genres` occur. 
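If the chain of commands above feels dense, the same split-unlist-table pattern can be tried on a tiny made-up vector first (the genre strings below are invented purely for illustration):

[source, r]
----
# a minimal sketch of the split / unlist / table pattern on made-up data
mygenres <- c("Action,Comedy", "Comedy", "Action,Drama,Comedy")

strsplit(mygenres, ',')                 # a list: one vector of genres per entry
unlist(strsplit(mygenres, ','))         # one long vector of individual genres
table(unlist(strsplit(mygenres, ',')))  # counts: Action 2, Comedy 3, Drama 1
----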
- -Now read in *only* the `genres` column of the entire file (do not worry about the warning that results). For each of the `genres`, list how many times it occurs. For instance, `Action` occurs 462531 times. - -.Deliverables -==== -- For each of the `genres`, list how many times it occurs. -==== - -== Submitting your Work - -This project enables students to select the relevant columns of a data frame for their analysis. - - -.Items to submit -==== -- firstname_lastname_project12.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project13-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project13-teachinglearning-backup.adoc deleted file mode 100644 index ad888c64b..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project13-teachinglearning-backup.adoc +++ /dev/null @@ -1,298 +0,0 @@ -= TDM 10100: R Project 13 -- 2024 - -**Motivation:** Up until the previous project, we've been hyperfocused on numbers and letters; we've learned how to handle, transform, store, and reformat data to do whatever we need. However, as important as data is, most people don't enjoy staring at spreadsheets and attempting to parse out patterns and meaning. That is why **visualization** is such a key aspect of data analysis. This project will cap our exploration of `ggplot2` by exploring new types of visuals and providing some more open-ended questions for you to test your creative and data science skills learned throughout the semester. - -**Context:** A working understanding of piping, `dplyr`, and common `tidyverse` utilities will be helpful for this project. A strong understanding of the typical structure of a call to `ggplot2` will also be good for this project. - -**Scope:** Data visualization, `ggplot2`, `dplyr`, `tidyverse`, R - -.Learning Objectives: -**** -- Learn about different types of `ggplot2` visuals -- Create visuals of different types based on problem goals -- Answer open-ended questions with visuals and analysis of your own design -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/youtube/*` - -== Questions - -=== Question 1 (2 pts) - -[IMPORTANT] -==== -While this is the last _content-based_ project, you will have one more graded project that is dedicated to providing feedback and advice on how to improve the course. 
It should be relatively quick and easy to complete, so I would recommend taking some time to be sure you get that easy 100%! -==== - -To cap off this semester's content, we will be learning about some interesting and unique types of visuals included in `ggplot2` and their use cases along with providing a few open-ended questions that provide you the opportunity to showcase the skills you've been developing all semester! - -To start, read the data from `/anvil/projects/tdm/data/youtube/USvideos.csv` into a new data structure called `USvids`. We will be working with the YouTube data again this project, both because you should already be roughly familiar with it and because you will be able to contrast your work now against that of the last project's and decide which plot types you like best. - -`ggplot2` comes with a huge variety of plot types to choose from, with each having its own specific use cases based on functional benefits along with providing aesthetic choices between similar plot types for added customization. - -One great example of this is the **histogram**, which is one of the most common types of plots used to look at the distribution of data. Take a look at the below example, where we use a histogram to display the distribution of likes on videos in `USvids`. - -[source, r] ----- -# specify data to plot -ggplot(USvids, aes(y=likes)) + - # create histogram with 30 bins - geom_histogram(bins=30) + - ggtitle("Video Like Distribution") + - xlab("Video Likes") + - ylab("Count") ----- - -As you can see from the above example, our distribution is _extremely_ right-tailed, with a super high frequency centered around 0 video likes. Because this obscures many of the smaller distribution patterns in the data, we may want to try and focus on important parts of the data or scale our graph in some way. - -For example, we could filter our existing data and then create a histogram of likes only within the first and third quartile of our data, like so: - -[source, r] ----- -# filter data between Q1 and Q3 -filtered_USvids <- filter(USvids, - likes >= quantile(USvids$likes)["25%"], - likes <= quantile(USvids$likes)["75%"]) - -# specify data to plot -ggplot(filtered_USvids, aes(likes)) + - # create histogram with 50 bins - geom_histogram(bins=50) + - ggtitle("Video Like Distribution; Q1-Q3") + - xlab("Video Likes") + - ylab("Count") ----- - -For this question, do the following: - -- Create a histogram of `comment_count` in `USvids` -- Create a histogram of `comment_count` in `USvids`, filtered for data only between the first and third quartile -- In a markdown cell, compare the two distributions. How are they similar, how are they different, and why? - -[IMPORTANT] -==== -As always, remember to properly style and label _all_ plots that you submit for full credit. -==== - -.Deliverables -==== -- Histogram of `comment_count` -- Histogram of `comment_count` within the IQR for `comment_count` -- 2-4 sentences comparing your two distributions and positing reasons for any similarities or differences between the two -==== - -=== Question 2 (2 pts) - -Histograms can be useful for looking purely at the distribution of values in data, but oftentimes we want to make more complex comparisons over time to identify trends. - -A great and very useful way to do this is to combine the `lubridate` functions we learned about in a previous project with `ggplot2` 's `geom_line()` plot type. - -Take a look at the below example, where we make a line plot of likes over time. 
Run the code, and examine the resulting visual. - -[source, r] ----- -ggplot(USvids, aes(x=publish_time)) + - geom_line(aes(y=likes)) + - labs(title="a", - subtitle="b", - caption="c", - y="Like Count") ----- - -As you can see, plotting the publish date and like count of every single video seems to obscure any trends, as videos with “low” like counts appear throughout the entire graph. There are plenty of ways we could handle this, with one of which being instead plotting the average amount of likes for each period of time. For example, the below code will plot the average amount of likes on a video published in a given month and year. - -[source, r] ----- -# create a tibble of average likes per month -avg_M_Y <- USvids %>% - # Create a new variable for month and year pairs - mutate(publish_month = format(as.Date(publish_time), "%m"), - publish_year = format(as.Date(publish_time), "%Y")) %>% - # get the average for each month-year pair - group_by(publish_month, publish_year) %>% - summarize(avg_M_Y = mean(likes, na.rm=TRUE), .groups='drop') - -# Convert publish_month and publish_year back to Date format -avg_M_Y <- avg_M_Y %>% - mutate(month_year = as.Date(paste(publish_year, publish_month, "01", sep = "-"))) - -# Plotting with ggplot2 -avg_M_Y %>% - ggplot(aes(x = month_year, y = avg_M_Y, group = 1)) + - geom_line() + - labs(x = "Month_Year", y = "Average Like Count", title = "Average Like Count per Month_Year") + - theme_minimal() + - scale_x_date(date_breaks = "1 year", date_labels = "%Y") + - theme(axis.text.x = element_text(angle = 45, hjust = 1)) ----- - -Another approach could be to plot each year with its own line for easy comparison between years, like so: - -[source, r] ----- -# create a tibble of average likes per month -avg_M_Y <- USvids %>% - # Create a new variable for month and year pairs - mutate(publish_month = format(publish_time, "%m"), - publish_year = format(publish_time, "%Y")) %>% - # get the average for each month-year pair - group_by(publish_month, publish_year) %>% - summarize(avg_M_Y = mean(likes, na.rm=TRUE), .groups='drop') - -# Convert publish_month and publish_year back to Date format -avg_M_Y <- avg_M_Y %>% - mutate(month_year = as.Date(paste(publish_year, publish_month, "01", sep = "-"))) - -# Plotting with ggplot2 -avg_M_Y %>% - ggplot(aes(x = publish_month, y = avg_M_Y, color = publish_year, group = publish_year)) + - geom_line() + - labs(x = "Month", y = "Average Like Count", title = "Average Like Count per Month by Year") + - theme_minimal() + - theme(axis.text.x = element_text(angle = 45, hjust = 1)) ----- - -As you can see, the general approach above was to first isolate the data we wanted to plot and then plot it. While there are myriad approaches to this problem, some potentially more concise, separating the data explicitly like this can make pre-processing and grouping much simpler, and we recommend you take a similar approach throughout the rest of this project. 
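- -[NOTE] -==== -An editorial aside, hedged because it depends on how the file was read in: if `publish_time` arrived as a plain character column rather than a date-time, the second `mutate()` above (which calls `format()` without `as.Date()`) will not pull out months and years correctly. A minimal sketch of a defensive fix, assuming the `lubridate` package from the earlier project is available: -==== - -[source, r] ---- -# convert publish_time to a proper date-time once, up front, so that -# format() and as.Date() behave consistently in both approaches above -library(lubridate) -USvids$publish_time <- ymd_hms(USvids$publish_time) - -# quick sanity check: this should now print POSIXct date-times -head(USvids$publish_time) ----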
- -To finish this question, create two plots as described below: - -- create a `geom_line()` plot that displays average comment_count for each month, with all the years along the same axis (as in the first example) -- create a `geom_line()` plot that displays average comment_count for each month, with each year represented by a different line of a different color (as in example two) - -.Deliverables -==== -- A one-line plot of average `comment_count` per month -- A line plot of average `comment_count` per month, using different lines for each year -==== - -=== Question 3 (2 pts) - -Now that we've developed a solid approach for observing time-based patterns in our data, we are ready to build on it for further comparisons. - -Load the data from `/anvil/projects/tdm/data/youtube/CAvideos.csv` and `/anvil/projects/tdm/data/youtube/FRvideos.csv`. Using the _faceting_ that you learned about in the last project, create a line plot that compares the average comment count per month in each country. - -Each plot should be a multi-line plot, where each line is a different year in the data for that country. We'll provide some starter code that demonstrates how to quickly combine the country data below. - -[source, r] ----- -# Combine data from all three tibbles -combined_data <- bind_rows( - USvids %>% mutate(country = "USA"), - CAvids %>% mutate(country = "Canada"), - FRvids %>% mutate(country = "France") -) - -# Create a tibble of average likes per month -# EXERCISE LEFT TO THE READER - -# Plotting with ggplot2, facet by country -# EXERCISE LEFT TO THE READER ----- - -While this may seem like a lot, it is almost entirely copy-paste from the previous question. For a reminder on exactly how faceting works, take a look back at Question 5 from Project 12 for a digestible example. Depending on how much you take from the previous question, this problem can be solved by adding only one extra line to the starter code! (Not counting any copy-pasted lines) - -Finish this question off by writing a few sentences analyzing the patterns between countries. Is there anything of note? - -.Deliverables -==== -- A faceted line plot, for the US, France, and Canada data -- A few sentences, in a markdown cell, describing any trends or differences you see between countries. -==== - -=== Question 4 (2 pts) - -Now that we've looked at a few examples of more complex plots available to us, its your turn to express your creativity and skill learned throughout the semester. Using a visualization of your choice from http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html[this list], create a plot that demonstrates the average number of likes, by category, videos in the `USvideos` dataset got. You may not use any plot type already covered in this project. - -You may find the following code helpful to map the numerical category IDs to their actual names, such that your plot is easier to understand. 
- -[source, r] ---- -# create a named vector of category name / ID pairs -name_ids <- c("Film & Animation" = 1, - "Autos and Vehicles" = 2, - "Music" = 10, - "Pets & Animals" = 15, - "Sports" = 17, - "Short Movies" = 18, - "Travel & Events" = 19, - "Gaming" = 20, - "Videoblogging" = 21, - "People & Blogs" = 22, - "Comedy" = 23, - "Entertainment" = 24, - "News and Politics" = 25, - "Howto & Style" = 26, - "Education" = 27, - "Science & Technology" = 28, - "Nonprofits & Activism" = 29, - "Movies" = 30, - "Anime/Animation" = 31, - "Action/Adventure" = 32, - "Classics" = 33, - "Comedy" = 34, - "Documentary" = 35, - "Drama" = 36, - "Family" = 37, - "Foreign" = 38, - "Horror" = 39, - "Sci-Fi/Fantasy" = 40, - "Thriller" = 41, - "Shorts" = 42, - "Shows" = 43, - "Trailers" = 44) - -# map the category names onto the numerical IDs present in our data -USvids$category <- names(name_ids)[match(USvids$category_id, name_ids)] ---- - -For full credit, ensure your plot is well-formatted and makes clear what categories had the highest and lowest average likes. Be sure to include appropriate axis labels and a legend! - -.Deliverables -==== -- A plot demonstrating average likes, by category, for `USvids` -==== - -=== Question 5 (2 pts) - -To finish off this project, and the course content as a whole for the semester, we are going to provide you with the opportunity to create your own question. - -To receive full credit, you must think of a question about the data and then, using a plot, answer that question to the best of your abilities. Your final answer should include a markdown cell containing your question, a `ggplot2` plot of a type that we have not used in this project (and that you did not use in the last question), and another markdown cell answering your question and linking the plot you created to your answer. - -Take a look at the below for some examples of acceptable questions. Feel free to build on these, but don't just copy them and use them as your own: - -- Do different countries have similar trends for popularity of videos over time? -- Which category of video has the highest comment count, on average? -- Are different categories of video published more often at specific times? - -If you're really struggling to think of a question, consider using one of the above examples, but making comparisons between the different countries available to us. Take the time to develop a question that's interesting to you, and create a quality answer to it. - -.Deliverables -==== -- Your invented question along with its associated plot and answer. -==== - -== Submitting your Work - -With this project complete, you've now finished all of the new course content for TDM 10100! While this may signify the end of our formal learning together _in this class_, we really hope to see you continue with The Data Mine and are so grateful for the opportunity to get to know each of you better throughout this semester. - -If you have _any_ feedback about this course, including what projects you thought were too easy/difficult, logistics you think needed improving, or anything else that comes to mind, please use Project 14 as your time to voice those thoughts and help us improve this class going forward. - -Regardless, we are so grateful for the opportunity to interact with you this semester, and we hope to be able to continue to support you in your learning journey in the future. Thanks so much, and have a great winter break!
- -.Items to submit -==== -- firstname_lastname_project13.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project13.adoc deleted file mode 100644 index bf6ec6f9c..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project13.adoc +++ /dev/null @@ -1,216 +0,0 @@ -= TDM 10100: R Project 13 -- 2024 - -**Motivation:** It is fun and straightforward to do mapping in R! - -**Context:** We will create several maps in R, including one with a Thanksgiving theme. - -**Scope:** Maps in R are large, so when you download your Jupyter Lab project to your computer and then upload it to Gradescope, you won't be able to easily view it in Gradescope, but that's OK. The graders will still be able to see it. - -.Learning Objectives: -**** -- Learn how to make maps in R. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/craigslist/vehicles.csv` (Craigslist vehicles) -- `/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2015-11.csv` (New York City taxi cab data from November 2015) - - -== Questions - -[WARNING] -==== -Please use 3 cores in your Jupyter Lab session for this project. -==== - -=== Question 1 (2 pts) - -First load: `library(data.table)` so that you have the `fread` function available, and also `options(repr.matrix.max.cols=50)` so that you can see 50 columns. - -Also load: `options(jupyter.rich_display = T)` so that you can draw maps in Jupyter Lab. - -For this project, we need two maps packages: `library(leaflet)` and `library(sf)` - -For this first question, we simply make a simple map, with three latitude and longitude values. - -We call our data frame `testDF`, and it is important that we name the columns: - -[source,r] ----- -testDF <- data.frame(c(40.4259, 41.8781, 39.0792), c(-86.9081, -87.6298, -84.17704)) -names(testDF) <- c("lat", "long") ----- - -The whole data frame looks like this, just three rows and two columns: - -[source,r] ----- -testDF ----- - -Now we can define the points that we want to plot on a map: - -[source,r] ----- -points <- st_as_sf( testDF, coords=c("long", "lat"), crs=4326) ----- - -and we can render the map. We make each dot have radius 1, but you are welcome to change the radius if you want to: - -[source,r] ----- -addCircleMarkers(addTiles(leaflet( testDF )), radius=1) ----- - - -.Deliverables -==== -Show the map with the three points. 
-==== - - -=== Question 2 (2 pts) - -Now load the Craiglist data as follows: - -[source,r] ----- -myDF <- fread("/anvil/projects/tdm/data/craigslist/vehicles.csv", - stringsAsFactors = TRUE, nrows=100) ----- - -Examine the `head` of `myDF` and also the `dim` of `myDF`, and see which columns are the `state` and the `long` and the `lat` columns. - - -Now read ALL of the rows of the data set into a new data frame, but only the three columns called `state` and `long` and `lat` (the other columns will not be needed). - -Make a `subset` of this new data frame, satisfying 3 conditions, namely, the `state` variable indicates that the data is from Indiana, and the `long` and `lat` values are not missing. - -[source,r] ----- -(state=="in") & (!is.na(long)) & (!is.na(lat)) ----- - -You should now have a data frame with 3 columns and 5634 rows. - - -.Deliverables -==== -Display the dimension of your new data frame, which should have 3 columns and 5634 rows. -==== - - -=== Question 3 (2 pts) - -Now make a plot of the data frame that you created in question 2, using these two lines of R: - -[source,r] ----- -points <- st_as_sf( mynewdataframe, coords=c("long", "lat"), crs=4326) -addCircleMarkers(addTiles(leaflet( mynewdataframe )), radius=1) ----- - -Please note that, with Craigslist, people can list items from anywhere in the country. So there are some items outside Indiana, even though we selected only the items that are supposed to be from the State of Indiana. BUT fortunately, you will see that most people's listings are accurate. In other words, if you zoom in and out on the map, you will see that most of the dots appear in Indiana. - - -.Deliverables -==== -Show the map with the Craiglist data from Indiana. (Some of the data points will be outside Indiana, but most of them will be in the State of Indiana.) -==== - -=== Question 4 (2 pts) - -In question 4 and question 5, we will verify the path of the Thanksgiving parade in New York City from 2015, as shown on this image: - -https://www.bizjournals.com/newyork/news/2015/11/25/thanksgiving-is-tomorrow-but-parade-re-routes-and.html - -You can import the New York taxi cab data from November 2015 as follows: - -[source,r] ----- -myDF <- fread("/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2015-11.csv", tz="") ----- - -(The `tz=""` indicates that the time zone is not given in this data.) - -Make a new data frame called `thanksgivingdayDF` by using the `subset` function, with the option `grepl("2015-11-26", tpep_pickup_datetime)` to extract the rows of the data from Thanksgiving day. Your new data frame should have 242393 rows and 19 columns. - -Create two times in R, one for the start of the parade, and one for the end of the parade: - - -[source,r] ----- -paradestart <- strptime("2015-11-26 09:00:00", format="%Y-%m-%d %H:%M:%S", tz="EST") - -paradeend <- strptime("2015-11-26 12:00:00", format="%Y-%m-%d %H:%M:%S", tz="EST") ----- - -Now make a vector of times (converting the pickup times from the taxi cab rides from strings into times). - -[source,r] ----- -mytimes <- strptime(thanksgivingdayDF$tpep_pickup_datetime, format="%Y-%m-%d %H:%M:%S", tz="") ----- - -Finally, make a new data frame called `finalDF` from the data frame `thanksgivingdayDF`, using the `subset` function with the condition `(mytimes > paradestart) & (mytimes < paradeend)`. - -Your data frame `finalDF` should have 28704 rows. - - - -.Deliverables -==== -Display the dimension of your data frame called `finalDF`, which should have 28704 rows. 
-==== - -=== Question 5 (2 pts) - -If you examine the head of `finalDF`, you see that the latitude values are called `pickup_latitude` and `pickup_longitude`. - -We want them to be called `lat` and `long` instead, so we can make a new data frame as follows: - -[source,r] ----- -testDF <- data.frame( finalDF$pickup_latitude, finalDF$pickup_longitude) -names(testDF) <- c("lat","long") ----- - -Finally, plot the latitude and longitude values from `testDF` using a smaller radius than you used in Question 1 and Question 3. We suggest `radius=.1`. - -You will notice that taxi cabs were unable to pickup passengers on the route of the Thanksgiving Day parade because those roads were closed. Please zoom into the map and verify this, comparing your map to the parade route map: - -https://www.bizjournals.com/newyork/news/2015/11/25/thanksgiving-is-tomorrow-but-parade-re-routes-and.html - - -.Deliverables -==== -Show the map with the data from Thanksgiving morning on November 26, 2015, at the time of the parade. -==== - -[WARNING] -==== -Because of the maps in this project, when you upload your work to Gradescope, it will say: "Large file hidden. You can download it using the button above." That is what the graders will do, namely, they will download it when they are grading it. This warning is expected because your maps are large, and that is totally OK. -==== - -== Submitting your Work - -This project gives you familiarity with mapping in R. - - -.Items to submit -==== -- firstname_lastname_project13.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project14.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project14.adoc deleted file mode 100644 index ed6f1c08a..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project14.adoc +++ /dev/null @@ -1,89 +0,0 @@ -= TDM 10100: R Project 14 -- 2024 - -**Motivation:** We covered a _lot_ this semester! From basic paradigms of `R` as a language, to concepts about working with data, we hope that you have had the opportunity to learn a lot, and to improve your data science skills. For our final project of the semester, we want to provide you with the opportunity to give us your feedback on how we connected different concepts, built up skills, and incorporated real-world data throughout the semester, along with showcasing the skills you learned throughout the past 13 projects! - -**Context:** This last project will work as a consolidation of everything we've learned thus far, and may require you to back-reference your work from earlier in the semester. 
- -**Scope:** R, Data Science - -.Learning Objectives: -**** -- Reflect on the semester's content as a whole -- Offer your thoughts on how the class could be improved in the future -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 (2 pts) - -The Data Mine team is writing a Data Mine book to be (hopefully) published in 2025. We would love to have a couple of paragraphs about your Data Mine experience. What aspects of The Data Mine made the biggest impact on your academic, personal, and/or professional career? Would you recommend The Data Mine to a friend and/or would you recommend The Data Mine to colleagues in industry, and why? You are welcome to cover other topics too! Please also indicate (yes/no) whether it would be OK to publish your comments in our forthcoming Data Mine book in 2025. - -.Deliverables -==== -Feedback and reflections about The Data Mine that we can potentially publish in a book in 2025. -==== - -=== Question 2 (2 pts) - -Reflecting on your experience working with different datasets, which one did you find most enjoyable, and why? Discuss how this dataset's features influenced your analysis and visualization strategies. Illustrate your explanation with an example from one question that you worked on, using the dataset. - -.Deliverables -==== -- A markdown cell detailing your favorite dataset, why, and a working example and question you did involving that dataset. -==== - -=== Question 3 (2 pts) - -While working on the projects, how did you validate the results that your code produced? For instance, did you try to solve problems in 2 different ways? Or did you try to make summaries and/or visualizations? How did you prefer to explore data and learn about data? Are there better ways that you would suggest for future students (and for our team too)? Please illustrate your approach using an example from one problem that you addressed this semester. - -.Deliverables -==== -- A few sentences in a markdown cell on how you conducted your work, and a relevant working example. -==== - -=== Question 4 (2 pts) - -Reflecting on the projects that you completed, which question(s) did you feel were most confusing, and how could they be made clearer? Please cite specific questions and explain both how they confused you and how you would recommend improving them. - -.Deliverables -==== -- A few sentences in a markdown cell on which questions from projects you found confusing, and how they could be written better/more clearly, along with specific examples. -==== - -=== Question 5 (2 pts) - -Please identify 3 skills or topics related to the R language or data science (in general) that you wish we had covered in our projects. For each, please provide an example that illustrates your interests, and the reason that you think they would be beneficial. - -.Deliverables -==== -- A markdown cell containing 3 skills/topics that you think we should've covered in the projects, and an example of why you believe these topics or skills could be relevant and beneficial to students going through the course. 
-==== -=== OPTIONAL but encouraged: - -Please connect with Dr Ward on LinkedIn: https://www.linkedin.com/in/mdw333/ - -and also please follow our Data Mine LinkedIn page: https://www.linkedin.com/company/purduedatamine/ - -and join our Data Mine alumni page: https://www.linkedin.com/groups/14550101/ - - - -== Submitting your Work - -If there are any final thoughts you have on the course as a whole, be it logistics, technical difficulties, or nuances of course structuring and content that we haven't yet given you the opportunity to voice, now is the time. We truly welcome your feedback! Feel free to add as much discussion as necessary to your project, letting us know how we succeeded, where we failed, and what we can do to make this experience better for all our students and partners in 2025 and beyond. - -We hope you enjoyed the class, and we look forward to seeing you next semester! - -.Items to submit -==== -- firstname_lastname_project14.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project2-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project2-teachinglearning-backup.adoc deleted file mode 100644 index 024120723..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project2-teachinglearning-backup.adoc +++ /dev/null @@ -1,327 +0,0 @@ -= TDM 10100: R Project 2 -- 2024 - -**Motivation:** R is one of the most popular programming languages in the world for statistical and scientific analysics, and as such, it is very important to learn. While it does have many niche aspects that make it particularly suited to data analysis, it also shares a lot of common features with other programming languages. In the next few projects we will be doing a deep dive into these common features, learning about operators, variables, functions, looping and logic, and more! - -**Context:** Project 1's introduction to Jupyter Notebooks will be vital here, and it will be important to understand the basics that we covered last week. Feel free to revisit Project 1 for reminders on the basics! - -**Scope:** R, Operators, Conditionals - -.Learning Objectives: -**** -- Learn how to perform basic arithmetic in R -- Get familiar with conditional structures in R -- Solve a famous programming problem using math! -- Apply your problem solution to real-world data -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/olympics/athlete_events.csv` - -== Questions - -[IMPORTANT] -==== -For this project and moving forward, please use the `seminar-r` kernel, not the `seminar` kernel, unless otherwise specified. 
This will save us the trouble of having to include the `%%R` cell magic in every new cell, because we will be working pretty exclusively in R. - -Hint: If you're having trouble remembering how to do this, please take a look back at Project 1, Question 2 for an in-depth walkthrough. -==== - -=== Question 1 (2 pts) - -The first step to understanding and learning a new programming language is learning its `operators`, symbols that represent logical or arithmetic operations. There are a few different general types of operators, detailed below: - -- arithmetic operators: perform common mathematical operations (i.e `+` is addition, `*` is multiplication, `%%` is modulus AKA remainder from division, i.e. 5%%4 is 1) -- assignment operators: assign values to variables. More on variables in next week's project, but variables are basically names that we assign values (i.e. `x \<- 6` makes the value of x be 6.) R is unique in that you can actually change the direction of assignment. `x\<-6` is the same as `6\->x` -- comparison operators: compare two values and return either `TRUE` or `FALSE`. (For example, `x == y` checks if x and y are equal. `x != y` checks if x and y are not equal. `x \<= y` checks if x is less than or equal to y.) -- logical operators: these are used to create compound comparisons that can check multiple conditions. (i.e. if we wanted to make sure x was less than 5, but greater than 2, we could write `x < 5 && x > 2`) -- membership operators: these allow us to check if a list contains a specific value. If we had a list of numbers named `ourlist`, we could write `5 %in% ourlist` which would return `TRUE` if 5 is in `ourlist` and `FALSE` if it is not. - -These are the basic types of operators we'll work with during this class. For a more exhaustive list, and direct examples of how to use each operator, https://www.w3schools.com/r/r_operators.asp[this website] can help give detailed descriptions of all the different operators in R. - -In these next few questions, you'll be asked to write your own code to perform basic tasks using these operators. Please refer to the above linked website and descriptions for reminders on which operators are which, and how they can be used. - -[NOTE] -==== -In R, everything after a `#` on a line of code is considered a 'comment' by the computer. https://www.w3schools.com/r/r_comments.asp[Comments] serve as notes in your code, and don't do anything when run. It should be a priority to always comment your code well to make it understandable to an outside reader (or you, in the future!). -==== - -[IMPORTANT] -==== -**Precedence** is an important concept with operators, and determines which of our operators "acts first". You can think of it as being similar to the concept of arithmetic order-of-operations in math (like https://www.mathsisfun.com/operation-order-pemdas.html[_PEMDAS_]). https://www.datamentor.io/r-programming/precedence-associativity[This website] details operator precedence in R and is worth taking a look at before attempting the next two questions. -==== - -Before you attempt this question, take a minute to view the below examples. Attempt to run them on your own and observe their behavior. 
- -[source, r] ----- -a <- 5 # a is 5 (assignment) -a <- a + 1 # a is 6 (addition) -a <- a * 3 # a is 18 (multiplication) - -b <- 2 * (a + 3) # b is 42, a is still 18 -c <- 2 * a + 3 # c is 39, a is still 18 - -d <- a > 10 # d is TRUE -e <- b < 20 # e is FALSE -f <- d && e # f is FALSE - -# print the final results of all operations (\n means new line): -cat("a:", a, "\nb:", b, "\nc:", c, "\nd:", d, "\ne:", e, "\nf:", f, "\n") ----- - -For this question, please complete the following tasks in a single code cell. At the very end of the code cell, please add `print(myVariable)` to print the results of your work. Some starter code has been provided below for your reference. - -[source, r] ----- -# create a new variable named myVariable and assign it a value of 7 -myVariable <- 7 - -myVariable <- # multiply here -myVariable <- # subtract, then multiply here -myVariable <- # add the two values, then multiply here -myVariable <- # complete the rest of the math here - -# print the final value of myVariable (a special date!) -print(myVariable) ----- - -. Create a new variable named `myVariable` and assign it a value of 7. -. Multiply the value of `myVariable` by the number representing your birth month. -. In one line of code, using two arithmetic operators, subtract 1 from the value of `myVariable` and then multiply it by 13 (Hint: You will need to use parentheses!) -. In one line of code, using three arithmetic operators: add 3 to `myVariable`, add the day of your birth to `myVariable`, and then multiply `myVariable` by 11 -. All in one line, subtract the month of your birth and the day of your birth from `myVariable`, divide it by 10, add 11, and then divide by 100. (Hint: You will need to use parentheses!) -. In a https://www.markdownguide.org/cheat-sheet/[markdown cell], write a sentence describing the number you got as the value of `myVariable` at the end of all these operations. Is there anything special about it? (It may or may not be an important date from your life!) -. Print the final value of `myVariable` - -.Deliverables -==== -- A code cell containing the 5 lines of code requested above, and a print statement showing the final value of `myVariable`. -- A markdown cell identifying what is special about the resulting number. -==== - -=== Question 2 (2 pts) - -While we'll cover control structures in greater detail in the next few weeks, let's introduce the basic concept so we can see the **power** of logical operators when used in conditionals! - -Conditionals are exactly what they sound like: blocks of code that perform actions _if_ we satisfy certain conditions. Creatively, we call these _if statements_. In R, _if statements_ are structured like so: - -[source, r] ----- -# general structure -if (condition) { - do this action -} - -# specific example -if (x > 0) { - print("X is a positive number!") -} ----- - -For this question, we want you to use the operators we just learned to perform the following: - -- define a variable `myYear` -- write an `if statement` that prints "Divisible by 4!" if `myYear` is divisible by 4 -- write an `if` statement that prints "Not divisible by 100!" if `myYear` is not divisible by 100 -- write an `if` statement that prints "Leap Year!" if `myYear` is divisible by 4 **AND** myYear is not divisible by 100 - -Here is some skeleton code to get you started (the first if statement is already completed): - -[source, r] ----- -myYear <- 2000 - -if (myYear %% 4 == 0) { - print("Divisible by 4!") -} -if # continue your code here... 
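-# (editor's hint, not part of the original skeleton) the remaining two -# statements follow the same pattern: the second checks myYear %% 100 != 0, -# and the third combines both checks with && inside a single if (...)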
----- - -[IMPORTANT] -==== -The `&&` AND operator may be useful here when you want to check that two conditions are _both true_. For more information about the `&&` operator, please refer to the resources presented in the first question. -==== - -To check your work, here are the following test cases: - -- Year 2000 is divisible by 4 and 100 -- Year 2020 is divisible by 4, but not by 100 (meaning it is a _leap year_) -- Year 1010 is not divisible by 100 or 4 - -.Deliverables -==== -- Three _if_ statements as described above. -==== - -=== Question 3 (2 pts) - -Let's continue to build on the foundational concept of _if_ statements. Sometimes, when our first condition is not true, we want to do something else. Sometimes we only want to do something else if _another_ condition is true. In an astounding feat of creativity, these are called _if/else/else-if_ statements, and here is their general structure: - -[source, r] ----- -# general structure (we can have as many else ifs as we want!) -if (condition) { - do this -} else if (other condition) { - do this instead -} else if (third condition) { - do this if we meet third condition -} else { - this is our last option -} - -# we can also have no else if statements if we want! -if (condition) { - do this -} else { - do this instead -} - -# and finally, a concrete example -x <- 5 # you can change 5 to any value you'd like! -if (x > 100) { - print("x is greater than 100!") -} else if (x > 0) { - print("x is a positive number less than 100!") -} else if (x < -100) { - print("x is less than -100!") -} else { - print("x is a negative number greater than -100!") -} ----- - -Feel free to experiment with these examples, plugging in different values of `x` and seeing what happens. Learning to code is done with lots of experimentation, and exploring/making mistakes is a valuable part of that learning experience. - -Let's build on your code from the last problem to create an _if/else/else-if_ statement that is able to identify any and all leap years! Below is the definition of a leap year. Your task for this question is to take the below definition and, defining a variable `myYear`, write an _if/else/else-if_ block that prints "Is a leap year!" if `myYear` is a leap year, and prints "Is not a leap year!" if `myYear` is not a leap year. - -[IMPORTANT] -==== -A year is a leap year if it is divisible by 4 and not 100, _or_ if it is divisible by 100 and 400. To put it in language that may make more sense in a conditional structure: - -If a year is divisible by 4, but not divisible by 100, it is a leap year. Else if a year is divisible by 100 and is divisible by 400, it is a leap year. Else, it is not a leap year. -==== - -[source, r] ----- -myYear <- 2000 - -if ( ... ) { - print("Is a leap year!") -} else if ( ... ) { - print("Is a leap year!") -} -else { - print("Is not a leap year!") -} ----- - -[NOTE] -==== -Here are some test cases for you to use to double-check that your code is working as expected. - -- 2000, 2004, 2008, 2024 are all leap years -- 1700, 1896, 1900, and 2010 are all not leap years -==== - -.Deliverables -==== -- A conditional structure to identify leap years, and the results of running it with at least one year. -==== - -=== Question 4 (2 pts) - -Okay, we've learned a lot in this project already. Let's try and master the concepts we've been working on by making a more concise version of the conditional structure from the last problem. Here are the rules: you must create a conditional structure with only one _if_ and only one _else_. 
No _else ifs_ are allowed. It has to accomplish fundamentally the same task as in the previous question, and you may use the test cases provided in the previous question as a way to validate your work. Some basic skeleton code is provided below for you to build on: - -[source, r] ----- -myYear <- 2000 - -if ( ... ) { - print("Is a leap year!") -} else { - print("Is not a leap year!") -} ----- - -[IMPORTANT] -==== -For this question, the `||` OR operator, which functions similarly to the `&&` AND operator mentioned in question 2, may be helpful. -==== - -.Deliverables -==== -- The results of running your conditional on at least one leap year. -==== - -=== Question 5 (2 pts) - -Great work so far. Let's summarize what we've learned. In this project, we learned about the different types of operators in R and how they are used, what conditional statements are and how they are structured, and how we can use logical and comparison operators in conditional statements to make decisions in our code! - -For this last question, we'll use what operators and conditionals on real-world data and make observations based on our work! The below code has been provided to you, and contains a few new concepts we are going to cover in next week's project (namely, `for` loops and lists). For now, you don't have to understand fully what is going on. Just insert the conditions you wrote in the last problem where specified to complete the code (you only have to change lines with `===` in comments), run it, and write at least 2 sentences about the results of running your code and any observations you may have regarding that output. Include in those two sentences what percentage of the Olympics were held on leap years. (If you are interested in understanding the provided code, feel free to take some time to read the comments explaining what each line is doing.) - -[IMPORTANT] -==== -The Olympics data can be found at `/anvil/projects/tdm/data/olympics/athlete_events.csv` -==== - -[NOTE] -==== -In the below code, you may have noticed the addition of `.unique()` when we're getting a list of years from our data. We'll refrain from covering this in detail until a future project, but what you can know is that here it takes our list of all years and removes all the duplicate years so we have only one of each year in our resulting `year_list` -==== - -[NOTE] -==== -You will also notice the `cat()` function below, which you can think of as the same as the `print()` function we've been using, but printing everything on one line instead of having a whole bunch of messy output. Feel free to swap `cat()` and `print()` to see the difference between their outputs. -==== - -[source, r] ----- -olympics_df <- # === read the dataset in here (see Project 1 for reminder!) 
=== - -# get a list of each year in our olympics_df using c(), -# and use unique() to remove duplicate years -year_list <- unique(olympics_df$Year) -year_list <- year_list[!is.na(year_list)] # removes all NA values from our list - -# create an empty list for our results -leap_list = c() - -# apply our conditional to each year in our list of years -for (year in year_list) { - if # === add your condition for leap years here === { - # add the year to our list of leap years - leap_list <- append(leap_list, year) - } -} - -# prints our list of leap years and number of leap years -cat("The Olympics were held on leap years in:", sort(leap_list), "\n") -cat(length(leap_list), "of the", length(year_list), "Olympics occurrences in our data were held on a leap year.\n") ----- - -.Deliverables -==== -- The results of running the completed code -- At least two sentences containing observations about the results and what percentage of Olympics are held on leap years -==== - -== Submitting your Work - -Great job, you've completed Project 2! This project was your first real foray into the world of R, and it is okay to feel a bit overwhelmed. R is likely a new language to you, and just like any other language, it will get much easier with time and practice. As we keep building on these fundamental concepts in the next few weeks, don't be afraid to come back and revisit your previous work. As always, please ask any questions you have during seminar, on Piazza, or in office hours. We hope you have a great rest of your week, and we're excited to keep learning about R with you in the next project! - -.Items to submit -==== -- firstname_lastname_project2.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project2.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project2.adoc deleted file mode 100644 index fb5efe5df..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project2.adoc +++ /dev/null @@ -1,192 +0,0 @@ -= TDM 10100: R Project 2 -- 2024 - -**Motivation:** R is one of the most popular tools for data analysis. Indexing and grouping values in R are very powerful. (We can do a lot, with just one line of R!) - -**Context:** We will load several data frames in R and will practice indexing the data in several ways. - -**Scope:** R, Operators, Conditionals - -.Learning Objectives: -**** -- Get comfortable with extracting data in R that satisfy various conditions -- Learning how to use indices in R -- Apply these techniques with real-world data -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/olympics/athlete_events.csv` -- `/anvil/projects/tdm/data/election/itcont1980.txt` - -== Questions - -[IMPORTANT] -==== -For this project (and moving forward, when you are using R), please use the `seminar-r` kernel (not the `seminar` kernel), unless otherwise specified. When you use the `seminar-r` kernel with R, you do not need to use the `%%R` cell magic. -==== - -=== Question 1 (2 pts) - -Import the Olympics data from the file `/anvil/projects/tdm/data/olympics/athlete_events.csv` into a data frame called `myDF`. Make a table from the values in the column `myDF$Year` and the plot this table. (Your work will be similar to Project 1, Questions 3, 4, 5.) [Take a look at the resulting plot: Does the resulting plot make sense? For instance: Does it make sense that the number of athletes is increasing over time? Can you see the halt in the Olympics during the two World Wars? Do you see the 2-year rotation between summer and winter Olympics began in the 1990s?] - -++++ - -++++ - -.Deliverables -==== -- A table showing the number of athletes participating in the Olympics during each year. -- A plot showing the number of athletes participating in the Olympics during each year. -- As *always*, be sure to document your work from Question 1 (and from all of the questions!), using some comments and insights about your work. We will stop adding this note to document your work, but please remember, we always assume that you will *document every single question with your comments and your insights*. -==== - -=== Question 2 (2 pts) - -In the Olympics data: - -Which value appears in the "NOC" column the most times? - -Which value appears in the "Name" column the most times? Hint: If you try to view the entire table of values in the "Name" column, the table has length 134732, and it will not finish displaying. For this reason, you should *only* look at the `head` or the `tail` of your table, not the entire table itself. - -++++ - -++++ - -++++ - -++++ - -.Deliverables -==== -- The value that appears in the "NOC" column the most times. -- The value that appears in the "Name" column the most times. -==== - -=== Question 3 (2 pts) - -In the Olympics data: - -When we examine the `head` of `myDF`, notice that the third row is from team "Denmark" while the fourth row is from team "Denmark/Sweden". - -How many rows correspond *exactly* to team "Denmark"? - -How many rows have "Denmark" in the team name ("Denmark" may or may not be the exact team name)? Hint: You can use the `grep` or `grepl` function. - -Find the names of the teams that have "Denmark" in the team name but are not exactly "Denmark". Hint: There should be exactly 72 such rows. - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - - -.Deliverables -==== -- The number of rows corresponding *exactly* to team "Denmark". -- The number of rows with "Denmark" as part of the team name. -- The names of teams that have "Denmark" included but are not exactly "Denmark". -==== - - -=== Question 4 (2 pts) - -Not all data comes in a comma-delimited format, i.e., with commas in between the pieces of data. In the data set of donations from the 1980 federal election campaigns, the symbol "|" is placed between pieces of data. 
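- -One quick way to check which delimiter a file uses, before deciding how to read it in, is to peek at a few raw lines of the file. Here is a minimal sketch using base R; its output should look like the sample block below: - -[source, r] ---- -# print the first few raw lines of the file; writeLines() drops the quotes -# that print() would add, so the | delimiters are easy to spot -writeLines(readLines("/anvil/projects/tdm/data/election/itcont1980.txt", n = 5)) ----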
- -[source, bash] ----- -C00078279|A|M11|P|80031492155|22Y||MCKENNON, K R|MIDLAND|MI|00000|||10031979|400|||||CONTRIBUTION REF TO INDIVIDUAL|3062020110011466469 -C00078279|A|M11||79031415137|15||OREFFICE, P|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1500||||||3061920110000382948 -C00078279|A|M11||79031415137|15||DOWNEY, J|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|300||||||3061920110000382949 -C00078279|A|M11||79031415137|15||BLAIR, E|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1000||||||3061920110000382950 -C00078287|A|Q1||79031231889|15||BLANCHARD, JOHN A|CHICAGO|IL|60685|||03201979|200||||||3061920110000383914 -C00078287|A|Q1||79031231889|15||CRAMER, JOHN H|CHICAGO|IL|60685|||02281979|200||||||3061920110000383915 -C00078287|A|Q1||79031231889|15||MCHUGH, KEVIN|CHICAGO|IL|60685|||03051979|200||||||3061920110000383916 -C00078287|A|Q1||79031231889|15||NOHA, EDWARD J|CHICAGO|IL|60685|||03121979|300||||||3061920110000383917 -C00078287|A|Q1||79031231889|15||RYCROFT, DONALD C|CHICAGO|IL|60685|||03191979|200||||||3061920110000383918 -C00078287|A|Q1||79031231889|15||VANDERSLICE, WILLIAM D|CHICAGO|IL|60685|||02271979|200||||||3061920110000383919 ----- - - -Instead of using the `read.csv` function to read in the data, we can use the `fread` function to read in the data, and it will *automatically* detect what symbol is placed between the pieces of data. The `fread` function is not available by default, so we first load the `data.table` library. - -This data set also does not have the names of the columns built in! So we need to specify the names of the columns. - -You can use the following to read in the data and name the columns properly: - -[source, bash] ----- -library(data.table) -myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="") -names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID") ----- - -Now that you have the data read into the data frame `myDF`, here are two questions to get familiar with the data: - -Which value appears in the "STATE" column the most times? - -Which value appears in the "NAME" column the most times? Hint: As in question 2, if you try to view the entire table of values in the "NAME" column, the table has length 217646, and it will not finish displaying. For this reason, you should *only* look at the `head` or the `tail` of your table, not the entire table itself. - - -.Deliverables -==== -- The value that appears in the "STATE" column the most times. -- The value that appears in the "NAME" column the most times. -==== - - -=== Question 5 (2 pts) - -In the data set about the 1980 federal election campaigns: - -Use the `paste` command to join the "CITY" and "STATE" columns, with the goal of determining the top 5 city-and-state locations where donations were made. - -Hint: As in questions 2 and 4, if you try to view the entire table of values of city-and-state pairs, the table has length 217646, and it will not finish displaying. For this reason, you should *only* look at the `head` or the `tail` of your table, not the entire table itself. - -Another hint: Please notice the fact that there are 11582 rows in the data set in which the "CITY" and "STATE" are both empty! - -++++ - -++++ - -++++ - -++++ - -.Deliverables -==== -- The top 5 city-and-state locations where donations were made in the 1980 federal election campaigns. 
-==== - - - - -== Submitting your Work - -Great job, you've completed Project 2! This project was your first real foray into the world of R, and it is okay to feel a bit overwhelmed. R is likely a new language to you, and just like any other language, it will get much easier with time and practice. As we keep building on these fundamental concepts in the next few weeks, don't be afraid to come back and revisit your previous work. As always, please ask any questions you have during seminar, on Piazza, or in office hours. We hope you have a great rest of your week, and we're excited to keep learning about R with you in the next project! - -.Items to submit -==== -- firstname_lastname_project2.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project3-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project3-teachinglearning-backup.adoc deleted file mode 100644 index b79b58e62..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project3-teachinglearning-backup.adoc +++ /dev/null @@ -1,244 +0,0 @@ -= TDM 10100: R Project 3 -- 2024 - -**Motivation:** So far, we've learned how to set up and run code in a Jupyter Notebook IDE (integrated development environment), perform operations, and set up basic decision structures (_conditionals_) in our code. However, all we've really done so far is just define one value at a time to pass into our conditionals and then changed that value by hand. As you probably realized, this is inefficient and completely impractical if we want to handle lots of data, either iteratively (aka one-by-one) or in some other efficient method (i.e in parallel, by grouping, etc. More on this later...). This project will be dedicated to learning about looping structures and vectorization, some common approaches that we use to iterate through and process data instead of doing it by hand. - -**Context:** At this point, you should know how to read data into an R dataframe from a .csv file, understand and be able to write your own basic conditionals, and feel comfortable using operators for logic, math, comparison, and assignment. 
- -**Scope:** For Loops, While Loops, Vectorized operations, conditionals, R - -.Learning Objectives: -**** -- Learn to design and write your own `for` loops in R -- Learn to design and write your own `while` loops in R -- Learn about "vectorization" and how we can use it to process data efficiently -- Apply looping and vectorization concepts to real-world data -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/datasets/tdm_temp/data/whin/observations.csv` - -== Questions - -=== Question 1 (2 pts) - -Let's first discuss two key foundational concepts to understand when working with loops: lists and indexing. Lists are relatively intuitive: they are a list of elements. In R, lists can contain elements of different data types like strings (`x \<- "Hello World"`) and integers (`y \<- 7`). It is oftentime good practice to have all of the data in a list be the same type of data. When we want this to be enforced by our kernel/interpreter, we can use `vectors` instead of `lists. We will talk more about vectors in a future project. Lists in R are constructed like so: - -[source, r] ----- -# create a list with three integer elements -list1 <- list(23, 49, 5) - -# create a list with three string elements -list2 <- list("hello", "Mister", "Foobar") - -# create an empty list -list3 <- list() ----- - -Now we should talk about the interesting part of lists: adding, subtracting, and accessing elements in lists! Adding and removing elements from lists in R can be done a number of ways, often by using built in functions like `append()`. More exhaustive descriptions of how lists function in R can be found https://www.w3schools.com/r/r_lists.asp[here]. - -Accessing elements of a list, also called _indexing_, is different from language to language. In R, lists are _1-indexed_, meaning the first element in a list is at **index 1**. Accessing elements of a list can be done using square brackets. We can also _slice_ a list, which is simply indexing in such a way that we grab a chunk of elements from the list, as opposed to just one. Some basic examples are shown below, and the link on lists provided in the above paragraph has more detailed descriptions of indexing and slicing in R. - -[source, r] ----- -# create our list, then append a new element -list1 <- list("Jackson", "is terrified of", "spiders") -list1 <- append(list1, "and cockroaches") -print(list1) -print("--------------------------------") - -# print a few elements of our list using indexing -print(list1[1]) # prints "Jackson" -print(list1[3]) # prints "spiders" -print(list1[4]) # prints "and cockroaches" -print("--------------------------------") - -# slice our list to get the two middle elements -print(list1[2:3]) # prints ['is terrified of', 'spiders'] -print("--------------------------------") ----- - -In a more "big data" sense, you can also index into R dataframes! This can be done numerically, like we did with regular lists, or by the name of the column! Below is an example of how we did this in a previous project: - -[source, r] ----- -# read in our data -olympics_df <- read.csv("/anvil/datasets/tdm_temp/data/olympics/athlete_events.csv") - -# index into the dataframe and get the "Year" column -year_list <- olympics_df$"Year" ----- - -[IMPORTANT] -==== -Not all data is the same! 
`.csv` stands for `comma-separated-values`, and as such, the `read.csv` function that we've been using is looking for commas between each bit of data. However, commas are only one valid separator, and many data files will use pipes `|` or even just spaces to separate data instead. Our `read.csv()` function can still read these in, but you'll have to specify the separator if its not commas. For pipe-separated data (like in this project), you can use something that looks like `read.csv("data.csv", sep="|")` -==== - -For this problem, we are going to introduce some new data from https://data.whin.org/[WHIN], a large weather analysis organization that helps integrate weather and agricultural data in the Wabash Heartland region (that's all around Purdue!). Your tasks are as follows: - -- read the data from `/anvil/datasets/tdm_temp/data/whin/observations.csv` into a dataframe called `obs_df`(Hint: Don't forget to specify the separator! This file uses pipes `|`.) -- index into your `obs_df` dataframe, and store the "temperature_high" column to a new variable called `tempF_list` -- With your newly formed `tempF_list`, print the 101st element - -[NOTE] -==== -If you want to take a look at a small summary of your dataframe, the `head()` function will print the first 5 rows of your data, along with the names of the columns of your data (if they exist). The syntax for this is `head(obs_df)` -==== - -.Deliverables -==== -- a new R dataframe called `obs_df` -- a new list that is the `temperature_high` column of `obs_df` called `tempF_list` -- the 101st element in the `tempF_list` -==== - -=== Question 2 (2 pts) - -Now that we have some idea about how we can store lists of data, let's talk about repetitive tasks. To put it concisely: repetition is bad. When writing code, there should be a focus on avoiding unnecessary repititions, which will help ensure readability and good formatting along with improving our code's speed. When it comes to avoiding repititions, looping structures save the day! - -There are two basic kinds of loops in R: `for` loops, and `while` loops. Their names also encapsulate how they work; `for` loops do some actions _for_ each item in some set/list of items. `while` loops perform some actions _while_ some condition is true. Below are a few basic examples of how these structures can be used with lists. - -[source, r] ----- -ourlist <- list("One-eyed", "One-horned", "Flying Purple", "People Eater") - -# this goes through each number from 0 to 4 and uses it to index into our list -for (i in (1:4)) { - print(paste("The value of i:", i)) - print(paste("List element", i, ":", ourlist[i])) -} - -# we can also iterate directly through a list in R, like this -for (j in ourlist) { - print(j) -} - -# if we introduce a counter variable, we can do the same thing with a while loop! -counter <- 0 -while (counter < length(ourlist)) { # length(ourlist) gives us the length of ourlist - print(paste("The value of counter:", counter)) - print(paste("List element", counter, ":", ourlist[counter])) - counter <- counter + 1 # if you don't update counter, the loop runs forever! -} ----- - -While `for` and `while` loops can often be used to perform the same tasks, one of them will often present a more intuitive approach to completing a task that is worth thinking about before diving straight into the problem. - -Here are a few basic tasks to complete for this problem to get you more familiar with looping: - -- Construct a list of length 10. Call it `mylist`. The elements can be anything you want. 
-- Using a `for` loop, change all of the even-index elements of the list to be the string "foo" (You can consider `0` to be even) -- Using a `while` loop, change all of the odd-index elements of the list to be the string "bar" -- Using a `for` loop, change all of the elements whose index is divisible by 3 to be "buzz" -- print the final list `mylist` after making all of the above changes - -[NOTE] -==== -Your final list should be `['bar', 'foo', 'buzz', 'foo', 'bar', 'buzz', 'bar', 'foo', 'buzz', 'foo']` -==== - -.Deliverables -==== -- a list, `mylist`, of length 10, where each element is either foo, bar, or buzz based on the above instructions -- the final list `mylist` after making the `foobarbuzz` changes -==== - -=== Question 3 (2 pts) - -Let's bring the looping we just learned to the real-world data we read into our `obs_df` dataframe from Question 1! In this problem, we're going to use looping to perform two tasks. One of these tasks is better suited for a `while` loop, and the other is better suited for a `for` loop. You can get full credit no matter which loop you use for which task. Just ensure that you use each loop only once, and that you complete the tasks' deliverables. - -. If you're an in-state student, you likely didn't have any problem with the temperatures we looked at earlier. However, for most of the rest of the world, it certainly would be a concern to see a number like `63` on their thermometer! For this task, we want you to take the list you created in question 1, `tempF_list`, convert the first 10,000 values to Celsius, and store them in a new list called `tempC_list`. (Conversion from Fahrenheit to Celsius is simply `Cels = (Fahr - 32) * 5/9`) - -. With our newly created `tempC_list`, we now have a list of temperatures around the Wabash heartland that are in a more accessible form. However, we want to do more than just unit conversion with this data! For this task, print a count of how many times in `tempC_list` the temperatures are higher than 24 degrees Celsius in the first 10,000 elements in the list. Also print what percentage of those elements are greater than 24 degrees Celsius (Hint: % = (count / total) * 100) - -[NOTE] -==== -Appending to a list using the `append` function can actually be pretty slow, and there are some vastly better ways of performing these tasks in R than using loops. We'll cover those in the next two questions, but if your code is taking a long time to run, try adding new values to `tempC_list` by just using `tempC_list[i] \<- # conversion stuff` instead of `append()`. -==== - - -.Deliverables -==== -- The `tempF_list` from Question 1 converted to Celsius -- The number of temperatures in `tempC_list` greater than 24 degrees Celsius -- The percentage of `tempC_list` greater than 24 degrees Celsius -==== - -=== Question 4 (2 pts) - -Fantastic! We learned what loops were, used them on a few small lists of our own creation, and then successfully applied them to real-world data in order to complete practical tasks! At this point, you're probably thinking "Wow! Lists are super useful! I'm so glad I learned all there is to know and I never have to learn anything else again!" - -...But what if I told you there was an even better way to work with lists? Introducing: vectorization. When we want to perform common actions to every element in a list, array, dataframe, or similar, R presents us with easy ways to do that action, in parallel, to all the items in our list.
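As a rough sketch of what that looks like (using a small made-up vector of temperatures rather than the project data), here is the same Fahrenheit-to-Celsius conversion written first as a loop and then as a single vectorized line:

[source, r]
----
# a small made-up vector of Fahrenheit temperatures
toy_fahr <- c(32, 68, 63, 100)

# loop version: convert each element one at a time
toy_cels_loop <- c()
for (i in 1:length(toy_fahr)) {
    toy_cels_loop[i] <- (toy_fahr[i] - 32) * 5/9
}

# vectorized version: the same conversion applied to every element at once
toy_cels_vec <- (toy_fahr - 32) * 5/9

# both print the same converted values
print(toy_cels_loop)
print(toy_cels_vec)
----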
This is not only a lot easier to read than a loop (it takes about 1 line of vectorized code to do the same task as the 3-4 lines of looping we wrote earlier), its also a lot more efficient, as there are some neat tricks going on behind the scenes to speed things up. - -In the same vein of thinking, we can also slice our lists/arrays/dataframes based on conditions. This also ends up being a lot more readable and efficient than looping, and is only a slight extension to the idea of slicing we covered earlier in this project. - -Below are some examples that are relevant to the tasks you'll be working on during this problem. - -[source, r] ----- -# read in the data -obs_df <- read.csv("/anvil/datasets/tdm_temp/data/whin/observations.csv", sep="|") - -# use vectorized operations to create a new column in our -# dataframe with temperatures converted to the rankine scale -obs_df$"temperature_Rankine_high" <- obs_df$"temperature_high" + 459.67 - -# use vectorized operations to create a list as a subset of the temperature_high column -temperature_under75_high <- obs_df$"temperature_high"[obs_df$"temperature_high" < 75] - -# print the first few entries in our new column -print(head(obs_df$"temperature_Rankine_high", 3)) -print(head(temperature_under75_high, 3)) ----- - -For this problem, create a new column in your dataframe called `myaverage_temp`. This column should be the sum of the `temperature_high` and `temperature_low` divided by 2. - -[NOTE] -==== -If you run `head(obs_df$myaverage_temp)`, the first six elements in the column should be 70.5, 69.5, 76.5, 76, and 76, 75.5. -==== - -.Deliverables -==== -- a new column, `myaverage_temp`, that is the average of the `temperature_high` and `temperature_low` columns -==== - -=== Question 5 (2 pts) - -Let's finish up this project by taking the loops we wrote in Question 3 and rewriting them as one-line vectorized operations. Let's briefly rehash the loops we need to vectorize for this problem. - -. Write a one-line vectorized operation that creates a new column in our dataframe, `temperature_high_celsius`, that is the `temperature_high` column with its values converted from Fahrenheit to Celsius. -. Write a one-line vectorized operation that creates a new list, `my_hightemps`, with all of the values from the `temperature_high_celsius` that are greater than or equal to 24 degrees celsius -. Print the head of your new column and list (hint: this is demonstrated in the previous question) - -The example code provided in the previous problem is quite similar to what you're being asked to do in this problem, so feel free to use it as a starting point! - -.Deliverables -==== -- The `temperature_high_celsius` column as described above -- The `my_hightemps` list as described above -- The heads of each column/list -==== - -== Submitting your Work - -Whew! That project was tough! Looping, indexing, and vectorization are extremely important and powerful concepts, and its no small feat that you made it through this project! If you still feel that it would be tough for you to write a loop or vectorized operation from scratch, consider going back and slightly modifying questions, coming up with your own problems and solutions as practice. - -Next week we will slow down a bit and talk about _semantic structure_, the art of writing and commenting your code so it is beautiful, readable, and easy to understand. If these last couple projects have been a bit intense, this next one should be a welcome relief. 
As always, attend seminar, post to Piazza, and otherwise come to some office hours and get any and all the help you need! I hope that you are enjoying the class so far, and I look forward to continuing to learn with you all next week. - -.Items to submit -==== -- firstname_lastname_project3.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project3.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project3.adoc deleted file mode 100644 index f292b4303..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project3.adoc +++ /dev/null @@ -1,170 +0,0 @@ -= TDM 10100: R Project 3 -- 2024 - -**Motivation:** Now that we are comfortable with the `table` command in R, we can learn about the `tapply` command. The `tapply` will apply a function to values in one column, which are grouped according to another column. This sounds abstract, but once you see some examples, it makes a lot of sense. - -**Context:** `tapply` takes two columns and a function. It applies the function to the first column of values, split into groups according to the second column. - -**Scope:** `tapply` in R - -.Learning Objectives: -**** -- Learning about how to apply functions to data in groups -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/election/itcont1980.txt` -- `/anvil/projects/tdm/data/icecream/combined/products.csv` -- `/anvil/projects/tdm/data/flights/subset/1990.csv` -- `/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` -- `/anvil/projects/tdm/data/olympics/athlete_events.csv` - -== Questions - -[IMPORTANT] -==== -As before, please use the `seminar-r` kernel (not the `seminar` kernel). You do not need to use the `%%R` cell magic. -==== - -Three examples of the `tapply` function: - -*Example 1* Using the 1980 election data, we can find the amount of money donated in each state. - -[source, R] ----- -library(data.table) -myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="") -names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID") ----- - -We take the money for the donations (in the `TRANSACTION_AMT` column), group the data according to the state (in the `STATE` column), and sum up the donation amounts: - -`tapply(myDF$TRANSACTION_AMT, myDF$STATE, sum)` - -++++ - -++++ - - -*Example 2* Using the ice cream products data, we can find the average rating for each brand. 
- -[source, R] ----- -library(data.table) -myDF <- fread("/anvil/projects/tdm/data/icecream/combined/products.csv") ----- - -We take the rating (in the `rating` column), group the data according to the brand (in the `brand` column), and take an average of these reviews: - -`tapply(myDF$rating, myDF$brand, mean)` - -++++ - -++++ - -*Example 3* Using the 1990 airport data, we can find the average departure delay for flights from each airport. - -[source, R] ----- -library(data.table) -myDF <- fread("/anvil/projects/tdm/data/flights/subset/1990.csv") ----- - -We take the departure delays (in the `DepDelay` column), group the data according to airport where the flights depart (in the `Origin` column), and take an average of these departure delays: - -`tapply(myDF$DepDelay, myDF$Origin, mean)` - -The values show up as "NA" (not available) because some values are missing, so R cannot take an average. In such a case, we can give R the fourth parameter `na.rm=TRUE` so that it ignores missing values, and we try again: - -`tapply(myDF$DepDelay, myDF$Origin, mean, na.rm=TRUE)` - -[TIP] -==== -For Dr Ward, using the Firefox browser, 1 core was enough for this entire project, but Dr Ward met one student who demonstrated that he needed 2 cores, even in Firefox. So if you cannot load the 1990 flights subset with 1 core, then you might want to try it with 2 cores. Please make sure that you are using Firefox. -==== - -++++ - -++++ - -=== Question 1 (2 pts) - -Using the 1990 airport data, find the average arrival delay for flights arriving to each airport. - -[TIP] -==== -The arrival delays are in the `ArrDelay` column, and the planes arrive at the airports in the `Dest` (destination) column. -==== - -[TIP] -==== -In the three examples at the start of the project (before Question 1), we used: - -[source, R] ----- -library(data.table) -myDF <- fread( put my file location here ) ----- - -to load the data. I recommend that you use the `fread` function to load your data too (rather than `read.csv`). -==== - - -=== Question 2 (2 pts) - -In the grocery store file: - -`/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` - -Find the sum of the amount spent (in the `SPEND` column) at each of the store regions (the `STORE_R` column). - - -=== Question 3 (2 pts) - -In the grocery store file (same file from question 2): - -Find the total amount of money spent in 2016 altogether, and the total amount of money spent in 2017 altogether. (You can use the `tapply` to do this with just one cell.) - - -=== Question 4 (2 pts) - -In the Olympics file `/anvil/projects/tdm/data/olympics/athlete_events.csv` - -Find the average height of the athletes in each country (the country is the `NOC` column). - -[TIP] -==== -Remember to use `na.rm=TRUE` because some of the athelete heights are missing. -==== - -=== Question 5 (2 pts) - -In the Olympics file (same file from question 4): - -Find the average height of the athletes in each sport (the sport is the `Sport` column, of course!). After finding these average heights, please sort your results. In which sport are the athletes the tallest (on average)? Does this make sense intuitively, i.e., is height an advantage in this sport? - -[TIP] -==== -Again, remember to use `na.rm=TRUE` because some of the athelete heights are missing. -==== - - -== Submitting your Work - -We only learned about `tapply` in this project because it is a short week, but it is powerful! As always, please ask any questions you have, on Piazza, or in office hours. 
We hope you have a nice Labor Day weekend! - -.Items to submit -==== -- firstname_lastname_project3.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project4-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project4-teachinglearning-backup.adoc deleted file mode 100644 index 7b255c126..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project4-teachinglearning-backup.adoc +++ /dev/null @@ -1,198 +0,0 @@ -= TDM 10100: R Project 4 -- 2024 - -**Motivation:** Being able to write R to perform tasks and analyze data is one thing. Having other people (or you, in the future) read your code and be able to interpret what it does and means is another. Writing clean, organized code and learning about basic syntax and whitespace rules is an important part of data science. This project will be dedicated to exploring some syntactic and whitespace-related rules that we've glossed over in previous projects, along with exploring some industry standards that are good to keep in mind when working on your own projects - both for this class and in the rest of your life. - -**Context:** We'll continue to use conditionals, lists, and looping as we move forward, but we won't be spending as much time on reviewing them individually. Feel free to review past weeks' projects and work for refreshers, as the groundwork we've laid up to this point will be the foundations we build on for the rest of this semester. - -**Scope:** Syntax, whitespace, nesting/code blocks, styleguides - -.Learning Objectives: -**** -- Know what syntax is and why its important -- Understand the role code blocks play in R and how to use them -- Develop some basic ideas about how to make your code look cleaner and limit nesting/spaghetti code -- Read up on some basic industry standards regarding style -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/olympics/athlete_events.csv` - -== Questions - -=== Question 1 (2 pts) - -Firstly, let's explore _syntax_. Syntax is, simply put, a set of rules that are agreed upon to dictate how to structure a language. In the code world, _syntax_ often refers to the specific **keywords** that reside in each programming language, along with what symbols are used for things like operators, looping and conditionals, functions, and more! Additionally, syntax can refer to spacing. 
For example, while the below code is valid and will not produce errors if run by a _C_ compiler that is following _C syntax_, it would not work at all if run by an R interpreter following _R syntax_: - -[source, C] ----- -for (i=0; i<10; i++) { - printf("We're on loop %d", i); -} ----- - -Another good example would be operators. In `Python`, for example, the modulus operator is `%`. In `R`, however, you (hopefully now) know that the modulus operator is `%%`. Below is some R code using concepts we covered in previous projects. However, it has some syntax errors in it that make it error out. Your task in this question is to find the syntax errors, correct them, and run the code to figure out the secret sentence that is printed when the code is correct! (Hint: each line has one syntax error to fix excluding closing `}` lines, for a total of 6 syntax errors to fix! Running the code should give you hints as to what the errors are.) - -[NOTE] -==== -The `seq()` function you see below is likely intuitive, but we'll define it more concretely: It creates a list from the `from` argument, to the `to` argument, in steps of the `by` argument. To create a list like `c(1, 4, 7, 10, 13, 16, 19, 22)` we could use `seq(from = 1, to = 22, by = 3)`. Its very useful for going throw certain indices in a list, and the below code is an example of this. -==== - -[NOTE] -==== -You will also notice the `cat()` function below, which you can think of as the same as the `print()` function we've been using, but printing everything on one line instead of having a whole bunch of messy output. Feel free to swap `cat()` and `print()` to see the difference between their outputs. -==== - -[source, r] ----- -secret < c("P", "i", "ur", "s", "du", " a", "e ", "ma", "Dat", "z", "a", "i", " M", "ng", "ine ", "!") - -for i in seq(from = 1, to = length(secret), by = 2) { - cat(secret[i] -} -for (i within seq(from = 1, to = length(secret))) { - if (i % 2 == 0) { - cat(secret(i)) - } -} ----- - -.Deliverables -==== -- The fixed version of the above code, and the secret sentence that results from running it -==== - -=== Question 2 (2 pts) - -As we move from the idea of syntax onto style and code cleanliness, let's first discuss an important concept in R: 'code blocks'. While languages like Python use whitespace and indentation to denote when code is "inside of" a loop or conditional, most languages (R included) denote when code falls "inside of" other code by creating "code blocks", which are simply blocks of code surrounded by some wrapping symbols like `{}`. We've been doing this automatically in previous projects, but now let's explore it more intentionally. Take a look at the two examples below: - -[source, r] ----- -# example 1 -for (i in seq(1,5)) { - cat("Loop Number ", i, "\n") - cat("Loop complete!") -} - -# example 2 -for (i in seq(1,5)) { - cat("Loop Number ", i, "\n") -} - cat("Loop complete!") ----- - -[NOTE] -==== -The `\n` character inside the above `cat()` function is called a newline, and will start our next print statement on the next line. Feel free to remove it and re-run the examples to see the difference it makes. -==== - -As you can see by running this code in a Jupyter notebook, the results of each example are drastically different based only on placement of curly braces `{}`. - -While often the R interpreter will catch errors in your code in advance and stop the code from running, this is not always the case (as demonstrated in the above examples). 
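As one more toy illustration (hypothetical snippets, not part of your task), compare a mistake that R refuses to run at all with one that runs without complaint and simply does the wrong thing:

[source, r]
----
# 1) a syntax error: the interpreter stops immediately and reports it
#    (uncomment to see the error -- the closing brace is missing)
# for (i in 1:3) {
#     print(i)

# 2) a code block mistake that still runs: the "all done" message was
#    meant to print once after the loop, but the brace placement puts
#    it inside the loop, so it prints on every single pass
for (i in 1:3) {
    print(i)
    print("all done!")
}
----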
Many times, when code blocks are not arranged as intended, errors that don't stop your code from running will happen. These are often called 'runtime errors' and can be tricky to catch until they start causing unintended results in your code. - -Below is some R code to count the number of times the number "4" appears in a randomly generated list of 1000 numbers. However, this code contains 2 errors with code blocks. Fix it so that it correctly counts the number of times "4" appears in our list. - -[source, r] ----- -# generate a 1000 number list of random numbers from 1-100 -number_list <- sample(seq(1,100), 1000, replace=TRUE) -count <- 0 - -for (number in number_list) { - if (number == 4) { - cat("4 Detected!\n") - } - count <- count + 1 - cat("Loop complete! Total number of 4's: ", count, "\n") -} ----- - -.Deliverables -==== -- Results of running the code above after correcting the two code block errors present -==== - -=== Question 3 (3 pts) - -Great! We now have a more formal idea behind the indentation we've been doing throughout our projects so far. Now let's explore the concept of `nesting`. `Nesting` is when some code falls 'within' other code. For example, actions within a conditional or a for loop are nested. Generally, we try and keep nesting to a minimum, as tracking 10 levels of indentation in your code to see what falls within where can be quite difficult visually. Here is an important example to prove that being careful while nesting is necessary, using the Olympics data we used in a previous project: - -[source, r] ----- -# read in our olympics dataframe -olympics_df = read.csv("/anvil/projects/tdm/data/olympics/athlete_events.csv") - -# pick just the olympian from row 200 of our dataframe -my_olympian = olympics_df[200, ] - -# what does any of this mean? Very unreadable, bad code -if (my_olympian$"Sex" == "M") { - if (my_olympian$"Age" > 20) { - print("Class 1 Athlete!") - if (my_olympian$"Age" < 30) { - print("Class 2 Athlete!") - } - if (my_olympian$"Height" > 180) { - if (my_olympian$"Weight" > 60) { - print("Class 3 Athlete!") - } - } - print("Class 4 Athlete!") - } -} ----- - -[IMPORTANT] -==== -In the context of the above example, an olympian can be an athlete of multiple classes at the same time. -==== - -If you think this code is unreadable and its hard to tell what it means to be a class 1 vs 2 vs 3 vs 4 athlete (classes entirely made up), you're correct. Nesting unnecessarily and in ways that don't make code easy to read can quickly render a decent project into unreadable spaghetti. - -Take a good look at the above code. Are there any unnecessary classes that mean the same thing? How could you rewrite it using all that you've learned so far to make it more readable (for example, using _else-if_ and _else_)? For this question, copy this code into your Jupyter notebook and make changes to render it readable, reducing nesting as much as possible. Your final code should have the following features: - -- 3 classes, with the one unnecessary duplicate class removed -- No more than a maximum level of nesting of 2 (aka, 3 blocks deep on the deepest level) -- Should produce the same results as the messy code (minus the unnecessary class) - -[NOTE] -==== -One good way to test your work here would be to run your clean version and the messy version on a couple different olympians (by changing `X` in the `my_olympian = olympics_df.iloc[X]` line) and making sure both versions produce the same results. 
-==== - -.Deliverables -==== -- A cleaned up version of the messy code provided -- The results of running both clean and messy versions of the code on the same athlete -==== - -=== Question 4 (3 pts) - -For our last question on this project, we want you to explore some different style conventions suggested as standards for writing R, and write about a few that sound interesting to you. Please visit http://adv-r.had.co.nz/Style.html[this R Style Guide] by famous statistician and R contributor, https://en.wikipedia.org/wiki/Hadley_Wickham[Hadley Wickham], and pick 3 different conventions discussed in the guide. For each convention, write a snippet of code that demonstrates the convention. At the end of the question, in a markdown cell, write at least a sentence or two about each convention describing what it is and why it is important. - -.Deliverables -==== -- 3 R code snippets demonstrating three different style conventions -- a markdown cell with at least 3-6 sentences describing the conventions picked and their utility -==== - -== Submitting your Work - -If you're at this point, you've successfully capped off our introduction to whitespace, nesting, and styling code in R. Leaving this project, you should have a better understanding of a lot of the less straightforward elements of writing code and how more abstract concepts like style and indentation can drastically affect the quality of your code, even if it functions as intended. Remember that this was only an introduction to the topics, and throughout your career you'll always be picking up new tricks and style conventions as you gain more experience and meet new people. - -Next week, we'll look more deeply at variables, variable types, and scope, and learn how profound the statement `x <- 4` in R really is! - -.Items to submit -==== -- firstname_lastname_project4.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project4.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project4.adoc deleted file mode 100644 index 6d6734b10..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project4.adoc +++ /dev/null @@ -1,196 +0,0 @@ -= TDM 10100: R Project 4 -- 2024 - -**Motivation:** We continue to practice using the `tapply` function. - -**Context:** `tapply` takes two columns and a function. It applies the function to the first column of values, split into groups according to the second column. - -**Scope:** `tapply` in R - -.Learning Objectives: -**** -- Learning about how to apply functions to data in groups -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. 
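As a quick refresher on the pattern described above (a toy sketch with made-up values, not one of the datasets listed below), `tapply` groups its first argument according to its second argument and applies the function to each group:

[source, R]
----
# made-up donation amounts and the state each donation came from
amounts <- c(10, 25, 5, 40, 15)
states  <- c("IN", "IL", "IN", "IL", "IN")

# sum the amounts within each state: IL gets 25 + 40, IN gets 10 + 5 + 15
tapply(amounts, states, sum)
----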
- -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/death_records/DeathRecords.csv` -- `/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` -- `/anvil/projects/tdm/data/beer/reviews_sample.csv` -- `/anvil/projects/tdm/data/election/itcont1980.txt` -- `/anvil/projects/tdm/data/flights/subset/1990.csv` - - -== Questions - -[IMPORTANT] -==== -As before, please use the `seminar-r` kernel (not the `seminar` kernel). You do not need to use the `%%R` cell magic. -==== - -[TIP] -==== -If you session crashes when you read in the data (for instance, on question 2), you might want to try using 2 cores in your session instead of 1 core. -==== - -=== Question 1 (2 pts) - -In the death records file: - -`/anvil/projects/tdm/data/death_records/DeathRecords.csv` - -Find the mean `Age` of death for each `Sex`. - -.Deliverables -==== -- Show the mean `Age` of death for each `Sex`. -==== - - -=== Question 2 (2 pts) - -In the grocery store file: - -`/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` - -Find the total amount of money spent on each `PRODUCT_NUM`. (You can just add up the values in the `SPEND` column, grouped according to the `PRODUCT_NUM` value. You can ignore the `UNITS` column.) Display the top 10 types of `PRODUCT_NUM`, according to the total amount of money spent on those products (i.e., according to the `sum` of the `SPEND` column for those 10 `PRODUCT_NUM` values). - -.Deliverables -==== -- Show the top 10 types of `PRODUCT_NUM`, according to the total amount of money spent on those products (i.e., according to the `sum` of the `SPEND` column for those 10 `PRODUCT_NUM` values). -==== - -=== Question 3 (2 pts) - -In this file of beer reviews `/anvil/projects/tdm/data/beer/reviews_sample.csv` - -Consider the mean beer scores on each date. - -Find the three dates on which the mean score is a 5. - -[HINT] -==== -A mean `score` of "5" is a perfect score, so you can use the `tapply` function, taking the mean of the `score` values, grouped according to the `date`. You will need to sort the results and consider the tail. -==== - -.Deliverables -==== -- Show the three dates on which the mean `score` is a 5. -==== - - -=== Question 4 (2 pts) - -Revisit the video at the very *beginning* of Project 3, Example 1, in which we found the amount of money donated in each state, during the 1980 federal election cycle. - -`/anvil/projects/tdm/data/election/itcont1980.txt` - -This time, instead of finding the amount of money donated in each state in 1980, please find the amount of money donated in each city-and-state pair. - -To accomplish this, paste the city and state together like this: - -`paste(myDF$CITY, myDF$STATE, sep=", ")` - -We use a comma and a space for the separator in the paste function. Take the paste and use it as the *second* element in Project 3, Example 1, so that we group the data according to the `CITY` and the `STATE`. The goal is to show the top 20 city-and-state pairs, according to the amount of money donated. - -[HINT] -==== -Your answer will need to use the `sort` function and the tail function, like this: - -`tail(sort(tapply( ..., ..., ... 
)), n=20)` -==== - - -[HINT] -==== -Here are the top 6 city-and-state pairs (notice that the top result has a blank city-and-state pair, namely, many of the donations have a blank city-and-state): - -[source, bash] ----- -WASHINGTON, DC - 4273606 -LOS ANGELES, CA - 4569952 -DALLAS, TX - 4748262 -HOUSTON, TX - 7606806 -NEW YORK, NY - 11345027 -, - 17299729 ----- - -In your solution, you need to show the top 20 of the top city-and-state pairs. -==== - -.Deliverables -==== -- Show the top 20 city-and-state pairs, according to the amount of money donated. -==== - - -=== Question 5 (2 pts) - -Revisit the video at the very *beginning* of Project 3, Example 3, in which we studied the departure delays (`DepDelay`) in the 1990 flight data: - -`/anvil/projects/tdm/data/flights/subset/1990.csv` - -This time, instead of finding the mean departure delays according to where the flights depart (in the `Origin` column), please find mean departure delays on each Month / DayofMonth / Year triple - -To accomplish this, paste these three columns together like this: - -`paste(myDF$Month, myDF$DayofMonth, myDF$Year, sep="/")` - -We use a slash for the separator in the paste function. Take the paste and use it as the *second* element in Project 3, Example 3, so that we group the data according to the Month / DayofMonth / Year triple. The goal is to show the worst 6 dates from 1990, according to the largest mean departure delay (`DepDelay`) values. - -[HINT] -==== -Your answer will need to use the `sort` function and the tail function, like this: - -`tail(sort(tapply( ..., ..., mean, na.rm=TRUE)))` -==== - - -[HINT] -==== -Here are the worst two dates from 1990, according to the largest mean departure delay (`DepDelay`) values. - -[source, bash] ----- -12/22/1990 - 45.2222488995598 -12/21/1990 - 45.6617816091954 ----- - -In your solution, you need to show the worst 6 dates from 1990, according to the largest mean departure delay (`DepDelay`) values. -==== - -.Deliverables -==== -- Show the worst 6 dates from 1990, according to the largest mean departure delay (`DepDelay`) values. -==== - - - - - -== Submitting your Work - -We hope that you enjoyed this additional practice with the `tapply` function in this project! - -.Items to submit -==== -- firstname_lastname_project4.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project5-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project5-teachinglearning-backup.adoc deleted file mode 100644 index d2a2c25c0..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project5-teachinglearning-backup.adoc +++ /dev/null @@ -1,200 +0,0 @@ -= TDM 10100: R Project 5 -- 2024 - -**Motivation:** So far in this class, we've been storing values to variables without really discussing any of the specifics of what's going on. In this project, we're going to do a detailed investigation of what _variables_ are, the different _types of data_ they can store, and how the _scope_ of a variable can affect its behavior. - -**Context:** There will be callbacks to previous projects throughout this project. Knowledge of basic operations with reading in and working with dataframes, constructing conditionals and loops, and using vectorized functions will be used in this project. - -**Scope:** Variables, types, and scoping - -.Learning Objectives: -**** -- Understand the concept of variables more widely -- Know the common types in R and how to use them -- Understand what scoping is and basic best practicess -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/bay_area_bike_share/baywheels/202306-baywheels-tripdata.csv` - -== Questions - -=== Question 1 (2 pts) - -A variable, at its most foundational level, is simply a _named area in memory_ that we can store values to and access values from by referencing its given name. In R, variable names are considered valid as long as they start with either an uppercase or lowercase letter in addition to not being any of the _Reserved Keywords_ built into R (examples of reserved keywords are `TRUE`, `None`, `and`, and more. A full list of reserved keywords can be https://www.geeksforgeeks.org/r-keywords/[found here]). Running `help(reserved)` will also print a list of reserved words in your Jupyter notebook. - -To review some of the concepts we've used in previous projects, your task in this question is to perform the following basic operations and assignments using variables: - -- Create a variable named `myname` and assign it the value of your name -- Create a variable named `myage` and assign it the value of your age -- Create a variable named `my_fav_colors` and assign it your top 3 favorite colors -- Create a variable named `about_me` and assign to it a list containing `myname`, `myage`, and `my_fav_colors` -- print `about_me` - -.Deliverables -==== -- The four lines of code specified above, and the results of running that code -==== - -=== Question 2 (2 pts) - -Alright, let's quickly review your work from the last problem. In your assignment statements, you likely used quotes around your name, nothing around your age and `c()` around your lists. But why did you do that? - -The answer: `types`. Data can come in different types, and each type of data has a specific notation used to denote it when writing it. 
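One quick way to see this for yourself (just an illustrative sketch with toy literals, not part of the graded task) is to ask R what type it stored each value as, using the `class()` function:

[source, r]
----
# each literal below uses a different notation, and R records a different type
class("Hello World!")     # "character"
class(42L)                # "integer" (the L suffix asks for a whole-number type)
class(3.14)               # "numeric"
class(TRUE)               # "logical"
class(list(1, "a"))       # "list"
----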
Let's quickly run through some basic types in R: - -- Characters are used to store text data like `"Hello World!"` (In other languages, these are often called _strings_, whereas characters would only refer to single character data like `"a"`) -- Integers are used to store whole number data like `5`, `1000`, or `-30` -- Numeric are used to store decimal numbers like `5.534234` or `0.1` (but can also hold whole numbers like integers can) -- Lists are used to store lists of values. In R, lists can contain different types at the same time. As demonstrated in the previous example, lists can also contain other lists! -- Booleans are logical truth values. The two main R booleans are `TRUE` and `FALSE` -- Sets, dicts, tuples and more data types also exist in R! In the next project we'll cover sets, dicts, and tuples in greater detail, as they can be very useful for organizing data. For now, just keep in mind that they exist. - -That's a lot, and these are just the basic types in R! When we import a library like `tidyverse`, we also get any of the types they define! `Dataframes` are their own type as well, and each column of a dataframe also typically has a type! - -Let's take a look at some real data and types. Read the Baywheels dataset (located at "/anvil/projects/tdm/data/bay_area_bike_share/baywheels/202306-baywheels-tripdata.csv") into a dataframe called `bike_data`. - -Once you've read the data in, use the https://stat.ethz.ch/R-manual/R-devel/library/utils/html/str.html[`str()` function] to list the data type of each column. Then use `head()` to print the first few rows of our dataframe. - -Read through this https://www.w3schools.com/r/r_data_types.asp[documentation page] and, in a markdown cell, give a brief description of the types of data present in our dataframe and what they are used for. You can use `ctrl+f` to search for the type on the documentation page and read the descriptions given. Write at least a sentence or two on each type in our dataframe. - -.Deliverables -==== -- A sentence or two on each of the types in our dataframe and what they are used to store -==== - -=== Question 3 (2 pts) - -Fantastic, we've now got a feel for the different types available in our data. As a bit of an aside, let's spend this question cleaning up our data before we start experimenting on it. When you printed the head of our dataframe, you likely observed a few empty entries. Often, for some reason or another, our data won't always have every column filled in for each row. This is okay, and we will explore ways to handle missing data in future projects, but for now let's learn how to isolate the data that is complete. - -First off, note the size of the dataframe currently. If you don't remember how to do this, we introduced the function in project 1. Refer to the documentation https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/dim[here] for additional help. (https://www.rdocumentation.org[rdocumentation.org] is a **fantastic** resource for R. If you get stuck it is a great place to look up what certain functions/keywords do in R.) - -Next, read through https://tidyr.tidyverse.org/reference/drop_na.html[this documentation page] to get a feel for how the `drop_na()` function works. Be sure to scroll down to the bottom of the page to see a few basic examples. Import the tidyverse library using `library(tidyverse)` and then apply the `drop_na()` function to our dataframe. 
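If it helps to see the mechanics in isolation first, here is a tiny hand-made example (toy data, not the Baywheels file) showing what `drop_na()` does to rows containing missing values:

[source, r]
----
library(tidyverse)

# a tiny data frame where the second row is missing a value
toy_df <- data.frame(name = c("a", "b", "c"),
                     value = c(1, NA, 3))

dim(toy_df)    # 3 rows, 2 columns

# drop_na() removes every row that contains at least one NA
toy_clean <- drop_na(toy_df)
dim(toy_clean) # 2 rows remain
----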
- -[NOTE] -==== -When loading `tidyverse` with `library(tidyverse)`, you may see a warning about conflicted packages. This is due to different versions of the same code, and is an important thing to note, but will not cause any issues in the projects for this class. Go ahead and ignore it for now. If you would rather not have to see that warning in your final submission, running the cell containing `library(tidyverse)` twice will hide the warning. -==== - -Finally, print the size of the dataframe after dropping all the incomplete rows. How much smaller did it get? In a markdown cell, list the beginning and ending sizes of our dataframe, how many rows were dropped for containing bad values, and what percentage of our total dataframe was dropped. - -.Deliverables -==== -- Size of `bike_data` before and after `drop_na()` -- Percentage of `bike_data` lost to `drop_na()` -- Number of incomplete rows in our original `bike_data` dataframe -==== - -=== Question 4 (2 pts) - -Now that we've read in our data, understand the types available to us, and have cleaned up our nonexisting data, let's begin analyzing it and understanding how variables interact with operators. _Generally_, it is not good practice to try and apply operatons to different variables with different types (i.e. `"hello world!" + 5`) and the R interpreter will typically stop you from doing this. Between two variables of the same type, however, many operators have defined behaviors that we haven't yet explored. - -In addition, some types have their own operators that perform specific actions different from other types. For example, integers can be added together like `5 + 3`. However, in R, we can't simply add two strings like `"Hello " + "World!"`. Instead, we use the `paste()` function. Take a look at the below example of how to use paste() - -[source, r] ----- -var1 <- "My name is" -var2 <- "Firstname" -var3 <- "Lastname!" - -sentence <- paste(var1, var2, var3) -cat(sentence) ----- - -The above example is one of _concatenation_, the joining of two or more strings together, and has powerful practical applications. - -Let's explore the power of concatenation. Consider our bike data: if we want to figure out how many bikes we should put at each station, we'll likely need to understand which stations are used most often. Furthermore, we may want to know what trips are made most often, so that we can put more e-bicycle charging ports at spots along those trips. In order to find out what trips are made most often, we _could_ just count the number of trips that have both the same `start_station_id` and `end_station_id` _or_ we could construct a new column from those two columns, and then count our new "compound column" instead, which has the potential for making our code run a _lot_ faster. - -Take a look at the below example, where I am adding the `ride_id` and `rideable_type` columns to create a new column called `id_and_type` and then getting a count of the different id-type combos in our dataframe. Using a very similar structure, combine the `start_station_id` and `end_station_id` columns into a new column called `trip_id`, and return the top 5 trip IDs in our data. - -[NOTE] -==== -You likely noticed that `paste()` inserts a space between each string it is concatenating. Because we don't always want to insert anything between the strings we are joining, we can simply use the `paste0()` function, which does the same thing as the `paste()` function but doesn't insert a space in between each string we are concatenating. 
-==== - -[source, r] ----- -# create new column -bike_data$id_and_type <- paste0(bike_data$ride_id, "->", bike_data$rideable_type) - -# print dataframe to observe new column -print(head(bike_data, 2)) - -# get count of top 5 values for each id-type combo in ascending order -# (note there is only one of each combo) -head(sort(table(bike_data$id_and_type), decreasing=TRUE)) ----- - -[IMPORTANT] -==== -You may notice that we selected a column from our data frame in the above example using `bike_data$id_and_type` instead of `bike_data$"id_and_type"`. While the quotes are not necessary generally, when column names have special characters like ":" or "\<-" R may not be able to read the column name correctly unless its wrapped in quotes. -==== - -[NOTE] -==== -You may have some empty values, and that is okay! We won't worry about it for this problem, and both answers that have the empty values removed and those that don't will be accepted for full credit. -==== - -.Deliverables -==== -- A new column in `bike_data` called `trip_id` -- A count of the top 5 trip IDs in the data -==== - -=== Question 5 (2 pts) - -As a way to finish up this project, let's solve a problem and introduce an important concept that will be extremely relevant in the next few weeks: scope. Scope, simply put, is the level at which a variable exists. Variables with larger scope can be referenced in a wider amount of settings, whereas variables with extremely small scope may only be referenceable within the loop, function, or class that they are defined in. In R, scope really only exists in regards to functions. We'll cover functions in detail soon, but for now, just note that they are similar to loops in that they have a header (similar to `if` or `for`) and body (code within `{}` that is 'inside' the function). When variables are defined in a function, they don't exist outside that function by default. However, rather uniquely to R, variables defined in loops do exist outside the loop by default. - -As a quick example, run the following code in your Jupyter notebook: - -[source, r] ----- -for (i in seq(5)) { - # do nothing -} - -# shows that i exists even after the for loop ends -print(i) - -# define a function -foo = function() { - # inside our function, define a variable then end function - bar <- 3 -} - -# run our function, then try and print bar -# notice that bar does not exist outside the function's body -# so we get an error -foo() -print(bar) ----- - -After you run that code in your notebook, give https://www.r-bloggers.com/2022/09/global-vs-local-assignment-operators-in-r-vs/[this webpage] a read. In a markdown cell, write a sentence or two about what making a variable 'global' using the global assignment operator `<\<-` does. Then, write a sentence or two about how we could use `global` to make `bar` defined, even outside of our function's body. Again, you don't have to understand deeply how functions work at this point. - -.Deliverables -==== -- The results of running the above code -- A sentence or two on the `<\<-` operator -- A sentence or two on how to make `bar` exist outside of `foo()` -==== - -== Submitting your Work - -Now that you've completed this project, you hopefully have a much more in-depth understanding of variables and data types along with an introduction to data cleaning and variable scope! 
This project was quite broad, and next week we will be back to laser-focusing with a detailed investigation into dictionaries, sets, and tuples, three data types we mentioned in this project but warrant their own investigation. After that we'll be moving onto arguably the most important concept in all of code: functions. - -We are getting close to halfway through the semester, so please make sure that you are getting comfortable developing a workflow for these projects and learning the concepts incrementally. A lot of these concepts are very hierarchical: they build on top of each other. If you struggled with something in this project or any of the prior ones, I would encourage you to take advantage of one of the many avenues for getting advice or the opportunity to work with one of our TAs or Dr. Ward, so that going forward you are on the best possible footing for upcoming projects. Have a great rest of your week, and I look forward to working with you all in the next project. - -.Items to submit -==== -- firstname_lastname_project5.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project5.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project5.adoc deleted file mode 100644 index dea5382bd..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project5.adoc +++ /dev/null @@ -1,176 +0,0 @@ -= TDM 10100: R Project 5 -- 2024 - -**Motivation:** Real world data has a lot of missing data. It is also helpful to be able to take a subset of data. - -**Context:** It is worthwhile to be prepared to have missing data and to know how to work with it. - -**Scope:** Dealing with missing data, and taking subsets of data. - -.Learning Objectives: -**** -- Learning about how to work with missing data and how to take subsets of data. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/death_records/DeathRecords.csv` -- `/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` -- `/anvil/projects/tdm/data/beer/reviews_sample.csv` -- `/anvil/projects/tdm/data/election/itcont1980.txt` -- `/anvil/projects/tdm/data/flights/subset/1990.csv` - - -== Questions - -[IMPORTANT] -==== -As before, please use the `seminar-r` kernel (not the `seminar` kernel). You do not need to use the `%%R` cell magic. -==== - -[TIP] -==== -If you session crashes when you read in the data (for instance, on question 2), you might want to try using 2 cores in your session instead of 1 core. 
-==== - -Example 1: - -++++ - -++++ - -Example 2: - -++++ - -++++ - -Example 3: - -++++ - -++++ - -Example 4: - -++++ - -++++ - -Example 5: - -++++ - -++++ - - -=== Question 1 (2 pts) - -In the death records file: - -`/anvil/projects/tdm/data/death_records/DeathRecords.csv` - -a. Build a subset of the data for which `Sex=='F'` and check the head of the subset to make sure that you only have 'F' values in the `Sex` column of your subset. - -b. Make a table of the `Age` values from the subset of female data in question 1a, and plot the table of these `Age` values. (Notice that 999 is used when the `Age` value is missing in part 1b!) - -c. Now revise your subset from question 1a, so that you build a subset of the data for which `Sex=='F' & Age!=999` and then make of table of the `Age` values from this revised subset of female data and plot the table of these `Age` values. - - -.Deliverables -==== -- a. The head of the subset of data for which `Sex=='F'` -- b. Plot of the table of `Age` values for the subset in 1a. -- c. Revise questions 1a and 1b so that `Sex=='F' & Age!=999` -==== - - -=== Question 2 (2 pts) - -In the grocery store file: - -`/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` - -there are more than 10 million lines of data, as we can see if we check `dim(myDF)`. Each line corresponds to the purchase of an item. The `SPEND` column is negative when a purchase is refunded, i.e., the item is returned and the money is given back to the customer. - -Create a smaller data set called `refundsDF` that contains only the lines of data for which the `SPEND` column is negative. Make a table of the `STORE_R` values in this `refundsDF` subset, and show the number of times that each `STORE_R` value appears in the `refundsDF` subset. - -.Deliverables -==== -- Show the number of refunds for each `STORE_R` value in the `refundsDF` subset. (For instance, `CENTRAL` stores had 2750 refunds.) -==== - -=== Question 3 (2 pts) - -In this file of beer reviews `/anvil/projects/tdm/data/beer/reviews_sample.csv` - -Make a subset of the beers that have `(score != 5) & (overall == 5)` (in other words the `score` value is not equal to 5 but the `overall` value is equal to 5). How many lines of data are in this subset? - - -.Deliverables -==== -- How many lines of data are in the subset that has `(score != 5) & (overall == 5)` ? -==== - - -=== Question 4 (2 pts) - -Read in the 1980 election data using: - -[source, R] ----- -library(data.table) -myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="") -names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID") ----- - -There are only 9 entries in which the `TRANSACTION_DT` value is missing, namely: one donation from `CURCIO, BARBARA G` and two donations from `WOLFF, GARY W.` and six donations from *who?? (find their identity)!* Find the name of the person who made six donations in 1980 with a missing `TRANSACTION_DT`. - -.Deliverables -==== -- Find the name of the person who made 6 donations in 1980 with a missing `TRANSACTION_DT`. -==== - - -=== Question 5 (2 pts) - -Consider the 1990 flight data: - -`/anvil/projects/tdm/data/flights/subset/1990.csv` - -This data set has information about 5270893 flights. - -a. 
For how many flights is the `DepDelay` missing and also (simultaneously) the `ArrDelay` is missing too? - -b. For how many flights is the `DepDelay` given but the `ArrDelay` is missing? - -c. For how many flights is the `ArrDelay` given but the `DepDelay` is missing? - -.Deliverables -==== -- a. Find the number of flights for which the `DepDelay` is missing and also (simultaneously) the `ArrDelay` is missing too. - -- b. Find the number of flights for which the `DepDelay` is given but the `ArrDelay` is missing. - -- c. Find the number of flights for which the `ArrDelay` is given but the `DepDelay` is missing. -==== - - -== Submitting your Work - -We are becoming very familiar with missing data and with subsets of data! These concepts take practice. Please continue to ask questions on Piazza, and/or in office hours. - -.Items to submit -==== -- firstname_lastname_project5.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project6-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project6-teachinglearning-backup.adoc deleted file mode 100644 index f9a8ffd4b..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project6-teachinglearning-backup.adoc +++ /dev/null @@ -1,183 +0,0 @@ -= TDM 10100: R Project 6 -- 2024 - -**Motivation:** In previous projects we've employed lists as the main way to store lots of data to a variable. However, R gives us access to plenty of other variable types that have their own benefits and uses and provide unique advantages in data analysis that are extremely important. In this project, we'll be exploring sets, nested lists, and named lists in R, focusing both on learning what they are and how to use them in a practical sense. - -**Context:** Understanding the basics of lists, looping, and manipulation of data in R dataframes will be crucial while working through this project. - -**Scope:** Lists, sets, nested lists, named lists, looping structures, R - -.Learning Objectives: -**** -- Know the differences between sets, nested lists, and named lists -- Know when to use each type of grouping variable, and common operations for each -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/youtube/USvideos.csv` - -== Questions - -=== Question 1 (2 pts) - -Let's jump right into new topics with _named lists_. 
Named lists (and the other structures we'll be covering in this project) are essentially lists with slight modifications to their structure/properties. Conceptually, you can think of named lists as lists, where each element in the list has an associated and unique name. For example, we may have a _named list_ of names and ages for people. The names, in this case, would be people's names, while the elements themselves would be their ages. An important thing to note is that keys should be unique, and thus using ages as the keys in our example would be much worse than using them as values. Take a look at the below code, where we make a named list and then print a couple of values. - -[source, r] ----- -# create a dictionary of names and ages -names_ages_list <- c("Mary Antoinette"= 23, "Charles Darwin"= 100, "Jimmy Hendrix"= 45, "James Cameron"= 69) - -# print the age of James Cameron -cat("James Cameron is", names_ages_list['James Cameron'], "years old") ----- - -For this problem, read the `/anvil/projects/tdm/data/youtube/USvideos.csv` data into an R dataframe called "US_vids". Print the head of that dataframe using `head()`. You'll notice a "category_id" column in the data. That could be useful! But we don't really have any idea what those numbers mean. In this question and the next, we'll create a new column in our dataframe that has the names of those categories. - -To do this, take a look at https://mixedanalytics.com/blog/list-of-youtube-video-category-ids/[this website]. Create a new named list called name_IDs where the names are the names of each category and the elements are the ID numbers of each category. Print the category ID for "Comedy" by indexing ino the named list (similar to how we did above). - -.Deliverables -==== -- The head of your new `US_vids` dataframe -- A named list with the names corresponding to each category. The IDs should be the list elements. -==== - -=== Question 2 (2 pts) - -Now that we have a dictionary that maps our IDs onto their names, we are ready to construct a new column in our dataframe. Luckily, R provides us with a super useful function to perform this lookup and name-element matching for us: `match()`. Read through https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/match[this documentation on the function], and use your `name_ids` list to create a new column in your dataframe called "category", with the name of the category corresponding to the ID in the pre-existing "category_id" column. Once you've done so, print the head of your dataframe to ensure that the new column has been added as you expect. (Hint: https://stackoverflow.com/questions/21422188/how-to-get-name-from-a-value-in-an-r-vector-with-names[This Stack Overflow post] may help guide you as well) - -Then use the https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/table[`table()`] function to print a count of the different categories of our data, and sort it from most frequent to least frequent (Hint: We did this in Project 5, question 4). - -[NOTE] -==== -To validate your work, we will provide you the top 5 most frequent categories and how often they occurred: - -- Entertainment: 9964 -- Music: 6472 -- Howto & Style: 4146 -- Comedy: 3457 -- People & Blogs: 3210 -==== - -.Deliverables -==== -- A new column, `category` in the dataframe `US_vids` -- A count of categories in the data, sorted by most to least frequent -==== - -=== Question 3 (2 pts) - -Now that we've got a working understanding of named lists, let's talk about _sets_. 
If you're familiar with "set theory" in mathematics, you likely already know about these; if not, you're about to learn! A set is similar to a list in that it contains a series of elements. However, the main difference is that sets do not contain any duplicate elements, and they have no order. - -Sets are extremely useful in comparison with each other. For example, lets say I create two sets: A set of all my favorite colors and a set of all your favorite colors. If I wanted to see what colors were both my favorite and your favorite, I could find the "intersection" of those two sets. R has a handy method that does this (and other common set operations) for us. - -[IMPORTANT] -==== -Unlike languages like Python, R has no built-in 'set' data structure. However, if you do `mylist <- unique(mylist)` you essentially ensure that mylist is a functional set. Please use this prior to working with a list as a set in R. -==== - -In this problem, we want to figure out two things: - -. How often do videos with comments disabled have ratings disabled as well? -. What overlap is there between "comedy" videos and videos that have both comments and ratings disabled? - -As some guidance here, you could, for example, construct a set of videos that have comments disabled like so: - -[source, r] ----- -no_comment_vids <- unique(US_vids[which(US_vids$comments_disabled == "True"),]$video_id) ----- - -You could then use, for example, `intersect()` to compare this set to the set of videos with ratings disabled, and compare the total number of videos with comments disabled to those with comments and ratings disabled. (For a full list of set methods, https://stat.ethz.ch/R-manual/R-devel/library/base/html/sets.html[click here]) - -[NOTE] -==== -If you wanted to easily get a set of videos with both comments and ratings on, you could use the intersection of the set of videos with comments on and the set of videos with ratings on. However, you could also get the difference between the set of all videos and the set of videos with either comments, ratings, or neither enabled, but not both. There are almost always multiple ways to solve things with sets. -==== - -.Deliverables -==== -- The proportion or percentage of videos with comments disabled that also have ratings disabled -- The proportion or percentage of "comedy" videos with both comments and ratings enabled -==== - -=== Question 4 (2 pts) - -Interesting. It looks like most comedy videos have most ratings and comments enabled. That makes sense, right? Comedians rely a lot on community feedback to improve their routines, so we would probably expect that they want to encourage things like leaving feedback and voting on whether they liked the video or not. However, we have a _LOT_ of categories in our data. Do you think this will hold for all the others? - -In this question, we want you to create a named list called `category_censorship` where the names are the names of the categories in our data, and the list elements are the percentage of videos in that category that have both comments and ratings enabled. 
We've provided some starter code for you below, and if you use your work from the last question the actual amount of new code you'll have to write will be minimal: - -[source, r] ----- -# create empty list -category_uncensored <- c() - -for (category in unique(US_vids$category)) { - # figure out how much of the category is uncensored using sets - # (Hint: This is very similar to the last problem) - - percent_uncensored <- # Fill this in as needed - - category_uncensored[category] <- percent_uncensored -} - -# fancy printing to make results look nicer -for (name in names(category_uncensored)) { - cat(name, "is", category_uncensored[name], "% uncensored\n") -} ----- - -Be sure to print your final results for the category. If you want to make things look better, you can try and sort your list based on percentage of censored videos, and even make pretty formatting for your printed results, but you don't need to in order to get full credit for this problem. - -.Deliverables -==== -- Your printed `category_censorship` list, defined as described above. -==== - -=== Question 5 (2 pts) - -Let's finish up the project by discussing nested lists. As we briefly discussed in the last project, lists can hold different types of data including, you guessed it, more lists! While this may seem convoluted and ridiculous, it is actually used all the time. Dataframes themselves are essentially, at their most basic, very similar to nested lists. - -One powerful utility of nested lists is organizing data in a tabular way, where, for example, the wrapping list is used as a list of the rows in our table, and the inner list is a list of each column of data for each row. - -For this question your task is to create your own table. Choose some subset of the `US_vids` dataframe (for example, comedy videos only) and create a table using nested lists for the rows and a list to store all the rows. Be sure that the first row in your table is made up of the column headers. - -To complete the question, run the relevant section of the below code to print out the first 5 entries of your table. - -[NOTE] -==== -If you're struggling at figuring out how to do this, take a look at https://stackoverflow.com/questions/14730001/converting-a-data-frame-to-a-list-of-lists[this post] for a good starting point. -==== - -[source, r] ----- -#print the first few rows of your table -head(mytable) ----- - -.Deliverables -==== -- A table of your own design that uses nested lists to store data -- The results of running the provided print statements -==== - -== Submitting your Work - -This project caps our section of the course on basic variable types and group-based variables in R. In closing out this project, we have learned the basic variable types available to us, common use cases for each, and how we can practically apply them in order to store, access, manipulate, and analyze data in an organized and efficient manner. - -In the next series of projects, we'll be diving into one of the deepest, most important parts of all of data science in R: functions. These upcoming projects will be an amalgamation of everything you've learned so far, and once you have functions under your belt you'll really have all the basic tools native R Python that you need. Be sure you understand everything so far, as the next projects will continue to challenge and expand on what we've learned. Never hesitate to reach out for assistance as needed. See you next week! 
- -.Items to submit -==== -- firstname_lastname_project6.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project6.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project6.adoc deleted file mode 100644 index 4d1136239..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project6.adoc +++ /dev/null @@ -1,200 +0,0 @@ -= TDM 10100: R Project 6 -- 2024 - -**Motivation:** Indexing in R is powerful and easy, and can be performed in several ways. - -**Context:** R indexes are often simply vectors of logical (TRUE/FALSE) values (but can also be positive or negative numbers, or can be some names). - -**Scope:** We will get familiar with several types of indexes for data in R. - -.Learning Objectives: -**** -- Learning about how to work with indexes in R. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/death_records/DeathRecords.csv` -- `/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` -- `/anvil/projects/tdm/data/beer/reviews_sample.csv` -- `/anvil/projects/tdm/data/election/itcont1980.txt` -- `/anvil/projects/tdm/data/flights/subset/1990.csv` - - -== Questions - -[IMPORTANT] -==== -As before, please use the `seminar-r` kernel (not the `seminar` kernel). You do not need to use the `%%R` cell magic. -==== - -[TIP] -==== -If you session crashes when you read in the data (for instance, on question 2), you might want to try using 2 cores in your session instead of 1 core. -==== - -Example 1: - -++++ - -++++ - -Example 2: - -++++ - -++++ - -Example 3: - -++++ - -++++ - -Example 4: - -++++ - -++++ - -Example 5: - -++++ - -++++ - -Example 6: - -++++ - -++++ - -Example 7: - -++++ - -++++ - -Example 8: - -++++ - -++++ - - -=== Question 1 (2 pts) - -In the death records file: - -`/anvil/projects/tdm/data/death_records/DeathRecords.csv` - -For context: We can revisit Question 1 from Project 5 *without* using the `subset` command. Instead, we can use indexing, to illustrate the death age for women, as follows: - -`plot(table(myDF$Age[(myDF$Sex == "F") & (myDF$Age < 999)]))` - -and we can make a comparable plot for the deaths for men: - -`plot(table(myDF$Age[(myDF$Sex == "M") & (myDF$Age < 999)]))` - -(Notice how men die earlier, and also there is a bump in the number of deaths for men in their twenties and thirties.) - -OK, in this question, we want to make similar plots for the death age for a few different races. - -[TIP] -==== -You can see what races the data from the `Race` column represents, by looking at page 15 of the pdf source file: https://www.cdc.gov/nchs/data/dvs/Record_Layout_2014.pdf -==== - - -a. 
Make a `table` of the values in the `Race` column and how many times that each `Race` value occurs. How many of the people in the data set have Filipino race? - -b. Use indexing (not the `subset` function) to make a plot of the table of the `Age` values at the time of death for which the `Race` value is the number 7 (which stands for Filipino race) and for which the `Age` is not 999. - - -.Deliverables -==== -- a. A `table` of the values in the `Race` column and how many times that each `Race` value occurs. Also, state how many of the people in the data set have Filipino race. -- b. Plot of the table of `Age` values for people with Filipino race, also with `Age` not equal to 999. -==== - - -=== Question 2 (2 pts) - -In the grocery store file: - -`/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` - -Let's re-examine Project 5, Question 2, as follows: Make a table of the values in the `myDF$STORE_R` column that satisfy the index condition `myDF$SPEND < 0`. In this way, you can re-create your answer to Project 5, Question 2, without using the `subset` function. - -.Deliverables -==== -- Show the number of refunds for each `STORE_R` value, just using indexing, in other words, without using the `subset` function. (For instance, `CENTRAL` stores had 2750 refunds.) -==== - -=== Question 3 (2 pts) - -In this file of beer reviews `/anvil/projects/tdm/data/beer/reviews_sample.csv` - -a. Make a table of the values in the column `myDF$username` but do not print all of the values. Please `sort` the values and show only the `tail`, so that you can see the most popular 6 `username` values, and the number of reviews that each of these 6 people wrote. Hint: The user named `acurtis` wrote the most reviews! - -b. In part 3b, consider only the reviews written by the user `acurtis`. What is the average `score` of the reviews that were written by the user `acurtis`? - - -.Deliverables -==== -- a. Print the most popular 6 `username` values, and the number of reviews that each of these 6 people wrote. -- b. Find the average `score` of the reviews that were written by the user `acurtis`. -==== - - -=== Question 4 (2 pts) - -Read in the 1980 election data using: - -[source, R] ----- -library(data.table) -myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="") -names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID") ----- - -Revisit Question 4 from Project 5, and find the 9 `NAME` values for which the `TRANSACTION_DT` value is missing, using indexing instead of the `subset` function. - -.Deliverables -==== -- Give the 9 `NAME` values for which the `TRANSACTION_DT` value is missing, using indexing instead of the `subset` function. -==== - - -=== Question 5 (2 pts) - -Consider the 1990 flight data: - -`/anvil/projects/tdm/data/flights/subset/1990.csv` - -Using indexing (not the `subset` function) find the `mean` of the `DepDelay` of all of the flights whose `Origin` airport is `EWR` or `JFK` or `LGA`. - -.Deliverables -==== -- Give the `mean` of the `DepDelay` of all of the flights whose `Origin` airport is `EWR` or `JFK` or `LGA`. -==== - - -== Submitting your Work - -We are becoming very familiar with missing data and with subsets of data! These concepts take practice. Please continue to ask questions on Piazza, and/or in office hours. 
- -.Items to submit -==== -- firstname_lastname_project6.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project7-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project7-teachinglearning-backup.adoc deleted file mode 100644 index 59e2ced2b..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project7-teachinglearning-backup.adoc +++ /dev/null @@ -1,212 +0,0 @@ -= TDM 10100: R Project 7 -- 2024 - -**Motivation:** Functions are the backbone of code. Whether you have a goal of building a complex internet server, designing and creating your own videogame, or analyzing enormous swaths of data instantly, you will need to have a strong working knowledge of functions to do it. Functions enable you to write more readable code, apply custom-made operations in novel ways, and overall are a necessity as a data scientist. In this project, we'll explore functions in R, and start to write our own as well! - -**Context:** Again, we'll be building off of all the previous projects here. A strong ability to work with lists and dataframes, analyze documentation and learn from it, and iterate through large amounts of data using a variety of approaches will set you up for success in this project - -**Scope:** Functions, R - -.Learning Objectives: -**** -- Learn what a function is -- Learn about a few common, high-utility functions in R -- Design and write your first functions -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/1987.csv` -- `/anvil/projects/tdm/data/beer/reviews.csv` - -[IMPORTANT] -==== -This project will be working with some larger datasets, so please be sure to reserve 3 cores when starting your Jupyter session. If you are getting a "kernel died" message when you attempt to run your code, feel free to increase it to 4 cores. If your kernel is still dying, please reach out to someone on the TDM team for assistance. You should not reserve more than 4 cores for this project. -==== - -== Questions - -=== Question 1 (2 pts) - -Let's begin by discussing functions. While we've already used numerous functions throughout this semester, we have not yet taken the time to explore what is going on behind the scenes. A function is, at its most basic, some code that takes input data, and returns output data. We often refer to the input data as 'parameters' or 'arguments' and the output data 'return values' or simply 'outputs'. 
The usefulness of functions comes from their reusability. If we need to do the same action at different points in the code, we can define a function that performs that action and use it repeatedly to make our code cleaner and more readable. - -Functions are first **defined**: their name, the number of arguments and the name of each argument, the code that is performed each time the function is called, and what values are returned are defined in this part of the code. Then, when we want to use a function, we **call** it by writing its name along with any arguments we want to give it. - -Let's look at a brief example below, demonstrating how a function is defined, the inputs it takes, and the value it returns. Please copy this into your Jupyter notebook and experiment with it to really get a feel for how things work before attempting to complete this question. Pay special attention to the comments dissecting each part of the code in detail if you are still having trouble understanding the program flow - -[source, r] ----- -# Nothing in between the equals signs gets run until -# the function is called!! -# ===================================================== -# define a function called foo, that takes two -# arguments: bar and buzz -foo <- function(bar, buzz) { - cat("Hello", bar, " how are you?\n") - return (buzz * 10) -} -# ===================================================== - -# call the function with our own arguments, and -# store the output to funcOut1 -funcOut1 <- foo("Jackson", 20) -cat(funcOut1, "\n") - -# we can also pass in defined variables as arguments, -# like so: -var1 <- "Jimbob" -var2 <- 13 -funcOut2 <- foo(var1, var2) -cat(funcOut2, "\n") ----- - -For this question, we want you to define your own function called `is_leap()` that takes one variable, called `year`, as input, and returns `TRUE` if `year` is a leap year and `FALSE` if year is not a leap year. (Hint: You should already have the code to do this in project 2, you just have to turn it into a function!!) - -[NOTE] -==== -Here are some test cases for you to use to double-check that your code is working as expected. -- 1896, 2000, 2004, 2008, and 2024 are all leap years -- 1700, 1900, and 2010 are all not leap years -==== - -.Deliverables -==== -- A function, `is_leap()`, that returns a boolean dictating whether or not a function is a leap year or not -==== - -=== Question 2 (2 pts) - -Awesome. We now know in a real sense what a function is, how to define it, and how to use it in our code. Let's keep building on this by reading in some data and learning how we can apply functions to dataframes all at once! - -[IMPORTANT] -==== -If you missed the note at the top of the project, I will reiterate here once more: you will very likely need to use at least 3 cores for this project. If you get a "kernel died" message, try using 3-4 cores instead of 2. If your kernel is still dying at 4 cores, please reach out to someone on the TDM team for assistance. -==== - -First off, read the "/anvil/projects/tdm/data/flights/1987.csv" data into a new dataframe called `flights_1987`. Print the head of the dataset. - -You should notice a column called "DayOfWeek". Write a function called `dayNamer()` that, given a number for day of the week, returns the name of that day. Run it on at least 3 different rows of the data to verify that it works. (Hint: DayOfWeek depicts Monday as 1, Tuesday as 2, and so on.) - -[NOTE] -==== -The first 5 days in the data, in order, are Friday, Saturday, Thursday, Friday, Saturday. 
You can use this to test your function. -==== - -You can use the below code to test your function: - -[source, r] ----- -for (i in seq(from=1, to=5, by=1)) { - cat("Day", i, ":", dayNamer(flights_1987$DayOfWeek[i]), "\n") -} ----- - -.Deliverables -==== -- a function called `dayNamer` that takes as input a number for the day of the week and returns as output a string that is the name of the day. -==== - -=== Question 3 (2 pts) - -Great, we now have a function that converts a day number into a day name. Let's use this function to create a new column, "day_name", in our dataframe. However, there is a caveat: this is a **LOT** of data. If you try and iterate through it all with a for loop, your kernel will -very likely die (or at the least run very slowly). - -Introducing: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/lapply[`lapply()`]. `lapply()` will allow us to apply a function to every row in a dataframe in a vectorized and efficient manner. Take a look at the below example, where I use this method to create a new column in our data using a nonsensical function I've written (very similar to what you have to do for this question). - -[source, r] ----- -scaler <- function(year) { - return (year * 1000) -} - -flights_1987$nonsense_years <- lapply(flights_1987$Year, scaler) -head(flights_1987) ----- - -[NOTE] -==== -You can check your work here by printing the head of your dataframe and making sure the first 5 days match as expected in the previous question. -==== - -.Deliverables -==== -- A new column, "day_name" in the dataframe, generated using your `dayNamer()` function and `lapply()`, that is the names of each day corresponding to the pre-existing 'DayOfWeek' column -==== - -=== Question 4 (2 pts) - -Now that we've got a good grasp on functions, let's continue to learn by diving into some new combinations of functions we've explored previously. First, use `table()` to get a count of how many times each day occurs in the data (using the 'day_name' column you made in the last question). Then, use `length()` and division to figure out what percentage of the days in our data are each day of the week. Your final result should contain printed output with what proportion (or percentage) of our data occurred on each day of the week. Do not use any looping to solve this problem, as it will be both significantly slower and defeat the purpose of using `table()` and `length()`. - -[NOTE] -==== -We've now used https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/table[`table()`] and https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/length[`length()`] in multiple projects, but feel free to refer back to their docs pages if necessary. -==== - -[IMPORTANT] -==== -If `table()` is giving you a weird output, try using `sapply()` instead of `lapply()`. More on this issue https://stackoverflow.com/questions/66629593/weird-r-issue-with-table[here]. -==== - -.Deliverables -==== -- The proportion of each day in our dataset, printed out. -==== - -=== Question 5 (2 pts) - -For this last question, we'll start getting into the more complex functions that we'll be spending lots of time on in the next few projects. The function you will write for this question is as follows: - -- called `prop_table_maker()` -- Takes two arguments, a dataframe and a column -- Returns a table of the proportions of each value in that column - -If you're struggling with where to start, try and approach this problem like so: - -. 
First, write some code to do this on a specific dataframe and column of your choice (Hint: We did this in the last problem!) -. Next, wrap that code in a function definition, and replace the dataframe and column you chose with your function arguments as needed. -. Finally, be sure that you are returning a table as expected, and test your function a few times with known results. - -Finally, run the following code: - -[source, r] ----- -# read in some beer review data -beer_reviews <- read.csv("/anvil/projects/tdm/data/beer/reviews.csv") - -# get a table of user proportions -top_users <- sort(prop_table_maker(beer_reviews, "username"), decreasing = TRUE) - -# print the top 5 users in the data -cat(names(top_users[1:5])) ----- - -Which should have an output like this if you did everything correctly: - -`Sammy kylehay2004 acurtis StonedTrippin jaydoc` - -.Deliverables -==== -- The `prop_table_maker()` function as described above -- The results of running the provided testing code using your `prop_table_maker()` function -==== - -== Submitting your Work - -Congratulations, you've finished your first in-depth project on functions in R! Going forward, you should be getting quite comfortable in writing your own functions to analyze data, perform calculations, and otherwise simplify repetitive tasks in your code. You should also be able to differentiate between methods and functions, and understand what notation you should use when calling something based on whether it is a function or a method. - -In the next project, we'll finish up our exploration of functions in R, and begin exploring visualizing data and analyzing it to create good summary statistics and graphics. - -.Items to submit -==== -- firstname_lastname_project7.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project7.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project7.adoc deleted file mode 100644 index 31eeee456..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project7.adoc +++ /dev/null @@ -1,254 +0,0 @@ -= TDM 10100: R Project 7 -- 2024 - -**Motivation:** We continue to learn about vectorized operations in R. - -**Context:** Many functions and methods of indexing in R are much more powerful and easy to use (as compared to other tools).. - -**Scope:** We will get familiar with several more types of vectorized operations in R. - -.Learning Objectives: -**** -- Vectorized operations in R. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. 
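Before we start, here is a tiny self-contained warm-up (made-up data, not one of the project files) showing the flavor of vectorized work in this project: `cut` classifies every value of a vector into labeled bins in one call, and `table` then counts how many values land in each bin.

[source, r]
----
# a small made-up vector of ages (purely for illustration)
ages <- c(5, 17, 22, 34, 41, 67, 80)

# classify every age at once; the 5 labels name the 5 bins defined by the 6 break points
groups <- cut(ages,
              breaks = c(-Inf, 18, 25, 35, 55, Inf),
              labels = c("youth", "young adult", "adult", "middle age adult", "senior adult"))

# count how many ages land in each category
table(groups)
----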
- -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/death_records/DeathRecords.csv` -- `/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` -- `/anvil/projects/tdm/data/beer/reviews_sample.csv` -- `/anvil/projects/tdm/data/election/itcont1980.txt` -- `/anvil/projects/tdm/data/flights/subset/1990.csv` - - -Example 1: - -++++ - -++++ - -Example 2: - -++++ - -++++ - -Example 3: - -++++ - -++++ - -Example 4: - -++++ - -++++ - -Example 5: - -++++ - -++++ - -Example 6: - -++++ - -++++ - -Example 7: - -++++ - -++++ - -Example 8: - -++++ - -++++ - -Example 9: - -++++ - -++++ - -Example 10: - -++++ - -++++ - - - -== Questions - -[IMPORTANT] -==== -As before, please use the `seminar-r` kernel (not the `seminar` kernel). You do not need to use the `%%R` cell magic. -==== - -[TIP] -==== -If you session crashes when you read in the data (for instance, on question 2), you might want to try using 2 cores in your session instead of 1 core. -==== - - - -=== Question 1 (2 pts) - -In the death records file: - -`/anvil/projects/tdm/data/death_records/DeathRecords.csv` - -Use the `cut` command to classify people at their time of death into 5 categories: - - -[source, bash] ----- - "youth": less than or equal to 18 years old - - "young adult": older than 18 but less than or equal to 25 years old - - "adult": older than 25 but less than or equal to 35 years old - - "middle age adult": older than 35 but less than or equal to 55 years old - - "senior adult": greater than 55 years old but less than or equal to 150 years old (or any other upper threshhold that you like) - - "unknown": age of 999 (you could use, say, ages 150 to Inf for this category) ----- - -a. First wrap the results of your `cut` function into a table. - -b. In the `cut` function, add labels corresponding to the 6 categories above. - -c. Now wrap the table into a `barplot` that shows the number of people in each of the 6 categories above. - - -.Deliverables -==== -- a. A table showing how many people are in each of the 5 categories above at the time of their death. (The labels for part a should be the default labels, i.e., like this: `(-Inf,18] (18,25] (25,35] (35,55] (55,150] (150, Inf]` - -- b. Same table output as in part a but now also adding labels corresponding to the 6 categories above. - -- c. A `barplot` that shows the number of people in each of the 6 categories above. - -==== - - -=== Question 2 (2 pts) - -In the grocery store file: - -`/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` - -Use the `tapply` function to sum the values from the `SPEND` column, according to 8 categories, namely, according to whether the `YEAR` is 2016 or 2017, and according to whether the `STORE_R` value is `CENTRAL`, `EAST`, `SOUTH`, or `WEST`. - - -.Deliverables -==== -- Show the sum of the values in the `SPEND` column according to the 8 possible pairs of `YEAR` and `STORE_R` values. -==== - -=== Question 3 (2 pts) - -In this file of beer reviews `/anvil/projects/tdm/data/beer/reviews_sample.csv` - -Use `tapply` to categorize the mean `score` values in each month and year pair. Your `tapply` should output a table with years as the row labels and the months as the column labels. - -.Deliverables -==== -- Print a table displaying the mean `score` values for each month and year pair. 
-==== - - -=== Question 4 (2 pts) - -Read in the 1980 election data using: - -[source, R] ----- -library(data.table) -myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="") -names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID") ----- - -In this question, we do not care about the dollar amounts of the election donations. In other words, do not pay any attention to the `TRANSACTION_AMT` column. Only pay attention to the number of donations. There is one donation per row in the data set. - -a. Using the `subset` function to get a data frame that contains only the donations for which the `STATE` is `IN`. From the `CITY` column of this `subset`, make a `table` of the number of occurrences of each `CITY`. Sort the table and print the largest 41 entries. - -b. Same question as part a, but this time, do not use the `subset` function. Instead, consider the elements from the `CITY` column for which the `STATE` value is `IN`. Amongst these restricted `CITY` values, make a `table` of the number of occurrences of each `CITY`. Sort the table and print the largest 41 entries. (Your result from question 4a and 4b should look the same, but using these two different methods.) - -c. Find at least one strange thing about the top 41 entries in your result. - -.Deliverables -==== -- a. Using the `subset` function, give a `table` of the top 41 cities in Indiana, according to the number of donations from people in that city. -- b. Using indexing (not a `subset`), give a `table` of the top 41 cities in Indiana, according to the number of donations from people in that city. -- c. Find at least one strange thing about the top 41 entries in your result. -==== - - -=== Question 5 (2 pts) - -Consider the 1990 flight data: - -`/anvil/projects/tdm/data/flights/subset/1990.csv` - -The `DepDelay` values are given in minutes. We will classify the number of flights according to how many hours that the flight was delayed. - -Use the `cut` command to classify the number of flights in each of these categories: - -`Flight departed early or on time, i.e., DepDelay is negative or 0.` - -`Flight departed more than 0 but less than or equal to 60 minutes late.` - -`Flight departed more than 60 but less than or equal to 120 minutes late.` - -`Flight departed more than 120 but less than or equal to 180 minutes late.` - -`Flight departed more than 180 but less than or equal to 240 minutes late.` - -`Flight departed more than 240 but less than or equal to 300 minutes late.` - -Etc., etc., and finally: - -`Flight departed more than 1380 but less than or equal to 1440 minutes late.` - -Make a `table` that shows the number of flights in each of these categories. - -Use the `useNA="always"` option in the `table`, so that the number of flights without a known `DepDelay` is also given. - -[NOTE] -==== -In the `cut` command, the output will look nicer if you use the option `dig.lab = 4`. -==== - - -.Deliverables -==== -- Give the table described above, which classifies the number of flights according to the number of hours that the flights are delayed. -==== - - -== Submitting your Work - -You now are knowledgeable about a wide range of R functions. 
Please continue to practice and to ask good questions~ - -.Items to submit -==== -- firstname_lastname_project7.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project8-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project8-teachinglearning-backup.adoc deleted file mode 100644 index 9344b72fd..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project8-teachinglearning-backup.adoc +++ /dev/null @@ -1,249 +0,0 @@ -= TDM 10100: R Project 8 -- 2024 - -**Motivation:** In the previous project, we cemented the formal concept of functions and wrote our first functions. We also learned about more complicated functions with multiple arguments, and developed some versatile functions that we can apply to real data. In this project, we're going to learn how to build functions that we can apply to entire datasets at once and more! - -**Context:** The previous project on functions, along with the prior knowledge of data science we've covered so far in the course - -**Scope:** Advanced functions, apply, sapply, lapply, and tapply - -.Learning Objectives: -**** -- Be able to develop versatile, powerful functions -- Learn to apply transform whole columns or dataframes at once -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/craigslist/vehicles.csv` - -[IMPORTANT] -==== -You will likely need 3 cores for this project, but should not need more than 4. -==== - -== Questions - -=== Question 1 (2 pts) - -Let's start out by reading in the data at `/anvil/projects/tdm/data/craigslist/vehicles.csv` into a dataframe (name it whatever you'd like), and then print the head of that dataframe. Take a few minutes to explore and acquaint yourself with the data. - -We're going to provide a bit less structure for this question so that you can use this as an opportunity to test what you've learned so far, but the previous projects outline all of the concepts necessary to complete it. If you are struggling, feel free to read the note down below that details the specific projects to look back at if you are struggling with developing an approach to this question. - -For this question, do two main things: -. print the head of the dataframe -. 
get a sorted (in descending order) table of the top 10 vehicle models for the year 2018 - -[NOTE] -==== -- For a reminder about with logical indexing, revisit Project 3, Question 4 -- For a reminder about `table()`, visit this https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/table[docs page] -==== - -To validate your answers, refer to the below list to see the top 3 models for this year, and their frequency: - -- f-150: 862 -- silverado 1500: 450 -- escape: 379 - -.Deliverables -==== -- The head of your vehicles dataframe -- A descending-order sorted table of the top 10 vehicle models for the year 2018 -==== - -=== Question 2 (2 pts) - -Now that we have some logic to get us the most popular model of vehicle for the given year, let's make a function that uses this logic to tell us interesting information! Taking as input a year, your function should return a table of the 20 most popular vehicle models for that year. - -Here is a basic function signature to get you started: - -[source, r] ----- -popular_models_for_year <- function(year) { - # insert your function's logic here -} ----- - -To validate your function, print the results of running it on the year 2018. Is it the same as the results you got in the last question? If not, you should revisit your logic and see why. - -Next, make a new function named `top_model_for_year()` that returns only the name of the most popular vehicle model for that year. If you aren't sure where to start with this, think about it this way: we can use the code from the function we just wrote, but now, instead of the top 10 models, we only want the top 1. We can use `names()` to retrieve the names in a table, like so: `names(mytable)` returns c(name1, name2, ...). - -There are a few different ways you could do this (i.e. indexing, changing arguments to `head()`), and any are acceptable as long as they return the expected top model. - -Again, test your function on 2018. Does it return 'f-150' as expected? - -.Deliverables -==== -- a function `popular_models_for_year()` that gets a table of the top 10 vehicle models for a given year -- a function `top_model_for_year()` that gets the name of the most popular vehicle for a given year. -==== - -=== Question 3 (2 pts) - -Fantastic. Now that we've created a function that can easily get us the most popular vehicle for a given year, let's incorporate this information back into our dataframe and learn about a useful family of functions in the meantime. - -The 'apply' functions in R are super useful, and each provide some different version of applying a function over a whole dataframe. Give the below (non-exhaustive) list of apply-family functions a once-over, paying special attention to some of the provided examples, before attempting to complete the task for this question. - -- apply: runs a function across every row (or column) of a dataframe, returning a list of results -- lapply/sapply: runs a function on a list/vector, returning a list/array/dataframe of results of the same length as the input list -- tapply: runs a given function on every value in a column, grouping by another specified column (this one is tough to understand without looking at an example) - -[NOTE] -==== -`lapply()` and `sapply()` are extremely similar, and largely differ in the type of object that they return. `lapply()` returns a list of values, where `sapply()` returns an array/dataframe of values. In many cases, they can be used relatively interchangeably. 
-==== - -[source, r] ----- -# create an example dataframe with some playing cards -cards_df <- data.frame(card_name=c("Ace","King","Queen","Jack","3","2","1"), - card_suit=c("Clubs","Clubs","Hearts","Spades","Diamonds","Diamonds","Diamonds"), - lowest_card_cost=c(15,13,10,9,3,2,1), - highest_card_cost=c(500,200,100,90,30,20,10)) -head(cards_df) - -# apply example: -# gets mean card value across columns (mean lowest and highest card costs) -print("Average lowest and highest card cost:") -apply(cards_df[3:4], 2, mean) - -# lapply example: -# converts all our costs from USD to CAD -usd_to_cad <- function(usd) { - return (usd * 1.36) -} -print("Canadian lowest card value") -lapply(cards_df$lowest_card_cost, usd_to_cad) - -# tapply example: -# show the mean highest cost of each suit type in our dataframe -print("Mean Highest Cost by Suit:") -tapply(cards_df$highest_card_cost, cards_df$card_suit, mean) ----- - -[NOTE] -==== -For additional explanation on these functions, including more examples, please refer to https://www.geeksforgeeks.org/apply-lapply-sapply-and-tapply-in-r/[this wonderful GeeksforGeeks article]. -==== - -Your task for this question is straightforward: Print the most commonly occurring value for each column of our data, using `apply()` in conjunction with a function that takes as input a column and returns as output the most commonly occurring value for that column. If you'd like to try and create this function on your own, https://stackoverflow.com/questions/12187187/how-to-retrieve-the-most-repeated-value-in-a-column-present-in-a-data-frame[this stackoverflow post] is a fantastic resource. Otherwise, you can make some _slight_ changes to the function that you wrote previously during this project that gets the most common vehicle for a given year in order to get the most common value in a given column. (Hint: You don't have to consider the year, or any other factors, in this function. It is just the column as a whole that needs to be analyzed) - -[NOTE] -==== -If you're struggling with this question, take a look back at the previous question's work. Is there a way we can use a `sorted()` `table()` to figure out the most common value? -==== - -.Deliverables -==== -- The most common value in each column of the vehicles dataframe -==== - -=== Question 4 (2 pts) - -Let's continue on our journey with the 'apply' family with `lapply()`. To recap what was discussed previously, lapply takes a given column and applies a function to each element in that column, returning a vector of the same length as the original column, containing the results of the function for each element. Again, the example provided in the last question (and provided again below, for your convenience) is a great way to test this and see it in action. - -[source, r] ----- -# create an example dataframe with some playing cards -cards_df <- data.frame(card_name=c("Ace","King","Queen","Jack","3","2","1"), - card_suit=c("Clubs","Clubs","Hearts","Spades","Diamonds","Diamonds","Diamonds"), - lowest_card_cost=c(15,13,10,9,3,2,1), - highest_card_cost=c(500,200,100,90,30,20,10)) -head(cards_df) - -# lapply example: -# converts all our costs from USD to CAD -usd_to_cad <- function(usd) { - return (usd * 1.36) -} -print("Canadian lowest card value") -lapply(cards_df$lowest_card_cost, usd_to_cad) ----- - -For this question, we want you to use lapply() to create a new column `popular_model_by_year` containing the most popular model and manufacturer, respectively, for the year of the given row. 
We've provided some basic instructions for the given task, along with some skeleton code for you to start from, below. - -. First, get a list of unique years in the data -. Then, use your `top_model_for_year()` function from earlier to create a list where the names are the `as.character(year)` and the values are the top model for the given year -. Define a mapping function, year_to_model, that takes as input a year and returns as output the model with that year as a name from the list you just created -. Define a new column, `year_top_model()`, in your dataframe, that is filled in by using lapply and your mapping function `year_to_model()`. -. Print two rows of the data to check your work - -[source, r] ----- -# HINT: You can use unique() to get a list of all the years -years <- # FILL THIS IN - -# use your "top_model_for_year()" function to get the model for each year and add it to a named list -# you're aiming to get a list like c("2018": "f-150", "1900": "model T") -models_years <- setNames(as.vector('''use lapply here to get the top models for each year'''), as.character(years)) - -# map year to model using our models_years function -year_to_model <- function(year) { - return # FILL THIS IN -} - -# use lapply with your year_to_model function -vehicles$year_top_model <- # FILL THIS IN - -# Check your answers! -print(vehicles[29, "year_top_model"]) # => This should print "f-150" -print(vehicles[30, "year_top_model"]) # => This should print "ranger supercab xl pickup" ----- - -.Deliverables -==== -- Use lapply to create the `year_top_model` column as specified above -- Run the provided `Check your answers!` section and be sure they match the expected results -==== - -=== Question 5 (2 pts) - -Let's finish this project out with one of the most useful functions in the `apply` family, and one of the more useful R functions as a whole (especially for data scientists): `tapply()`. `tapply()` is extremely powerful, as it allows us to run grouping functions like `mean()` on a specific column of a dataframe, grouping by another column in the dataframe. For example, `tapply(mydf$price, mydf$condition, median)` will return the median price for each condition in your dataframe. Take a look at the provided example below, and try running it on your own so that you understand tapply. Once you've done this, you can start the task for this question. - -[source, r] ----- -# create an example dataframe with some playing cards -cards_df <- data.frame(card_name=c("Ace","King","Queen","Jack","3","2","1"), - card_suit=c("Clubs","Clubs","Hearts","Spades","Diamonds","Diamonds","Diamonds"), - lowest_card_cost=c(15,13,10,9,3,2,1), - highest_card_cost=c(500,200,100,90,30,20,10)) -head(cards_df) - -# tapply example: -# show the mean highest cost of each suit type in our dataframe -print("Mean Highest Cost by Suit:") -tapply(cards_df$highest_card_cost, cards_df$card_suit, mean) ----- - -Using `tapply()`, calculate the average vehicle price in our dataframe for each given year and then again for each given state. Are there any results you found interesting or surprising? In a markdown cell, write 3-4 sentences about your results analyzing the trends in each result and noting any surprising features of the data you've discovered. 
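If you are not sure how to adapt the card example above, a rough sketch of the two calls could look like this (this assumes your dataframe is named `vehicles` and that the relevant columns are `price`, `year`, and `state`; adjust the names to match your own data, and note that `na.rm = TRUE` is passed through to `mean()` because the data contains missing values):

[source, r]
----
# average vehicle price for each year in the data
avg_price_by_year <- tapply(vehicles$price, vehicles$year, mean, na.rm = TRUE)
print(avg_price_by_year)

# average vehicle price for each state in the data
avg_price_by_state <- tapply(vehicles$price, vehicles$state, mean, na.rm = TRUE)
print(avg_price_by_state)
----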
- -.Deliverables -==== -- The average vehicle price for each year in our data -- The average vehicle price for each state in our data -- At least 3-4 sentences analyzing the results of your `tapply()` operations -==== - -== Submitting your Work - -This project was quite complex, and it is okay to struggle and stop to finish your work later, or even to try and do the project again another day to review and continue to grow your understanding. The `apply()` family of functions, and the concept of a function in general, are extremely valuable tools to have in any field, and mastering the ability to write modular functions and apply them to entire dataframes at once to improve or extend on existing data analyses will take your skills to the next level. - -Next week we will take a step back and start to look at packages in R, which are essentially collections of functions that others have written that can **DRASTICALLY** reduce your workload and make your code much faster. As you've likely come to learn, reducing repition is one of the key paradigms of good code, and packages are a fantastic method towards this end. - -Take a break, drink some water, enjoy some fresh air, and I hope you all have a great rest of your week. I look forward to the opportunity to learn with you all on the next project. - -.Items to submit -==== -- firstname_lastname_project8.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project8.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project8.adoc deleted file mode 100644 index 8063eaf56..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project8.adoc +++ /dev/null @@ -1,229 +0,0 @@ -= TDM 10100: R Project 8 -- 2024 - -**Motivation:** We will learn about how user-defined functions work in R. - -**Context:** Although R has lots of built-in functions, we can design our own functions too! - -**Scope:** We start with some basic functions, just one line functions, to demonstrate how powerful they are. - -.Learning Objectives: -**** -- User-defined functions in R. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/death_records/DeathRecords.csv` -- `/anvil/projects/tdm/data/beer/reviews_sample.csv` -- `/anvil/projects/tdm/data/election/itcont1980.txt` -- `/anvil/projects/tdm/data/flights/subset/1990.csv` -- `/anvil/projects/tdm/data/olympics/athlete_events.csv` - -Example 1: - -Finding the average weight of Olympic athletes in a given country. 
- -[source,R] ----- -avgweights <- function(x) {mean(myDF$Weight[myDF$NOC == x], na.rm = TRUE)} ----- - -++++ - -++++ - -Example 2: - -Finding the percentages of school metro types in a given state. - -[source,R] ----- -myschoolpercentages <- function(x) {prop.table(table(myDF$"School Metro Type"[myDF$"School State" == x]))} ----- - -++++ - -++++ - -Example 3: - -In the 1980 election data, finding the sum of the donations in a given state. - -[source,R] ----- -mystatesum <- function(x) {sum(myDF$TRANSACTION_AMT[myDF$STATE == x])} ----- - -++++ - -++++ - -Example 4: - -Finding the average number of stars for a given author of reviews. - -[source,R] ----- -myauthoravgstars <- function(x) {mean(myDF$stars[myDF$author == x])} ----- - -++++ - -++++ - - - -== Questions - -[IMPORTANT] -==== -As before, please use the `seminar-r` kernel (not the `seminar` kernel). You do not need to use the `%%R` cell magic. -==== - - - -=== Question 1 (2 pts) - -Consider this user-defined function, which makes a table that shows the percentages of values in each category: - -[source,R] ----- -makeatable <- function(x) {prop.table(table(x, useNA="always"))} ----- - -If we do something like this, with a column from a data frame: - -[source,R] ----- -makeatable(myDF$mycolumn) ----- - -Then it is the same as running this: - -[source,R] ----- -prop.table(table(myDF$mycolumn, useNA="always")) ----- - -In other words, `makeatable` is a user-defined function that makes a table, including all `NA` values, and expresses the result as percentages. That is what the `prop.table` does here. - -Now consider the DeathRecords data set: - -`/anvil/projects/tdm/data/death_records/DeathRecords.csv` - -a. Try the function `makeatable` on the `Sex` column of the DeathRecords. - -b. Also try the function `makeatable` on the `MaritalStatus` column of the DeathRecords. - - -.Deliverables -==== -- Use the `makeatable` function to display table of values from the `Sex` column of the DeathRecords. - -- Use the `makeatable` function to display table of values from the `MaritalStatus` column of the DeathRecords. -==== - - -=== Question 2 (2 pts) - -Define a function called `teenagecount` as follows: - -[source,R] ----- -teenagecount <- function(x) {length(x[(x >= 13) & (x <= 19) & (!is.na(x))])} ----- - -a. Try this function on the `Age` column of the DeathRecords. - -b. Also try this function on the `Age` column of the file `/anvil/projects/tdm/data/olympics/athlete_events.csv` - -.Deliverables -==== -- Display the number of teenagers in the DeathRecords data. -- Display the number of teenagers in the Olympics Athlete Events data. -==== - -=== Question 3 (2 pts) - -The `nchar` function gives the number of characters in a string. The `which.max` function finds the position of the maximum value. Define the function: - -[source,R] ----- -longesttest <- function(x) {x[which.max(nchar(x))]} ----- - -a. Use the function `longesttest` to find the longest review in the `text` column of the beer reviews data set `/anvil/projects/tdm/data/beer/reviews_sample.csv` - -b. 
Also use the function `longesttest` to find the longest name in the `NAME` column of the 1980 election data: - -[source, R] ----- -library(data.table) -myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="") -names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID") ----- - - -.Deliverables -==== -- Print the longest review in the `text` column of the beer reviews data set `/anvil/projects/tdm/data/beer/reviews_sample.csv` -- Print the longest name in the `NAME` column of the 1980 election data. -==== - - -=== Question 4 (2 pts) - -a. Create your own function called `mostpopulardate` that finds the most popular date in a column of dates, as well as the number of times that date occurs. - -b. Test your function `mostpopulardate` on the `date` column of the beer reviews data `/anvil/projects/tdm/data/beer/reviews_sample.csv` - -c. Also test your function `mostpopulardate` on the `TRANSACTION_DT` column of the 1980 election data. - -.Deliverables -==== -- a. Define your function called `mostpopulardate` - -- b. Use your function `mostpopulardate` to find the most popular `date` in the beer reviews data `/anvil/projects/tdm/data/beer/reviews_sample.csv` - -- c. Also use your function `mostpopulardate` to find the most popular transaction date from the 1980 election data. -==== - - -=== Question 5 (2 pts) - -Define a function called `myaveragedelay` that takes a 3-letter string (corresponding to an airport code) and finds the average departure delay (after removing the NA values) from the `DepDelay` column of the 1990 flight data `/anvil/projects/tdm/data/flights/subset/1990.csv` for flights departing from that airport. - -Try your function on the Indianapolis "IND" flights. In other words, `myaveragedelay("IND")` should print 5.96977225672878 because the flights with `Origin` airport "IND" have an average departure delay of 5.9 minutes. - -Try your function on the New York City "JFK" flights. In other words, `myaveragedelay("JFK")` should print 11.8572741063607 because the flights with `Origin` airport "JFK" have an average departure delay of 11.8 minutes. - -.Deliverables -==== -- a. Define your function called `myaveragedelay` - -- b. Use `myaveragedelay("IND")` to print the average departure delay for flights with Origin airport "IND". - -- c. Use `myaveragedelay("JFK")` to print the average departure delay for flights with Origin airport "JFK". -==== - - -== Submitting your Work - -Now you know how to write your own functions! Please let us know if you need assistance with this project. - - -.Items to submit -==== -- firstname_lastname_project8.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in Gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission].
- -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project9-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project9-teachinglearning-backup.adoc deleted file mode 100644 index ef62620de..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project9-teachinglearning-backup.adoc +++ /dev/null @@ -1,181 +0,0 @@ -= TDM 10100: R Project 9 -- 2024 - -**Motivation:** Knowing how to write our own functions in R truly opens up a lot of functionality to us, and allows us to design and create much more complicated operations. Code that other people write can be made available in "packages" and reused by anyone, and being able to leverage packages in R will truly boost your data abilities. In this project, we'll explore some commonly used packages in R and how to use them, and introduce a few packages that we'll focus on in greater detail in the coming weeks. - -**Context:** Previous syntax for functions along with more basic control and looping structures in R will be useful for this project. - -**Scope:** Packages, data.table, tidyverse, dplyr, tidyr, ggplot2, stringr, lubridate - -.Learning Objectives: -**** -- Learn what a package is -- Learn how to import a package and check its existence and version -- Learn how to use packages and resolve conflicts between packages -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/movies_and_tv/*` - -== Questions - -=== Question 1 (2 pts) - -As mentioned in the introduction to this project, code is almost always a collaborative effort that involves relying on, and building on, the work that came before us. In R, packages are how we do this. At a high level, you can conceptualize packages as bundles of pre-written functions, object definitions, and other code that we can use in our own projects to lighten our own workload. In this question, let's read in some data, load our first package, and prepare for the rest of the project. - -For this question, start by loading the `data.table` package by running - -[source, r] ----- -library(data.table) ----- - -As simple as that, we now have access to all the various code that `data.table` has to offer! (For a comprehensive overview of data.table, refer to https://cran.r-project.org/web/packages/data.table/index.html[here]) You can think of `data.table` as an efficient, fast extension to the `data.frame` objects we've used thus far in the class. - -The `movies_and_tv` directory that we'll be working with during this project contains several different `.csv` files that we'll be looking at, namely, `titles.csv`, `episodes.csv`, `people.csv`, and `ratings.csv`. Because we'll be reading in so much data, our trusty `read.csv()` function just won't do. Instead, we'll be using the _wicked fast_ `fread()` function from the data.table library we just imported.
Below is the required code to read the data in the 'titles.csv' file into a data.table called titles. In order to get full credit for this problem, please read in the four files mentioned above to data.tables with names that match the files they come from (i.e. read "episodes.csv" into "episodes"), and print the head of each one. - -[source, r] ----- -# read -titles <- fread("/anvil/projects/tdm/data/movies_and_tv/titles.csv") ----- - -[NOTE] -==== -One important thing to note is that the objects that we just read our data into are `data.table` objects, and not the `data.frame` objects we've been using thus far. This won't be a problem, as `data.table` has relatively the same functionality as `data.frame`, but it is good to recognize that they are different types. -==== - -.Deliverables -==== -- 4 new data.tables, and their heads, named titles, episodes, ratings, and people -==== - -=== Question 2 (2 pts) - -There's a lot of data here, so for the sake of not killing our kernel, let's work with a subset of it. Let's narrow down our data to just episodes of one of (in this author's opinion) the best T.V. shows of all time: "Game of Thrones" (at least until the last 2 seasons). First off, you can recognize that the ID for the show is "tt0944947". (If you're wondering how I got this, visit the IMDb page for the show https://www.imdb.com/title/tt0944947/?ref_=fn_al_tt_1[here] and take a look at the URL). - -Use logical indexing the same way we have in previous projects to get just the rows in our `episodes` table where the "show_title_id" is "tt0944947", and store it to a new data.table called GoT_dat, and print the head. - -Next, use a function from a previous project to get a count of how many episodes there are in each season using our GoT_dat. As we've already done this a few times in this class, we're not going to put exactly what you need to do for this, but feel free to take a look at the note below if you need a small hint. - -[NOTE] -==== -If you're having trouble with this question, take a look at Project 8, Question 1 for a reminder on getting counts of each value. The "season_number" column should help with this. -==== - -.Deliverables -==== -- A new data.table, `GoT_dat`, of the "Game of Thrones" episodes -- The number of episodes in each season of "Game of Thrones" -==== - -=== Question 3 (2 pts) - -We've already got a pretty useful subset of our data, but putting all these episode_title_id's into our URLs every time we want to see the name of an episode is super tedious. Let's fix this with a new function, `merge()`. - -[NOTE] -==== -`merge()` is a function in both standard R and `data.table`, and the `data.table` version is often faster. This is our first example of "package conflict". In this case, the conflict will resolve itself because we are storing our data in a `data.table`, so our kernel will know to use the `data.table merge()` automatically. -==== - -`merge()` takes a few arguments. Take a look at the below example - -[source, r] ----- -# table 1 is the first table we want to merge -# table 2 is the second table we want to merge -# the tables will merge by lining up the two columns specified as -# "by.x" for the column name in table1, -# and "by.y" for the column name in table2 -merged_table <- merge(table1, table2, by.x="table1_column", by.y="table2_column") ----- - -For this problem, merge your `GoT_dat` data.table with the titles data.table and store the result back to your `GoT_dat`. Print the head to be sure you made the changes you wanted to. 
- -Then, do the same thing but with the ratings table so that we get the review score of each episode. - -Print the head of your final `GoT_dat` data.table to be sure that you made the changes you wanted. - -.Deliverables -==== -- The head of your `GoT_dat` data.table, but now with the titles and ratings tables merged in. -==== - -=== Question 4 (2 pts) - -Let's explore another useful improvement in `data.table`: subsetting. While this won't technically be a huge increase in functionality or speed, especially with the rather small amount of data we're currently looking at, it's important to note that it _is_ noticeably cleaner than regular indexing, in addition to being noticeably faster at scale. - -`data.table` subsetting essentially allows us to do the logical indexing we've been doing for a while now, without having to repeat the name of our data.table over and over again. As an example, take a look at the below (and consider running it on your own to see how it works): - -[source, r] ----- -# regular R subsetting -GoT_dat[GoT_dat$episode_number > 8 & GoT_dat$votes > 20000] - -# _fancy_ new data.table subsetting -GoT_dat[episode_number > 8 & votes > 20000] ----- - -Use data.table subsetting to get the episodes with a rating of at least 8.5, then figure out the mean runtime in minutes of those well-rated episodes. - -.Deliverables -==== -- The average runtime of Game of Thrones episodes with a rating of at least 8.5. -==== - -=== Question 5 (2 pts) - -For this last question, let's just barely start to look at an extremely useful library when it comes to visualizing your data: ggplot2. ggplot2 is one of the go-to libraries for plotting in R. - -We'll provide a pretty substantial amount of code for you to build off of in this question, as this is just a teaser for what we'll be working on in greater detail soon. - -Our end goal is to create a bar plot of the average rating of an episode for each season of Game of Thrones. Take a look at https://ggplot2.tidyverse.org/reference/[this documentation], along with the below sample code, to create this plot. - -[NOTE] -==== -To solve this problem, you only have to replace things in the below example that are within triple quotes! -==== - -[source, r] ----- -library(ggplot2) - -ggplot("""dataframe name""", - aes(x = """x col name""", - y = ave("""y col name""", """x col name"""))) + - stat_summary(fun="mean", geom="bar") + - labs(title="TITLE HERE", - x = "X AXIS LABEL", - y = "Y AXIS LABEL") ----- - -Finally, write in a markdown cell which season of Game of Thrones got the highest reviews and which got the lowest. Is this what you would expect? - -.Deliverables -==== -- A bar plot of average ratings per season -- In a markdown cell, which season of GoT was rated the highest and which was rated the lowest. -==== - -== Submitting your Work - -Well, this project covered a lot of topics across a very wide range. Hopefully you have a bit more of a feel for just how powerful packages are. - -In the next few weeks we'll do a deep dive on some of the most used packages in R, and then work on making data visualizations to summarize and demonstrate patterns in data that can be easy to miss otherwise. - -.Items to submit -==== -- firstname_lastname_project9.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in Gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not.
**Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project9.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project9.adoc deleted file mode 100644 index 703249684..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-project9.adoc +++ /dev/null @@ -1,153 +0,0 @@ -= TDM 10100: R Project 9 -- 2024 - -**Motivation:** Knowing how to merge data frames in R truly opens up a lot of functionality to us, and allows us to design a more comprehensive analysis of our data sets. - -**Context:** The `merge` function allows us to combine information from multiple data tables. - -**Scope:** merging tables - -.Learning Objectives: -**** -- Learn how to merge data frames -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/icecream/combined/products.csv` (ice cream products) -- `/anvil/projects/tdm/data/icecream/combined/reviews.csv` (ice cream reviews) -- `/anvil/projects/tdm/data/flights/subset/airports.csv` (information about airports) -- `/anvil/projects/tdm/data/flights/subset/1990.csv` (flight data from 1990) - -== Questions - -=== Question 1 (2 pts) - -Read in the ice cream `products` and `reviews` files into two separate data frames, as follows: - -[source, r] ----- -library(data.table) -productsDF <- fread("/anvil/projects/tdm/data/icecream/combined/products.csv") -reviewsDF <- fread("/anvil/projects/tdm/data/icecream/combined/reviews.csv") ----- - -Notice that these two data frames have three columns in common, namely, `brand`, `key`, and `ingredients`. If we try to merge these two data frames by default, R will try to match data on all three columns, BUT the `ingredients` column has a different role in these two data frames. In the `productsDF`, the `ingredients` column has a list of the ice cream ingredients. In the `reviewsDF`, the `ingredients` column has values between 1 and 5, or an `NA` value. - -For this reason, when we merge the information from the two tables, we only want to merge the data according to the `brand` and `key` values, as follows: - - -[source,r] ----- -newmergedDF <- merge(productsDF, reviewsDF, by = c("brand", "key") ) ----- - -Notice that the `productsDF` has 8 columns, and the `reviewsDF` has 13 columns, and the `newmergedDF` has 19 columns, namely, all of the columns from both of the previous two data frames, including two separate `ingredients` columns, one from each data frame. - -In this data frame, there are 978 rows that correspond to the `name` being `"Chocolate Chip Cookie Dough"`, which has `brand == "bj"` and `key == "16_bj"`. We can get this subset of the data as follows: - -[source,r] ----- -ChocolateChipCookieDoughDF <- subset(newmergedDF, (brand == "bj") & (key == "16_bj")) ----- - -Notice that this new data frame called `ChocolateChipCookieDoughDF` has 978 rows.
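For instance, one quick way to double check that the merge and the subset behaved as described (this is just a sanity-check sketch, assuming you created `newmergedDF` and `ChocolateChipCookieDoughDF` exactly as above) is to look at their dimensions:

[source,r]
----
# the merged data frame should have 19 columns
ncol(newmergedDF)

# the Chocolate Chip Cookie Dough subset should have 978 rows
nrow(ChocolateChipCookieDoughDF)
----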
- -What are the average number of stars for the 978 reviews in the data frame called `ChocolateChipCookieDoughDF`? - - -.Deliverables -==== -- Print the average number of stars for the 978 reviews in the data frame called `ChocolateChipCookieDoughDF`. -==== - -=== Question 2 (2 pts) - -Starting with the `newmergedDF`, find the average number of `stars` for the reviews of ice cream with `name == "Half Baked\302\256"`. - -[IMPORTANT] -==== -There are two characters that you will not see but they are there. They are encoded as `"\302\256"`. -==== - -.Deliverables -==== -- Print the average number of `stars` for the reviews of ice cream with `name == "Half Baked\302\256"`. -==== - - -=== Question 3 (2 pts) - -In Project 2, we learned about the `grep` and `grepl` functions. Using either of these two functions, you can limit attention to ice cream products that have `"Chocolate"` in the title. (There are 49 such ice cream products.) Find the average number of stars for all 4831 of the reviews of these products that have `"Chocolate"` in the product name. - - -.Deliverables -==== -- Print the average number of stars for all 4831 of the reviews of these products that have `"Chocolate"` in the product name. -==== - -=== Question 4 (2 pts) - -Read in the information about airports, and also the flight data from 1990, into two separate data frames, as follows: - -[source, r] ----- -library(data.table) -airportsDF <- fread("/anvil/projects/tdm/data/flights/subset/airports.csv") -flightdataDF <- fread("/anvil/projects/tdm/data/flights/subset/1990.csv") ----- - -[IMPORTANT] -==== -It is necessary to have 2 cores in your Jupyter Lab session for Question 4 and Question 5. -==== - -[IMPORTANT] -==== -Do not worry about the warning message from the `fread` function when you read in the `airportsDF` data. -==== - -These two data frames do not have any columns in common! Nonetheless, the `"iata"` values from the `airportsDF` are the three-letter codes corresponding to airports, which are also found in the `Origin` and `Dest` columns in the `flightdataDF`. So when we merge the information from the two tables, if we want to study where the flights depart, then we only want to merge the data according to the `iata` value (from the `airportsDF`) merged with the `Origin` value (from the `flightdataDF`), as follows: - - -[source,r] ----- -mybigDF <- merge(airportsDF, flightdataDF, by.x = "iata", by.y = "Origin") ----- - -Using this new data frame, find the average departure delay for all flights that have `Origin` airport in Indiana. - - -.Deliverables -==== -- Print the average departure delay for all flights that have `Origin` airport in Indiana. -==== - -=== Question 5 (2 pts) - -Using the same data frame from Question 4, find the average departure delay for all flights that have `Origin` airport in Houston, Texas. - -.Deliverables -==== -- Print the average departure delay for all flights that have `Origin` airport in Houston, Texas. -==== - -== Submitting your Work - -Now you are familiar with the method of merging data from multiple data frames. - - -.Items to submit -==== -- firstname_lastname_project9.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. 
See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-projects.adoc deleted file mode 100644 index 1baf345c5..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/10100/10100-2024-projects.adoc +++ /dev/null @@ -1,51 +0,0 @@ -= TDM 10100 - -== Important Links - -xref:fall2024/logistics/office_hours.adoc[[.custom_button]#Office Hours#] -xref:fall2024/logistics/syllabus.adoc[[.custom_button]#Syllabus#] -https://piazza.com/purdue/fall2024/tdm1010010200202425[[.custom_button]#Piazza#] - -== Assignment Schedule - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -|=== -| Assignment | Release Date | Due Date -| Syllabus Quiz | Aug 19, 2024 | Aug 30, 2024 -| Academic Integrity Quiz | Aug 19, 2024 | Aug 30, 2024 -| Project 1 - Working on Anvil in R | Aug 19, 2024 | Aug 30, 2024 -| Project 2 - Tables and data frames in R | Aug 22, 2024 | Aug 30, 2024 -| Project 3 - The tapply function | Aug 29, 2024 | Sep 06, 2024 -| Project 4 - More examples with the tapply function | Sep 05, 2024 | Sep 13, 2024 -| Outside Event 1 | Aug 19, 2024 | Sep 13, 2024 -| Project 5 - Subsets and missing data | Sep 12, 2024 | Sep 20, 2024 -| Project 6 - Indexing | Sep 19, 2024 | Sep 27, 2024 -| Project 7 - Vectorized Operations | Sep 26, 2024 | Oct 04, 2024 -| Outside Event 2 | Aug 19, 2024 | Oct 04, 2024 -| Project 8 - Functions | Oct 03, 2024 | Oct 18, 2024 -| Project 9 - Merging data frames | Oct 17, 2024 | Oct 25, 2024 -| Project 10 - apply functions | Oct 24, 2024 | Nov 01, 2024 -| Project 11 - apply functions | Oct 31, 2024 | Nov 08, 2024 -| Outside Event 3 | Aug 19, 2024 | Nov 08, 2024 -| Project 12 - Working with large data files | Nov 7, 2024 | Nov 15, 2024 -| Project 13 - Mapping in R | Nov 14, 2024 | Nov 29, 2024 -| Project 14 - Class Survey | Nov 21, 2024 | Dec 06, 2024 -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project1.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project1.adoc deleted file mode 100644 index 2a0275a0e..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project1.adoc +++ /dev/null @@ -1,266 +0,0 @@ -= TDM 20100: Project 1 -- Welcome to the CLI! - -**Motivation:** The _Command Line Interface_, often referred to simply as the _CLI_, is the bread-and-butter of working with computers. With it, we can navigate through files, search for patterns, create, modify, and delete thousands of files with a single line of code, and more! In the next few projects we'll be learning all about the CLI and what it is capable of. In just a few weeks, you'll be well on your way to mastery of the command line! - -**Context:** Experience working in Anvil will make this project easier to start but is not a prerequisite. - -**Scope:** Anvil, Jupyter Labs, CLI, Bash, GNU, filesystem navigation - -.Learning Objectives: -**** -- Remember how to work in Anvil -- Learn how to navigate the CLI -- Learn how to navigate a filesystem in the CLI -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/` -- `/anvil/projects/tdm/data/flights/subset/` -- `/anvil/projects/tdm/data/icecream/combined/reviews.csv` - -== Questions - -=== Question 1 (2 pts) - -It's been a long summer, so let's start our first project this semester off with a quick review of https://www.rcac.purdue.edu/compute/anvil[Anvil]. In case you haven't already, visit https://ondemand.anvil.rcac.purdue.edu[ondemand.anvil.rcac.purdue.edu] and log in using your ACCESS account credentials. If you don't already have an account, follow https://the-examples-book.com/setup[these instructions] to set one up. If you've forgotten your account credentials or are having other issues related to Anvil, please reach out to datamine-help@purdue.edu with as much information as possible about your issue. - -[IMPORTANT] -==== -Your ACCESS account credentials may not necessarily be the same as your Purdue Career account. -==== - -Once logged in, start a new Anvil session for a few hours (try not to use more than 3 for the moment) and 1 CPU core. - -[NOTE] -==== -For a reminder on how to start a new session on Anvil: - -In the upper-middle part of your screen, you should see a dropdown button labeled `The Data Mine`. Click on it, then select `Jupyter Notebook` from the options that appear. If you followed all the previous steps correctly, you should now be on a screen that looks like this: - -image::f24-101-p1-1.png[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"] - -If your screen doesn't look like this, please try and select the correct dropdown option again or visit seminar for more assistance. -==== - -For this first question, we're going to have you get used to working with Jupyter. To start, upload https://the-examples-book.com/projects/_attachments/project_template.ipynb[this project template] to Jupyter and fill in your name and the project number as needed, selecting `seminar` (not `seminar-r`) as your https://the-examples-book.com/tools/anvil/jupyter-lab-kernels[kernel]. - -Then, run the following Python in a code cell. 
For a more in-depth reminder on working in Jupyter, feel free to take a look at https://the-examples-book.com/projects/fall2024/10100/10100-2024-project1[this year's 10100 project 1] or check out https://the-examples-book.com/tools/anvil/jupyter[this guide on Jupyter]. - -[source, bash] ----- -print("Hello World!") ----- - -Then, in a new code cell, run the following: - -[source, bash] ----- -%%bash - -echo Hello World! ----- - -The first line, `%%bash`, is _cell magic_, which tells our kernel to expect a different language than the default. (In this case, the default is Python and we are telling it to use bash instead.) When using cell magic, it is necessary to have the cell magic as the first line in the cell. If (for instance) a comment is the first thing in the cell, then the cell magic will fail. - -The second line consists of `echo Hello World!`. `echo` is a Bash command similar to `print()` in Python, and we have it print "Hello World!" - -As for https://en.wikipedia.org/wiki/Bash_(Unix_shell)[Bash] (short for _Bourne-Again-SHell_), you can think of it as a programming language for the command line. Bash has a _lot_ of handy tools and commands to learn, and the rest of this project will be spent beginning to learn about Bash. - -The _terminal_ is what we call the area we typically work with the CLI in. While we can run Bash in our Jupyter notebook (as we did above), you will typically work directly in a terminal and throughout this semester we would recommend that you first run your Bash code in a terminal before copying the finished code over to your Jupyter notebook. To open a terminal on Anvil, open a new tab and select `Terminal`, where you'll be greeted with a window that looks somewhat like the following (albeit `jaxmattfair` will be replaced by your access username.) - -image::f24-201-p1-1.png[OnDemand Terminal, width=792, height=500, loading=lazy, title="OnDemand Terminal"] - -Try typing `echo Hello World!` and hitting enter. You should see the terminal print "Hello World!" before waiting for another command. - -To get credit for this question, write a command using `echo` that prints "Hello X!" where "X" is replaced with your name. Be sure to copy your finished command into your Jupyter notebook and run it using _cell magic_ to get credit for your work. - -++++ - -++++ - -.Deliverables -==== -- A command to print "Hello X!" (where "X" is replaced with your name) and the results of running it -- Be sure to document your work from Question 1, using some comments and insights about your work. -==== - -=== Question 2 (2 pts) - -Knowing how to navigate in the shell is helpful. A few notes: - -Absolute paths start with a '/', like this: - -`/anvil/projects/tdm/data/flights/subset/` - -Relative paths do not start with a '/', like this: - -`subset` - -The 'cd' command is used to change directories. - -By default, 'cd' just changes your location back to your home directory. - -You can type 'cd' with absolute paths or relative paths, for instance: - -[source, bash] ----- -%%bash -cd /anvil/projects/tdm/data/flights/subset/ ----- - -or like this: - -[source, bash] ----- -%%bash -cd /anvil/projects/tdm/data/flights/ -cd subset ----- - -If you want to go back to a directory one level higher, type 'cd ..' - -For instance, try this, which first moves our location to the flight `subset` directory, and then back to the `flights` directory, and then back to the `data` directory. - -[source, bash] ----- -%%bash -cd /anvil/projects/tdm/data/flights/subset/ -cd .. -cd .. 
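# note: the two `cd ..` commands above move us up one level at a time,
# from the subset directory to the flights directory, and then to the data
# directory, so the pwd command on the next line prints /anvil/projects/tdm/data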
-pwd ----- - -The `pwd` command prints the working directory. - -The `ls` command prints the contents of the working directory, with only the file names. - -Dr Ward likes to run `ls -la` (those are lowercase letter L's, not number 1's), which shows information about the files in the directories. - -Dr Ward also uses `pwd` a lot, to make sure that he is working in the directory that he intended to be working in. - -[IMPORTANT] -==== -Each bash cell in Jupyter Lab is executed independently, starting from your home directory, as if nothing had been previously run. In other words, bash cells in Jupyter Lab ignore anything that you did in earlier cells. -==== - -Which years of flight data are in the directory: - -`/anvil/projects/tdm/data/flights/subset/`? - -Which years of flight data are in the directory: - -`/anvil/projects/tdm/data/flights/`? - -In which of the two directories are the files bigger in size? - -++++ - -++++ - -.Deliverables -==== -- The year range of flight data in the two directories indicated above, and which directory has bigger file sizes. -- Be sure to document your work from Question 2, using some comments and insights about your work. -==== - -=== Question 3 (2 pts) - -We can use the `head` and the `tail` commands to see the top lines and the bottom lines of a file. By default, we see 10 lines of output, in each case. We can use the `-n` flag to change the number of lines of output that we see. For instance: - -[source, bash] ----- -%%bash - -head -n6 /anvil/projects/tdm/data/flights/subset/1987.csv ----- - -shows the first 6 lines of the `1987.csv` file in the flights `subset` directory. This includes the header line and also the information about the first 5 flights. - -The `cut` command usually takes two flags, namely: - -the `-d` flag that indicates how the data in a flag is delimited (in other words, what character is placed between the pieces of data), and - -the `-f` flag that indicates which fields we want to cut. - -Use the `cut` command to extract all of the origin airports and destination airports from the `1987.csv` file in the flights `subset` directory, and store the resulting origin and destination airports into a file in your home directory. - -You can save the results of your work in bash in a file in your home directory like this: - -[source, bash] ----- -%%bash -myworkinbash >$HOME/originsanddestinations.csv ----- - -++++ - -++++ - -.Deliverables -==== -- Show the head of the file `originsanddestinations.csv` that you created. -- Be sure to document your work from Question 3, using some comments and insights about your work. -==== - -=== Question 4 (2 pts) - -Use the `grep` command to find data in the `1987.csv` file in the flights `subset` directory that contain the pattern `IND`. Save all of the lines of that `1987.csv` file into a new file in your home directory called `indyflights.csv`. - -++++ - -++++ - -.Deliverables -==== -- Show the head of the file `indyflights.csv` that you created. -- Be sure to document your work from Question 4, using some comments and insights about your work. -==== - -=== Question 5 (2 pts) - -Now consider the file: - -`/anvil/projects/tdm/data/icecream/combined/reviews.csv` - -Use the `grep` command to extract all of the lines from this file that contain the word `terrific` and store these reviews in a new file called `terrificreviews.csv` in your home directory. 
- -If you look at the first line of the file: - -`/anvil/projects/tdm/data/icecream/combined/reviews.csv` - -you will see that field 5 of each line has the number of stars for that product review. - -Among (only) the reviews in the `terrificreviews.csv` file, how many of the reviews had only 1 star? How many had 4 stars? How many had 5 stars? - - -.Deliverables -==== -- From the file `terrificreviews.csv` that you created, how many of the reviews had only 1 star? How many had 4 stars? How many had 5 stars? -- Be sure to document your work from Question 5, using some comments and insights about your work. -==== - - -== Submitting your Work - -With this last question completed, you've successfully made your first dive into the wonderful world of the command line, and can now successfully navigate just about any filesystem we throw at you! This may not seem like it was a hugely difficult project, but the skills you learned in this project are foundational tools that, when built upon, are extremely powerful skills that offer huge benefits in both research and industry. - -In the next project we'll go one step further than simply navigating the filesystem. We will learn how to create, destroy, and move files much more quickly than we can with R or Python. - -Make sure to put all of your work into a Jupyter Lab notebook, and make sure that all of the desired output appears in the notebook. Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don't lose any points. - -.Items to submit -==== -- firstname_lastname_project1.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project10.adoc deleted file mode 100644 index ca0faf03d..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project10.adoc +++ /dev/null @@ -1,188 +0,0 @@ -= TDM 20100: Project 10 -- SQL - -**Motivation:** Now we learn how to write SQL queries that rely on three or more tables. - -**Context:** We can use multiple `JOIN` statements in SQL. Each `JOIN` statement allows us to add data from an additional table. Each `JOIN` needs its own `ON` statement too. - -**Scope:** We can use several `JOIN` statements in the same query, to pull data from several tables. - -.Learning Objectives: -**** -- We will learn how to make SQLite queries on three or more tables at a time (using the `JOIN`) -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. 
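Before looking at the data, here is a minimal sketch of the general shape of a query that uses two `JOIN` statements; the table names (`tableA`, `tableB`, `tableC`) and column names (`id`, `year`, `colB`, `colC`) are hypothetical placeholders, not tables from the Lahman database:

[source,sql]
----
SELECT a.id, b.colB, c.colC
FROM tableA AS a
JOIN tableB AS b ON (a.id = b.id) AND (a.year = b.year)
JOIN tableC AS c ON (a.id = c.id)
LIMIT 5;
----

Each `JOIN` pulls in one additional table, and each one has its own `ON` clause describing how its rows line up with the rows already selected.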
- -== Dataset(s) - -This project will use the following dataset: - -- `/anvil/projects/tdm/data/lahman/lahman.db` (Lahman baseball database) - -Our page in The Examples Book about SQL (in general) is given here: https://the-examples-book.com/tools/SQL/ - -[IMPORTANT] -==== -Now we learn how to join THREE OR MORE tables using multiple `JOIN` statements! -Before you begin the project, try the examples from the Lahman baseball database found on this webpage of The Examples Book: https://the-examples-book.com/tools/SQL/lahman-examples-two-joins All of these examples rely on two (or more) `JOIN` statements, to extract information from three (or more) tables. -==== - -[TIP] -==== -`INNER JOIN` and `JOIN` are exactly the same thing. If you see `INNER JOIN` and you prefer to write `JOIN`, that is totally OK. -==== - - -== Questions - -Using the `seminar` kernel, if you run this line in a cell by itself: - -`%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db` - -[TIP] -==== -If your kernel dies, then you need to re-run the line above. You also need to re-run this line at the start of any new Jupyter Lab session. -==== - - -Again, we remind students that the list of all of the tables in this database are: - -[source,bash] ----- -AllstarFull -Appearances -AwardsManagers -AwardsPlayers -AwardsShareManagers -AwardsSharePlayers -Batting -BattingPost -CollegePlaying -Fielding -FieldingOF -FieldingOFsplit -FieldingPost -HallOfFame -HomeGames -Managers -ManagersHalf -Parks -People -Pitching -PitchingPost -Salaries -Schools -SeriesPost -Teams -TeamsFranchises -TeamsHalf ----- - -Please read the examples given here: https://the-examples-book.com/tools/SQL/lahman-examples-two-joins and then you are ready to start the questions for this project! - -[IMPORTANT] -==== -In the page of examples, sometimes we write `JOIN` and sometimes we write `INNER JOIN`. These are interchangeable; in other words, `JOIN` and `INNER JOIN` mean the same thing. (There are other types of statements such as `LEFT JOIN` and `RIGHT JOIN` but we will not use either of these, in this project.) -==== - -=== Question 1 (2 pts) - -Revisit Project 9, Question 1: Now add a third table, namely, the `AwardsPlayers` table, so that we see the 4 Chicago Cubs who won a total of 7 awards in 2023. - -For each of the 4 Chicago Cubs who won these 7 awards in 2023, in addition to printing their `PlayerID`, hits (`H`), home runs (`HR`), `nameFirst`, and `nameLast`, please also add one more variable in the SELECT statement to print, namely, the `awardID` from the `AwardsPlayers` table. - -[TIP] -==== -When you join the `AwardsPlayers` table, it is necessary to join `ON` the value of the `playerID` and the `yearID`. -==== - -.Deliverables -==== -- Print the `playerID`, `H`, `HR`, `nameFirst`, `nameLast`, and `awardID` values for all 4 of the players on the 2023 Chicago Cubs team who won 7 awards altogether during that year. -==== - - -=== Question 2 (2 pts) - -Revisit Project 9, Question 2: Now add a third table, namely, the `People` table, so that we can extract the first and last name of the player: - -Join the `Batting` table to the `Pitching` table by matching the `playerID`, `yearID`, and `stint` columns, and in addition, now also join the `People` table. There is only one person from 2023 appearing in both of these tables that hit more than 30 home runs. Print this person's `playerID`, home runs (`HR`), first name (`nameFirst`), and last name (`nameLast`). 
- -[TIP] -==== -If you write `Pitching as p` and also `People as p` then your query will not work. Choose a different letter for one of the tables. For instance, you might write: `People as m` (for example!) or any letter you choose is OK. You simply cannot use the same letter for different tables. -==== - - -.Deliverables -==== -- Print the `PlayerID`, home runs (`HR`), first name (`nameFirst`), and last name (`nameLast`) for the only person who is in both the `Batting` and `Pitching` table in 2023 who had more than 30 home runs (`HR`) in the `Batting` table. -==== - - - -=== Question 3 (2 pts) - -Revisit Project 9, Question 3, but this time study Omar Vizquel instead of Rickey Henderson: Now add a third table, namely, the `Salaries` table, so that we can find the total amount of salary that Omar Vizquel earned in his career: - -Join the `People` and `Batting` and `Salaries` tables. Print only 1 row, corresponding to Omar Vizquel, displaying his `playerID`, `nameFirst`, `nameLast`, `SUM\(R)`, `SUM(SB)`, and `SUM(salary)` values. - -[TIP] -==== -Omar Vizquel had 1445 runs scored altogether and 404 stolen bases, and he made more than 60 million dollars in salary during his career! Your solution will show his exact total salary during his career. -==== - - -.Deliverables -==== -- Print only 1 row, corresponding to Omar Vizquel, displaying his `playerID`, `nameFirst`, `nameLast`, `SUM\(R)`, `SUM(SB)`, and `SUM(salary)` values. -==== - - -=== Question 4 (2 pts) - -a. Revisit Project 9, Question 4, but now join the `Batting`, `People`, and `Appearances` table, to find the top 5 players of all time, in terms of their total number of hits, in other words, according to `SUM(H)`. For the top 5 players (in terms of the total number of hits), print their `playerID`, the `SUM(H)` (in other words, their total number of hits in their careers), their `nameFirst` and `nameLast` values, and now also include a column that shows the `SUM(G_all)` which is the total number of games played in their career. [Do not change the ordering from Project 9, Question 4; in other words, please continue to keep the results in order by the total number of hits.] - -b. Same question as 4b, but this time use home runs (according to `SUM(HR)`) instead of hits. - -[TIP] -==== -When you join the `Appearances` table, make sure that the `playerID` and `yearID` and `teamID` are all in agreement with the `Batting` table. -==== - - -.Deliverables -==== -- For the top 5 players (in terms of the total number of hits), print their `playerID`, the `SUM(H)` (in other words, their total number of hits in their careers), their `nameFirst` and `nameLast` values, and now also include a column that shows the `SUM(G_all)` which is the total number of games played in their career. -- For the top 5 players (in terms of the total number of home runs), print their `playerID`, the `SUM(HR)` (in other words, their total number of home runs in their careers), their `nameFirst` and `nameLast` values, and now also include a column that shows the `SUM(G_all)` which is the total number of games played in their career. -==== - - -=== Question 5 (2 pts) - -Join the `CollegePlaying` and `People` and `HallOfFame` tables to find the `playerID`, `nameFirst`, `nameLast`, `yearID`, `ballots`, `needed`, `votes`, and `inducted` values for the only player who had `schoolID = 'purdue'` in the `CollegePlaying` table and who also appears in the `HallOfFame` table. [There is only 1 such player!] 
- -.Deliverables -==== -- Print the `playerID`, `nameFirst`, `nameLast`, `yearID`, `ballots`, `needed`, `votes`, and `inducted` values for the only player who had `schoolID = 'purdue'` in the `CollegePlaying` table and who also appeared in the `HallOfFame` table. -==== - - -== Submitting your Work - -Now that you know how to join three tables together, you are very knowledgeable about SQL databases! - - - -.Items to submit -==== -- firstname-lastname-project10.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project11.adoc deleted file mode 100644 index b8eafbcfc..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project11.adoc +++ /dev/null @@ -1,133 +0,0 @@ -= TDM 20100: Project 11 -- SQL - -**Motivation:** Now we will apply our SQL skills by studying movies and TV shows. - -**Context:** The Internet Movie DataBase https://www.imdb.com provides data tables here: https://datasets.imdbws.com which we have stored in a database for you here: /anvil/projects/tdm/data/movies_and_tv/imdb2024.db - -**Scope:** There are 7 tables to get familiar with: `akas`, `basics`, `crew`, `episode`, `name`, `principals`, `ratings` - -.Learning Objectives: -**** -- We will learn how to use SQL to analyze several aspects of movies and TV shows -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset: - -- `/anvil/projects/tdm/data/movies_and_tv/imdb2024.db` (Internet Movie DataBase (IMDB)) - -Our page in The Examples Book about SQL (in general) is given here: https://the-examples-book.com/tools/SQL/ - - -== Questions - -Using the `seminar` kernel, if you run this line in a cell by itself: - -`%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb2024.db` - -then you will have the movies and TV database loaded. - -[TIP] -==== -If your kernel dies, then you need to re-run the line above. You also need to re-run this line at the start of any new Jupyter Lab session. -==== - -The tables in this database are: - -[source,bash] ----- -akas -basics -crew -episode -name -principals -ratings ----- - -=== Question 1 (2 pts) - -a. From the `basics` table, display the entry for Friends. (The title constant for Friends is `tt0108778`. Notice that this corresponds to the IMDB webpage for Friends: https://www.imdb.com/title/tt0108778 from IMDB.) - -b. Find all of the entries of the `principals` table that correspond to people in Friends. - -c. 
Use the `episode` table to discover how many episodes occurred during each season of Friends. For each season, print the season number and the number of episodes in that season. - -[TIP] -==== -Notice that the `episode` table has two columns of title constants: one of the episode title constant (which, in the table, is `tconst`) and the other is the main show title constant (which, in the table, is `parentTconst`). -==== - -.Deliverables -==== -- From the `basics` table, display the entry for Friends. -- Find all of the entries of the `principals` table that correspond to people in Friends. -- Use the `episode` table to discover how many episodes occurred during each season of Friends. For each season, print the season number and the number of episodes in that season. -==== - - -=== Question 2 (2 pts) - -Join the `ratings` and the `basics` table, to find the 13 titles that each have more than 2 million ratings. For each such title, output these values: `tconst`, `averageRating`, `numVotes`, `primaryTitle`, `startYear`, `runtimeMinutes`, and `genres` - -.Deliverables -==== -- For each of the 13 titles that each have more than 2 million ratings, output these values: `tconst`, `averageRating`, `numVotes`, `primaryTitle`, `startYear`, `runtimeMinutes`, and `genres` -==== - - - -=== Question 3 (2 pts) - -Using the `startYear` values from the `basics` table, find the total number of entries in each `startYear`. - -.Deliverables -==== -- For each `startYear` value from the `basics` table, print the `startYear` and the total number of entries in corresponding to that `startYear`. -==== - - -=== Question 4 (2 pts) - -a. From the `name` table, find the nconst value for Emma Watson. (Notice that there are several entries with this name, but probably only one of them is the one that you want to analyze.) - -b. How many entries in the `principals` table correspond to Emma Watson (using only the correct value of `nconst` that you found in part a)? - -.Deliverables -==== -- From the `name` table, find the nconst value for Emma Watson. (Although several values appear, just find the 1 value that is correct for her.) -- How many entries in the `principals` table correspond to Emma Watson? -==== - - -=== Question 5 (2 pts) - -Join the `basics` and the `ratings` table to find the 3 entries that have `startYear = 2024` and `numVotes > 100000` and `averageRating > 8`. (Print all of the columns from both tables, for these 3 entries.) - -.Deliverables -==== -- Join the `basics` and the `ratings` table to find the 3 entries that have `startYear = 2024` and `numVotes > 100000` and `averageRating > 8`. (Print all of the columns from both tables, for these 3 entries.) -==== - - -== Submitting your Work - -We see that the SQL skills that we learned for the Lahman baseball database are directly applicable to analyzing the movies and TV database too! It is a good feeling to be able to apply what we have learned in a new setting! - - - -.Items to submit -==== -- firstname-lastname-project11.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. 
- -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project12.adoc deleted file mode 100644 index dc6b23f59..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project12.adoc +++ /dev/null @@ -1,320 +0,0 @@ -= TDM 20100: Project 12 -- SQL - -**Motivation:** We have used two SQL databases but we have not (yet) built a database of our own. - -**Context:** It is straightforward to build a new database from a collection of csv files. - -**Scope:** In SQLite, we demonstrate the setup for building a new database. - -.Learning Objectives: -**** -- We will learn how to build a new SQL database. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset: - -- `/anvil/projects/tdm/data/flights/subset/*` (flight data) - - -== Questions - - -=== Question 1 (2 pts) - -First, open a terminal and combine the data from the subset flight csv files as follows. (We are storing the resulting file in the `$SCRATCH` directory because it is very large. We are also removing NA values (using `awk`) and removing the header from each file (using `grep`). - -[WARNING] -==== -The file that we are about to build on the next line will be large, and so it will take a few minutes to run. -==== - -[source,bash] ----- -cat /anvil/projects/tdm/data/flights/subset/[12]*.csv | awk -F, -v OFS=, '{for (i=1; i<=NF; i++) if ($i == "NA") $i=""};1' | grep -v Year >$SCRATCH/myflightdata.csv ----- - -The `airports.csv` data has a few extra commas: - -[source,bash] ----- -cat /anvil/projects/tdm/data/flights/subset/airports.csv | sed 's.Union County, Troy Shelton .Union County Troy Shelton.g' | sed 's.Savage, Sr.Savage Sr.g' | sed 's.Baton Rouge Metropolitan, Ryan .Baton Rouge Metropolitan Ryan.g' | sed 's.Lawrence County Airpark,Inc.Lawrence County Airpark Inc.g' | sed 's.Westport, NY.Westport NY.g' | sed 's.Pullman/Moscow,ID.Pullman/Moscow ID.g' | sed 's.Reading Muni,Gen Carl A Spaatz.Reading Muni Gen Carl A Spaatz.g' | sed 's.Richard Lloyd Jones, Jr.Richard Lloyd Jones Jr.g' | sed 's.Toccoa, R G Le Tourneau .Toccoa R G Le Tourneau .g' | sed 's.\\"Bud\\" Barron .Bud Barron.g' | sed 's."..g' >$SCRATCH/mycleanairports.csv ----- - -The `carriers.csv` data has double quotes that we do not want: - -[source,bash] ----- -cat /anvil/projects/tdm/data/flights/subset/carriers.csv | sed 's."..g' | awk -F, '{if (NF == 3) {print $1","$2 $3} else {print $0}}' >$SCRATCH/mycleancarriers.csv ----- - -The `plane-data.csv` sometimes only has 1 column, and sometimes has 9 columns. We clean this up too: - -[source,bash] ----- -cat /anvil/projects/tdm/data/flights/subset/plane-data.csv | awk -F, '{if (NF == 9) {print $0} else {print $1",,,,,,,,"}}' >$SCRATCH/mycleanplanedata.csv ----- - -Now, also in the terminal, make a new SQLite file. 
We also make this file in the `$SCRATCH` directory, so that we do not fill up your home directory: - -[source,bash] ----- -sqlite3 $SCRATCH/newflightdatabase.db ----- - -(Whenever we want to quit the `sqlite3` program, we can hit CONTROL-D but do NOT YET hit CONTROL-D, because we still need to build the database.) - -Now we tell SQLite that our files are in ASCII format: - -[source,sql] ----- -.mode ascii ----- - -and the files to be imported are comma separated: - -[source,sql] ----- -.separator "," "\n" ----- - -and we make tables for the data, first for the flight data: - -[source,sql] ----- -CREATE TABLE flights( - "Year" INTEGER, - "Month" INTEGER, - "DayofMonth" INTEGER, - "DayOfWeek" INTEGER, - "DepTime" INTEGER, - "CRSDepTime" INTEGER, - "ArrTime" INTEGER, - "CRSArrTime" INTEGER, - "UniqueCarrier" TEXT, - "FlightNum" INTEGER, - "TailNum" TEXT, - "ActualElapsedTime" INTEGER, - "CRSElapsedTime" INTEGER, - "AirTime" INTEGER, - "ArrDelay" INTEGER, - "DepDelay" INTEGER, - "Origin" TEXT, - "Dest" TEXT, - "Distance" INTEGER, - "TaxiIn" INTEGER, - "TaxiOut" INTEGER, - "Cancelled" INTEGER, - "CancellationCode" INTEGER, - "Diverted" INTEGER, - "CarrierDelay" INTEGER, - "WeatherDelay" INTEGER, - "NASDelay" INTEGER, - "SecurityDelay" INTEGER, - "LateAircraftDelay" INTEGER -); ----- - -and for the airports data: - -[source,sql] ----- -CREATE TABLE airports( - "iata" TEXT, - "airport" TEXT, - "city" TEXT, - "state" TEXT, - "country" TEXT, - "lat" NUMERIC, - "long" NUMERIC -); ----- - -and for the carriers data: - -[source,sql] ----- -CREATE TABLE carriers( - "Code" TEXT, - "Description" TEXT -); ----- - -and for the plane data: - -[source,sql] ----- -CREATE TABLE planes( - "tailnum" TEXT, - "type" TEXT, - "manufacturer" TEXT, - "issue_date" TEXT, - "model" TEXT, - "status" TEXT, - "aircraft_type" TEXT, - "engine_type" TEXT, - "year" INTEGER -); ----- - -Next, import the actual data into the tables that we created above. The first one will take a few minutes to run! - -[WARNING] -==== -The first import statement will take all of the data from the huge file we built at the start, and put that data into our database. So it will take a few minutes to run. -==== - -[WARNING] -==== -In all 4 of these import statements, `mdw` is Dr Ward's username. You need to (instead) change `mdw` to your username from Anvil in each of the following 4 input lines! 
-==== - -[source,sql] ----- -.import --skip 1 /anvil/scratch/x-mdw/myflightdata.csv flights ----- - -and the airports data: - -[source,sql] ----- -.import --skip 1 /anvil/scratch/x-mdw/mycleanairports.csv airports ----- - -and the carriers data: - -[source,sql] ----- -.import --skip 1 /anvil/scratch/x-mdw/mycleancarriers.csv carriers ----- - -and the planes data: - -[source,sql] ----- -.import --skip 1 /anvil/scratch/x-mdw/mycleanplanedata.csv planes ----- - -Next, we want to build indices for the flight data: - -[source,sql] ----- -CREATE INDEX ix_flights_covering ON flights(Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay); ----- - -and for the airports data: - -[source,sql] ----- -CREATE INDEX ix_airports_covering ON airports(iata,airport,city,state,country,lat,long); ----- - -and for the carriers data: - -[source,sql] ----- -CREATE INDEX ix_carriers_covering ON carriers(Code,Description); ----- - -and for the planes data: - -[source,sql] ----- -CREATE INDEX ix_planes_covering ON planes(tailnum,type,manufacturer,issue_date,model,status,aircraft_type,engine_type,year); ----- - - -Finally, you can exit from SQLite by typing: `CONTROL-D`. - -Afterwards, check the size of the file that you created, and indicate the size of the file in your solutions (it should be approximately 17 GB) - -[source,bash] ----- -ls -la --block-size=G $SCRATCH/newflightdatabase.db ----- - - -.Deliverables -==== -- Because all of the work for Question 1 happens in the terminal, the *only* thing that we are asking you to put into the Jupyter Lab notebook for Question 1 is the output from this command: `ls -la --block-size=G $SCRATCH/newflightdatabase.db` which prints the file size for the database that you built in Question 1. This line should show that your database is approximately 17 GB. -==== - - -=== Question 2 (2 pts) - -[NOTE] -==== -Back in the regular Jupyter Lab notebook, using the `seminar` kernel, you can load the database that you created like this: - -`%sql sqlite:////anvil/scratch/x-mdw/newflightdatabase.db` - -but (of course) change the `mdw` to your ACCESS username. -==== - -Join the `flights` and the `airports` table, matching the `Origin` column to the `iata` column. Find the total number of flights in the database for each `Origin` airport that is located in Texas. For each `Origin` airport in Texas, print the total number of flights and the 3-letter `Origin` airport code. - -.Deliverables -==== -- For each `Origin` airport in Texas, print the total number of flights and the 3-letter `Origin` airport code. -==== - - - -=== Question 3 (2 pts) - -a. From the `flights` table, find the 10 most popular `TailNum` values, according to how many times that each `TailNum` appears in the `flights` table. For each of these top 10 `TailNum`, list the `TailNum` and the number of flights on that `TailNum`. - -b. Notice that the 5 most popular `TailNum` values are: (blank), UNKNOW, 0, NKNO, 000000. Ignoring these top 5 most popular values, in part b, we want you to consider (only) the 6th most popular `TailNum` value, which should be `N525`. 
You can read about this 6th most popular airplane here: https://www.flightaware.com/live/flight/N525 For *only* this 6th most popular airplane, with `TailNum` equal to `N525`, please make a separate query of the `flights` table that shows the top 5 `Origin` airports for this plane's flights. (Hint: This airplane has departed 2952 times from Dallas Love Field `DAL` and also 2146 times from Phoenix's Sky Harbor International Airport `PHX`.) - -.Deliverables -==== -- For each of these top 10 `TailNum`, list the `TailNum` and the number of flights on that `TailNum`. -- After identifying the 6th most popular airplane (from part a; which is the first *valid* airplane; it should have `tailnum` equal to `N525`), now find the top 5 `Origin` airports for this specific plane's flights. For each of these top 5 `Origin` airports for this plane, find the three-letter code of the `Origin` airport and the number of times that this specific airplane departed from each such `Origin`. -==== - - -=== Question 4 (2 pts) - -Now let's revisit question 3, but this time we will JOIN the `flights` table and the `planes` table ON the `TailNum` value. Group the results according to the `TailNum` and find the 10 most popular values, listing the `TailNum` value and the number of flights for each such `TailNum`. - -[NOTE] -==== -Notice that the invalid tail numbers from question 3 are gone (because they do not appear in the `planes` table) and also the `TailNum` that you discovered in question 3 is gone too (because it does not appear in the `planes` table either). Hint: The top `TailNum` for this question is `N908DE` which had `25050` flights altogether. -==== - -.Deliverables -==== -- JOIN the `flights` table and the `planes` table, to find the 10 most popular values, listing the `TailNum` value and the number of flights for each such `TailNum`. -==== - - -=== Question 5 (2 pts) - -Join the `flights` and the `carriers` table, matching the `UniqueCarrier` column to the `Code` column. Find the total number of flights in the database for each `UniqueCarrier`. For each `UniqueCarrier`, print the `UniqueCarrier` value, the `Description` value, and also the total number of flights for that `UniqueCarrier`. (Hint: Your query results should have 29 rows altogether.) - -.Deliverables -==== -- For each `UniqueCarrier`, print the `UniqueCarrier` value, the `Description` value, and also the total number of flights for that `UniqueCarrier`. -==== - - -== Submitting your Work - -We have now built on the same skills that we learned for the movies database and the baseball database, but this time, we developed our own database of airplane flights and answered questions about the database that we built! - - -.Items to submit -==== -- firstname-lastname-project12.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. 
Please do not wait until Friday afternoon or evening to complete and submit your work. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project13.adoc deleted file mode 100644 index 1b017d4a2..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project13.adoc +++ /dev/null @@ -1,172 +0,0 @@ -= TDM 20100: Project 13 -- SQL - -**Motivation:** We have used three SQL databases directly, from the SQL prompt. Now we demonstrate how to make SQL calls from R, so that we can (for example) make plots related to our SQL queries. - -**Context:** When we make a SQL call in R, the data is returned as an R database. - -**Scope:** This project will synthesize what you have learned about how to make database calls from SQL and how to use R to visualize data from data frames. - -.Learning Objectives: -**** -- We will make SQL calls from R. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following three databases: - -- `/anvil/projects/tdm/data/lahman/lahman.db` (Lahman baseball database) -- `/anvil/projects/tdm/data/movies_and_tv/imdb2024.db` (Internet Movie DataBase (IMDB)) -- `/anvil/scratch/x-mdw/newflightdatabase.db` (the flight database that you built in Project 12) - -[NOTE] -==== -Please change `mdw` to your username in the location of the flight database. -==== - - - -== Questions - -You can load data from a SQL query into R (using the `seminar-r` kernel in Jupyter Lab) as follows, for example: - -[source,r] ----- -library(RSQLite) -conn <- dbConnect(RSQLite::SQLite(), "/anvil/projects/tdm/data/lahman/lahman.db") -myDF <- dbGetQuery(conn, "SELECT * FROM batting LIMIT 5;") -head(myDF) ----- - -Once you have the connection to the database loaded, you can make more queries, without needing to re-run the `dbConnect` line. You can go directly to another `dbGetQuery` line, like this: - -[source,r] ----- -myDF <- dbGetQuery(conn, "SELECT * FROM pitching LIMIT 5;") -head(myDF) ----- - -If your kernel dies at any point, or if you start a new session, you will obviously need to go back and re-load your `RSQLite` library and also re-connect to the database, using the `dbConnect` command. - -You can even make complex queries, for instance: - -[source,r] ----- -myDF <- dbGetQuery(conn, "SELECT * FROM batting as b JOIN people as p - ON b.playerID = p.playerID - WHERE p.nameFirst = 'Rickey' - AND p.nameLast = 'Henderson';") -head(myDF) ----- - -and we can plot the data, for instance, like this: -[source,r] ----- -myDF <- dbGetQuery(conn, "SELECT b.R as myruns, b.yearID as myyears - FROM batting as b JOIN people as p - ON b.playerID = p.playerID - WHERE p.nameFirst = 'Rickey' - AND p.nameLast = 'Henderson' - GROUP BY b.yearID;") -plot(myDF$myyears, myDF$myruns) ----- - - - -=== Question 1 (2 pts) - -Using the `seminar-r` kernel in Jupyter Lab, open a connection to the Lahman database using the `dbConnect` process that is outlined above. - -Revisit your work from Project 8, Question 4, using the Lahman baseball database, but this time, make a `dotchart`, as follows: - -Use the Batting table to find the top 5 players of all time, in terms of their total number of hits, in other words, according to SUM(H). Instead of printing the output, this time make a dotchart with 5 rows. 
Each row should show the `playerID` of each player and the total number of hits in each of their careers. - -.Deliverables -==== -Make a dotchart with 5 rows for the top 5 players of all time, in terms of their total number of hits, `SUM(H)`. Each row should show the `playerID` of each player and the total number of hits in each of their careers. -==== - - -=== Question 2 (2 pts) - -Revisit your work from Project 8, Question 5, using the Lahman baseball database, but this time, make a `dotchart`, as follows: - -Consider the Schools table, group together the schools in each state. Find the number of schools in each group, using `SELECT COUNT(*) as mycounts, state` so that you see how many schools are in each state, and the state abbreviation too. Order your results according to the values of mycounts in descending order (which is denoted by DESC), in other words, the states with the most schools are printed first in your list. - -In this way, by using LIMIT 5, you can make a dotchart that displays the 5 states with the most schools, and the number of schools in each state. - -.Deliverables -==== -Make a dotchart that displays the 5 states with the most schools, and the number of schools in each state. -==== - - - -=== Question 3 (2 pts) - -Revisit your work from Project 11, Question 2, using the IMDB Movies database, but this time, make a `dotchart`, as follows: - -Join the ratings and the basics table, to find the 13 titles that each have more than 2 million ratings. Make a dotchart for these 13 titles, showing the `primaryTitle` and the number of ratings for each of these 13 titles. - - - -.Deliverables -==== -Make a dotchart for these 13 titles, showing the `primaryTitle` and the number of ratings for each of these 13 titles. -==== - - -=== Question 4 (2 pts) - -Revisit your work from Project 11, Question 3, using the IMDB Movies database, but this time, make a `plot`, as follows: - -a. Using the startYear values from the basics table, find the total number of entries in each startYear. Make a plot that shows the `startYear` on the x-axis and the number of entries from each `startYear` on the y-axis. - -b. Now fix your plot from part (a), so that you only show the results in which `myDF$startYear > 0`. - - - -.Deliverables -==== -- Make a plot that shows the `startYear` on the x-axis and the number of entries from each `startYear` on the y-axis. -- Now fix your plot from part (a), so that you only show the results in which `myDF$startYear > 0`. -==== - - -=== Question 5 (2 pts) - -Revisit your work from Project 12, Question 2, using the flights database that you built, but this time, make a `dotchart`, as follows: - -Join the `flights` and the `airports` table, matching the `Origin` column to the `iata` column. Find the total number of flights in the database for each `Origin` airport that is located in Texas. Make a dotchart that shows, for each `Origin` airport in Texas, the total number of flights and the 3-letter `Origin` airport code. - -[NOTE] -==== -There are 29 airports in Texas that should appear in your `dotchart`. It is OK to put the `dotchart` in any order that you like, i.e., in numerical order, alphabetical order, or any other order is OK! -==== - -.Deliverables -==== -Make a dotchart that shows, for each `Origin` airport in Texas, the total number of flights and the 3-letter `Origin` airport code. -==== - - -== Submitting your Work - -Now we know how to leverage our knowledge of SQL when working in R! 
- - - -.Items to submit -==== -- firstname-lastname-project13.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project14.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project14.adoc deleted file mode 100644 index 3c8f0e1ee..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project14.adoc +++ /dev/null @@ -1,89 +0,0 @@ -= TDM 20100: Project 14 -- 2024 - -**Motivation:** We covered a _lot_ this semester, including shell scripts and databases. We hope that you have had the opportunity to learn a lot, and to improve your data science skills. For our final project of the semester, we want to provide you with the opportunity to give us your feedback on how we connected different concepts, built up skills, and incorporated real-world data throughout the semester, along with showcasing the skills you learned throughout the past 13 projects! - -**Context:** This last project will work as a consolidation of everything we've learned thus far, and may require you to back-reference your work from earlier in the semester. - -**Scope:** reflections on Data Science learning - -.Learning Objectives: -**** -- Reflect on the semester's content as a whole -- Offer your thoughts on how the class could be improved in the future -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 (2 pts) - -The Data Mine team is writing a Data Mine book to be (hopefully) published in 2025. We would love to have a couple of paragraphs about your Data Mine experience. What aspects of The Data Mine made the biggest impact on your academic, personal, and/or professional career? Would you recommend The Data Mine to a friend and/or would you recommend The Data Mine to colleagues in industry, and why? You are welcome to cover other topics too! Please also indicate (yes/no) whether it would be OK to publish your comments in our forthcoming Data Mine book in 2025. - -.Deliverables -==== -Feedback and reflections about The Data Mine that we can potentially publish in a book in 2025. -==== - -=== Question 2 (2 pts) - -Reflecting on your experience working with different datasets, which one did you find most enjoyable, and why? Discuss how this dataset's features influenced your analysis and visualization strategies. Illustrate your explanation with an example from one question that you worked on, using the dataset. - -.Deliverables -==== -- A markdown cell detailing your favorite dataset, why, and a working example and question you did involving that dataset. 
-==== - -=== Question 3 (2 pts) - -While working on the projects, how did you validate the results that your code produced? For instance, did you try to solve problems in 2 different ways? Or did you try to make summaries and/or visualizations? How did you prefer to explore data and learn about data? Are there better ways that you would suggest for future students (and for our team too)? Please illustrate your approach using an example from one problem that you addressed this semester. - -.Deliverables -==== -- A few sentences in a markdown cell on how you conducted your work, and a relevant working example. -==== - -=== Question 4 (2 pts) - -Reflecting on the projects that you completed, which question(s) did you feel were most confusing, and how could they be made clearer? Please cite specific questions and explain both how they confused you and how you would recommend improving them. - -.Deliverables -==== -- A few sentences in a markdown cell on which questions from projects you found confusing, and how they could be written better/more clearly, along with specific examples. -==== - -=== Question 5 (2 pts) - -Please identify 3 skills or topics related to the bash shell, or databases, or data science (in general) that you wish we had covered in our projects. For each, please provide an example that illustrates your interests, and the reason that you think they would be beneficial. - -.Deliverables -==== -- A markdown cell containing 3 skills/topics that you think we should've covered in the projects, and an example of why you believe these topics or skills could be relevant and beneficial to students going through the course. -==== -=== OPTIONAL but encouraged: - -Please connect with Dr Ward on LinkedIn: https://www.linkedin.com/in/mdw333/ - -and also please follow our Data Mine LinkedIn page: https://www.linkedin.com/company/purduedatamine/ - -and join our Data Mine alumni page: https://www.linkedin.com/groups/14550101/ - - - -== Submitting your Work - -If there are any final thoughts you have on the course as a whole, be it logistics, technical difficulties, or nuances of course structuring and content that we haven't yet given you the opportunity to voice, now is the time. We truly welcome your feedback! Feel free to add as much discussion as necessary to your project, letting us know how we succeeded, where we failed, and what we can do to make this experience better for all our students and partners in 2025 and beyond. - -We hope you enjoyed the class, and we look forward to seeing you next semester! - -.Items to submit -==== -- firstname_lastname_project14.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project2-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project2-teachinglearning-backup.adoc deleted file mode 100644 index 8abdc9209..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project2-teachinglearning-backup.adoc +++ /dev/null @@ -1,316 +0,0 @@ -= TDM 20100: Project 2 -- Manipulating the Filesystem - -**Motivation:** In the previous project we took a minute to get re-familiarized with working on Anvil before diving straight into the CLI, learning how to move around the filesystem. Now that we know how to move around, we are ready to learn how to manipulate the filesystem. By learning to create, destroy, and move files and directories, along with some basic commands to begin to analyze files, we will be well on our way to performing some primitive forms of data analysis using nothing but the terminal! - -**Context:** The ability to use `cd`, `pwd`, and `ls` to orient yourself in the filesystem, along with a basic understanding of `man` pages, will make this project drastically easier on you. - -**Scope:** Anvil, Jupyter Labs, CLI, Bash, GNU, filesystem manipulation - -.Learning Objectives: -**** -- Learn how to create and destroy files from the CLI -- Learn how to create and destroy directories from the CLI -- Learn how to move files and directories around the filesystem -- Learn about basic file analysis commands -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/*` (we will use many datasets, in brief) - -== Questions - -=== Question 1 (2 pts) - -To start, let's establish a convention that will make your work on these next few projects _much_ easier. Any time that we can, we should keep all of our work in one place: the `$SCRATCH` directory. Not only will this help you keep track of everything, but it will also make sure that you don't accidentally delete any of your projects by mistake (assuming you keep your projects in your `$HOME` directory, as many do). So just remember to start your code cells off with `cd $SCRATCH` whenever appropriate, and never be afraid to `pwd` a whole bunch to be sure that you're in the right place. - -Let's talk about creating and removing files. As you'll see with many things in the terminal there are a lot of different ways to accomplish the same task. - -First off, the creation of files. While our more Bash-savvy students may say "Just open your favorite text editor and it'll make a new file for you!", we are going to refrain from discussing CLI text-editors like Vim and Nano, as their applications would not be very directly helpful for this class. Instead, we'll discuss the `touch` command. Creating a file is as easy as using `touch filename` to create a new file called `filename` in your current working directory (the one printed by `pwd`). However, it is often best practice to make sure that your file has the proper extension (like `.txt`, `.xdocx`, `.pdf`, and more!) in order to ensure the computer knows how to open your file. - -Deletion of files is similarly easy using the `rm` command, where `rm filename` will delete a given file. 
Both `touch` and `rm` have their fair share of optional arguments, which you can examine in detail by viewing their respective man pages (using `man touch`, for example). Be sure to exercise _great_ caution when using `rm`, as this isn't like dropping a file into the recycling bin on your desktop. When you `rm` a file, it is gone. Permanently. End of discussion (...mostly). Do _not_ use `rm` before being sure that you're okay with deleting what you're deleting, and be sure you understand any arguments you may be using. - -[IMPORTANT] -==== -For an added degree of caution, you can use `rm -i` to be prompted before the final removal of a file. This is a good safeguard when just starting out with `rm` -==== - -Thirdly, let's discuss putting content into files. While once again this can be done with text editors, this course will not focus on the manual population of file contents and will instead deal with processing and managing data using tools. By-hand manipulation of real-world data is, after-all, completely impractical! - -Try running the following code, which creates a file named `start.txt` in your scratch directory, adds the text "The meaning of life is the number 42." to the file, `echo`s the contents of the file to the console, and then removes the file. - -[source, python] ----- -%%bash -cd $SCRATCH - -touch starter.txt # create the file starter.txt -echo "The meaning of life is the number 42." > starter.txt # add contents to starter.txt -echo " -------------------------- " # spacer -cat starter.txt # print contents of file -rm starter.txt ----- - -For this question, we want you to write code that creates a new file called `greeting.txt` in your scratch directory, fills in the file with the text "Hello World!", prints the contents of the file, and then removes the file. Feel free to do this all in one cell, or spread among multiple cells if you prefer. Feel free to refer to the above example code for a _very_ big headstart into this question. - -.Deliverables -==== -- Bash to create, populate, then delete `greeting.txt` as specified -==== - -=== Question 2 (2 pts) - -Next let's discuss creating directories. After all, computers have a _lot_ of files. We would be limited in a lot of ways if we stored everything in one place. - -Creating directories can be done using `mkdir`. For example, `mkdir fakefolder` would create a new directory called `fakefolder`. By using `rmdir`, we can easily remove directories. - -Run the below example step-by-step, examining the outputs of each step. - -[source, python] ----- -%%bash -cd $SCRATCH - -# Step 1 - Show the directory doesn't exist -ls -l - -# Step 2 - Create the directory, and show it exists -mkdir fakefolder -ls -l - -# Step 3 - Delete the directory, and show it no longer exists -rmdir fakefolder -ls -l ----- - -Next, let's take a look at a more complex example. While `rmdir` is capable of deleting empty directories, it struggles with directories that still have contents we also want to delete. Run the below example, and observe the resulting error: - -[source, python] ----- -%%bash -cd $SCRATCH -mkdir fakefolder -touch fakefolder/fakefile -rmdir fakefolder ----- - -For directories with contents, we'll have to refer back to trusty old `rm`. Passing the recursive flag, `-r`, to `rm` will cause it to delete a directory and all its contents. With that knowledge in mind, take another few seconds to recognize the amount of damage one could accidentally due with misuse of these commands. 
Once you've reflected on the possible consequences of using `rm` without caution, try and complete the below activities. - -. Create a directory called `emptycase` in the scratch directory -. Remove `emptycase` and then list the contents of `$SCRATCH` -. Create a directory called `fullcase` in the scratch directory -. Create a new file, `contents.txt`, within `fullcase` -. List the contents of `$SCRATCH`, then list the contents of `fullcase` -. Remove `fullcase` using the recursive argument to `rm` -. List the contents of `$SCRATCH` - - -.Deliverables -==== -- Commands to complete the above 7 steps -==== - -=== Question 3 (2 pts) - -In this question, we'll take a look at _moving_ and _copying_ files. Again, there are many ways of accomplishing this, with `mv` and `cp` being some of the more common ones. Let's briefly discuss both. - -- http://man.he.net/?topic=mv§ion=all[`mv`] can be used to move or rename files and directories -- http://man.he.net/?topic=cp§ion=all[`cp`] can be used to copy files and directories to other locations - -Below are a few example snippets of each. Take a look and feel free to run them on your own, as understanding each of these separately will better enable you to tackle the tasks at the end of the question. Be sure to run each snippet in order to ensure they run correctly. - -[source, Python] ----- -%%bash -cd $SCRATCH - -# set-up for other examples -mkdir dir1 -touch dir1/file2 - -mkdir dir2 -touch dir2/file1 -touch dir2/file3 - -# prints results -echo dir1: -ls -l dir1 -echo -echo dir2: -ls -l dir2 ----- - -[source, Python] ----- -%%bash -cd $SCRATCH - -# copy file2 from dir1 to dir2, then delete the dir1 version -cp dir1/file2 dir2/file2 -rm dir1/file2 - -# prints results -echo dir1: -ls -l dir1 -echo -echo dir2: -ls -l dir2 ----- - -[source, Python] ----- -%%bash -cd $SCRATCH - -# move files 1 and 2 from dir2 to dir1 -# note that * is another way of saying 'all files' here -mv dir2/* dir1/ - -# rename dir2 to dir3 -mv dir2 dir3 - -# prints results -echo dir1: -ls -l dir1 -echo -echo dir3: -ls -l dir3 ----- - -Using the above code snippets as a guide, for this question we want you to: - -. Create two new directories in `$SCRATCH`, `directoryA` and `directoryB` -. Create a new file called `fileA` in `directoryB`, and a matching `fileB` in `directoryA` -. Make the contents of `fileA`, `This is file A!` -. Make the contents of `fileB`, `This is file C!` -. Move `fileA` to `directoryA` and `fileB` to `directoryB` -. Rename `directoryB` to `directoryC` and `fileB` to `fileC` -. Use `ls -l` on both directories to show your final results. - -.Deliverables -==== -- Bash code to perform the above instructions -==== - -=== Question 4 (2 pts) - -With the creation, deletion, and movement of files handled, let's now get into some basic tools for analyzing files. We'll start with printing the first few lines using https://explainshell.com/explain/1/head[`head`], the last few lines using https://explainshell.com/explain?cmd=tail[`tail`], and the total contents of a file using https://explainshell.com/explain?cmd=cat[`cat`]. - -`head` can be used to print the first 5 lines of a file (when it isn't told to print more or less). By using the `-n` flag, we can tell it to print an exact number of lines as well. `tail` can be thought of as the exact same as `head`, but starting from the ending of the file and moving backwards. For example, `head` will get the first 5 lines of a file by default while `tail` will get the last 5 lines. See below for a concrete example. 
- -[source, python] ----- -%%bash -# prints the first 5 lines of USvideos.csv -head /anvil/projects/tdm/data/youtube/USvideos.csv - -# prints the last 5 lines of USvideos.csv -tail /anvil/projects/tdm/data/youtube/USvideos.csv - -# print the first 2 lines of destinations.csv -head -n 2 /anvil/projects/tdm/data/expedia/destinations.csv -# equivalently (notice the spacing around -n2): -head -n2 /anvil/projects/tdm/data/expedia/destinations.csv - -# print the last line of destinations.csv -tail -n 1 /anvil/projects/tdm/data/expedia/destinations.csv ----- - -[NOTE] -==== -Observe that often, the first line of a data file contains the titles of the columns in the data -==== - -Additionally (although not as often), we will want to view the contents of a file in whole. For this, `cat` is perfect. Try running the below code, and observe it's effects: - -[source, python] ----- -%%bash - -# prints all the contents of the file readme-by_year.txt -cat /anvil/projects/tdm/data/noaa/readme-by_year.txt ----- - -For this question, we want you to do the following: - -. print the first 3 lines of `/anvil/projects/tdm/data/election/itcont1986.txt` -. print the last 2 lines of `/anvil/projects/tdm/data/craigslist/vehicles_clean.txt` -. print the contents of `/anvil/projects/tdm/data/noaa/status-by_year.txt` - -.Deliverables -==== -- The file contents requested above, and the commands to get them -==== - -=== Question 5 (2 pts) - -With these basic tools to look at the contents of a file covered, let's talk about two commands useful to discover more about the _size_ and _structure_ of our file: `wc` and `du`. - -`wc`, which stands for _word count_, is actually capable of much more than simply counting the words in a file! Take a look at some of the below examples, along with https://explainshell.com/explain/1/wc[this man page], for some ideas about the power of `wc`. - -[source, python] ----- -%%bash - -# prints newline, then word, then byte counts for 2012.csv -wc /anvil/projects/tdm/data/stackoverflow/processed/2012.csv - -# prints just word count for 2012.csv -wc -w /anvil/projects/tdm/data/stackoverflow/processed/2012.csv - -# prints just byte count for 2012.csv -wc -c /anvil/projects/tdm/data/stackoverflow/processed/2012.csv ----- - -Where `wc` examines the number of lines, bytes, or characters _within_ a file, `du` (which stands for disk usage) measures the total disc space occupied by files and directories. Again, review https://explainshell.com/explain/1/du[the man page for `du`] and the below examples, and then move onto the tasks for the final set of tasks for this project. - -[source, python] ----- -%%bash - -# print the number of bytes that all of the processed directory is taking up -du -b /anvil/projects/tdm/data/stackoverflow/processed - -# prints the number of kilobytes that the processed directory is taking up -du --block-size=KB /anvil/projects/tdm/data/stackoverflow/processed - -# prints the number of kilobytes that each file in the processed directory is taking up -du --block-size=KB -a /anvil/projects/tdm/data/stackoverflow/processed ----- - -. How many lines are there in `/anvil/projects/tdm/data/beer/beerfile1.csv`? (Hint: `wc -l`) -. What is the length of the longest line in `/anvil/projects/tdm/data/beer/beerfile1.csv`? (Hint: `wc` has another argument to do this!) -. What is the size of the `/anvil/projects/tdm/data/beer/` directory in megabytes? -. What is the size of each individual file in `/anvil/projects/tdm/data/beer/`, in megabytes? 
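Though not required for the tasks above, it may also help to know that `du` can report sizes in human-readable units. The snippet below is only an illustrative sketch using a throwaway file created in `$SCRATCH`; the `du_demo` directory and `onemeg.bin` file are invented for demonstration purposes.

[source, python]
----
%%bash
cd $SCRATCH

# create a small throwaway directory containing a 1 MB file
mkdir -p du_demo
head -c 1048576 /dev/zero > du_demo/onemeg.bin

# -s collapses the directory to a single total; -h prints human-readable units
du -sh du_demo

# -h -a prints a human-readable size for every file inside the directory
du -h -a du_demo

# clean up the demo files
rm -r du_demo
----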
- -[NOTE] -==== -If you're struggling with some of these tasks, _please_ refer back to the man pages or the https://explainshell.com[explainshell] page for a given command for some _strong_ hints on where to go next. -==== - -.Deliverables -==== -- The sizes requested above, and the commands used to produce these sizes -==== - -== Submitting your Work - -Congratulations! With this project complete, you're now familiar with all of the basics of the command line! With these tools in your belt, you can now explore, analyze, and manipulate a large part of Anvil at your whims! Please don't use your newfound powers for evil though... - -In the next project, we'll be building on these more primal analysis tools by introducing some more complex commands that allow us to perform specific search-and-return processes on data. From there, the sky is the limit, and we will be ready to dive into one of the most useful and important concepts in all of code: *pipelines* - -.Items to submit -==== -- firstname-lastname-project2.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project2.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project2.adoc deleted file mode 100644 index 8047cd931..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project2.adoc +++ /dev/null @@ -1,215 +0,0 @@ -= TDM 20100: Project 2 -- Working with the bash shell - -**Motivation:** In the previous project we became (re-)familiarized with working on Anvil, before diving straight into using the bash shell (the command line interface). By learning to create, destroy, and move files and directories, along with some basic commands to begin to analyze files, we will be well on our way to performing some primitive forms of data analysis, using nothing but the terminal! - -**Context:** The ability to use bash shell commands such as `cat`, `cd`, `du`, `ls`, `mv`, `pwd`, `rm`, `sort`, `uniq`, `wc`, to get familiar with the bash shell, and get a basic understanding of the `man` (manual) pages, will enable you to see some of the power and speed of using the bash shell. - -**Scope:** Anvil, Jupyter Labs, CLI, Bash, GNU, filesystem manipulation - -.Learning Objectives: -**** -- Learn how to create and destroy files from the CLI -- Learn how to create and destroy directories from the CLI -- Learn how to move files and directories around the filesystem -- Learn about basic file analysis commands -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/` (airplane data) -- `/anvil/projects/tdm/data/election` (election data) -- `/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` (grocery store data) - -== Questions - -=== Question 1 (2 pts) - -In your `$HOME` directory, you can store only 25GB of data, but in your `$SCRATCH` directory, you can store up to 200TB of data. - -Your `$SCRATCH` directory is not intended for long-term storage, and it can be erased by the system administrators at regular points in time. Nonetheless, it can be very helpful for working on data sets that do not need to be stored for a long time. - -Your project templates (and all of your Jupyter Lab files) should be stored in your `$HOME` directory, but it is OK to put temporary data files into your `$SCRATCH` directory. - -Make a new file called `myflights.csv` in the `$SCRATCH` directory that has only the first line of the `1987.csv` file. - -Now take all of the `csv` data files called `1987.csv` through `2008.csv` from the `/anvil/projects/tdm/data/flights/subset/` directory and add their rows of data, one at a time, to the `myflights.csv` file. Be sure to *not* add the headers of these files. To accomplish this, use the `grep` command with the `-h` and `-v` options. (The `-h` option is used to hide the name of the file in the results, and the `-v` option is used to avoid any lines of the files that have the word "Year".) To append data to the end of a file, use ">>". - -[IMPORTANT] -==== -`mycommand myfile1.txt >myfile2.txt` will run `mycommand` on `myfile1.txt` and will save the results as `myfile2.txt`, destroying whatever was previously in `myfile2.txt`. - -In contrast, `mycommand myfile1.txt >>myfile2.txt` will run `mycommand` on `myfile1.txt` and will append the results to the end of `myfile2.txt`, without destroying whatever was previously in `myfile2.txt`. -==== - -Now check that the resulting file has the correct number of lines. - -The original files `1987.csv` through `2008.csv` have a total of 118914480 lines. - -The file `myflights.csv` has all of these lines, except for the 22 header lines from the 22 respective files, plus it has the header from the `1987.csv` file. So it should have a total of 118914480 - 22 + 1 = 118914459 lines. - - -Note: `wc`, which stands for _word count_, is actually capable of much more than simply counting the words in a file! Take a look at some of the below examples, along with https://explainshell.com/explain/1/wc[this man page], for some ideas about the power of `wc`. The `wc` command gives the number of lines, bytes, or characters _within_ a file. - -[source, python] ----- -%%bash - -# prints line count, then word count, then byte count for `2012.csv` -wc /anvil/projects/tdm/data/stackoverflow/processed/2012.csv - -# prints just the line count for `2012.csv` -wc -l /anvil/projects/tdm/data/stackoverflow/processed/2012.csv - -# prints just the word count for `2012.csv` -wc -w /anvil/projects/tdm/data/stackoverflow/processed/2012.csv - -# prints just the byte count for `2012.csv` -wc -c /anvil/projects/tdm/data/stackoverflow/processed/2012.csv ----- - -Another note: The `du` command (which stands for disk usage) measures the total disc space occupied by files and directories. Again, review https://explainshell.com/explain/1/du[the man page for `du`] and the below examples, and then move onto the tasks for the final set of tasks for this project. 
- -[source, python] ----- -%%bash - -# print the number of bytes that all of the processed directory is taking up -du -b /anvil/projects/tdm/data/stackoverflow/processed - -# prints the number of kilobytes that the processed directory is taking up -du --block-size=KB /anvil/projects/tdm/data/stackoverflow/processed - -# prints the number of kilobytes that each file in the processed directory is taking up -du --block-size=KB -a /anvil/projects/tdm/data/stackoverflow/processed ----- - -++++ - -++++ - - -.Deliverables -==== -- Show the output from running `wc $SCRATCH myflights.csv` (which will demonstrate that you produced a file with 118914459 lines). -- Show the head of the file, namely: `head $SCRATCH myflights.csv` (which should have the header and the data about 9 flights from 1987). -- As *always*, be sure to document your work from Question 1 (and from all of the questions!), using some comments and insights about your work. We will stop adding this note to document your work, but please remember, we always assume that you will *document every single question with your comments and your insights*. -==== - -=== Question 2 (2 pts) - -Sometimes we want to copy files directly. Let's create a new directory in our `$SCRATCH` folder and copy all of those files with flight data (`1987.csv` through `2008.csv`) into that directory. Call the directory `myfolder`. Inside that folder, after those files are copied, build another file (like in Question 1) called `myflightsremix.csv`. Finally, compare these two files, using `cmp $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv` (if the files are exactly the same, there should be no output because the files have no differences). Also compare them by running: `ls -la $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv` which should demonstrate that they are the same size. Check `wc $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv` to ensure that they have the same number of lines, words, and bytes. - -Now go back to the scratch directory and remove this folder and its contents, using: `cd $SCRATCH` and then `rm -r $SCRATCH/myfolder` - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - - - -.Deliverables -==== -- Show the output of: `cmp $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv` (which should be empty output, i.e., it should not do anything, because these files should have no differences) -- Show the output of: `ls -la $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv` (which should demonstrate that they are the same size) -- Show the output of: `wc $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv` (to ensure that they have the same number of lines, words, and bytes) -- Then throw away the folder `$SCRATCH/myfolder` and finally show `ls -la $SCRATCH` to demonstrate that the folder `$SCRATCH/myfolder` is gone! -==== - -=== Question 3 (2 pts) - -Copy the files `itcont1980.txt` through `itcont2024.txt` from the directory `/anvil/projects/tdm/data/election` into your `$SCRATCH` directory. Then create a new directory called `mytemporarydirectory` in your `$SCRATCH` directory and move all of these election files into that new directory. Finally, put the content from all of these election files into a new file called `myelectiondata.txt`. Check the size of this new file using the `wc` command. When you are finished, it is OK to remove the directory `myelectiondata` from the `$SCRATCH` directory. 
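If it helps to see the overall shape of this workflow before running it on the (much larger) election files, here is a small rehearsal of the same copy / move / combine / count sequence. This is only a sketch on tiny throwaway files; the `demo_source` and `mydemodirectory` names below are invented purely for illustration.

[source, python]
----
%%bash
cd $SCRATCH

# make a couple of tiny throwaway files to stand in for the election files
mkdir -p demo_source
printf 'line one\nline two\n' > demo_source/part1.txt
printf 'line three\n' > demo_source/part2.txt

# copy the files into $SCRATCH, make a new directory, and move them into it
cp demo_source/part1.txt demo_source/part2.txt $SCRATCH
mkdir -p mydemodirectory
mv part1.txt part2.txt mydemodirectory

# combine the contents into one file and check its size with wc
cat mydemodirectory/part1.txt mydemodirectory/part2.txt > mydemodirectory/combined.txt
wc mydemodirectory/combined.txt

# clean up the demo files
rm -r demo_source mydemodirectory
----

Once the pattern is clear on small files, the same sequence scales directly to the election files described above.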
- -++++ - -++++ - -.Deliverables -==== -- Show the output of: `wc mytemporarydirectory/myelectiondata.txt` (which should show that the file has 229169299 lines and 1385963208 words and 42790681570 bytes). -==== - - -=== Question 4 (2 pts) - -Extract the Origin and Destination columns from all of the files `1987.csv` to `2008.csv` in the directory `/anvil/projects/tdm/data/flights/subset`. Save these origins and destinations into a file called `$SCRATCH/myoriginsanddestinations.txt` - -Then sort this data and save the results to: `$SCRATCH/mysortedoriginsanddestinations.txt` - -Then use the `uniq -c` command to get the counts corresponding to the number of times that each flight path occurred: `$SCRATCH/mycounts.txt` Note: you need to sort the file before using `uniq -c` - -Now sort the file again, this time in numerical order, using `sort -n` and save the results to `$SCRATCH/mysortedcounts.txt` - -Finally display the `tail` of the file, which contains the 10 most popular flight paths from the years 1987 to 2008 and the number of times that airplanes flew on each of these flight paths. - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -.Deliverables -==== -- Show the 10 most popular flight paths from the years 1987 to 2008 and the number of times that airplanes flew on each of these flight paths. -==== - -=== Question 5 (2 pts) - -Use the `cut` command with the flags `-d, -f7` to extract the `STORE_R` values from this file: - -`/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` - -Then use the techniques that you learned in Question 4, to discover how many times that each of the `STORE_R` values appear in the file. - -++++ - -++++ - -++++ - -++++ - - -.Deliverables -==== -- List the number of times that each of the `STORE_R` values appear in the file. -==== - -== Submitting your Work - -Congratulations! With this project complete, you're now familiar with many of the basic uses of the command line! With these tools in your belt, you can now explore, analyze, and manipulate a large part of Anvil at your whims! Please don't use your newfound powers for evil! - -In the next project, we'll be building on these more primal analysis tools by introducing some more complex commands that allow us to perform specific search-and-return processes on data. From there, the sky is the limit, and we will be ready to dive into one of the most useful and important concepts in all of code: *pipelines*. More to come! - -.Items to submit -==== -- firstname-lastname-project2.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project3-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project3-teachinglearning-backup.adoc deleted file mode 100644 index 48835111d..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project3-teachinglearning-backup.adoc +++ /dev/null @@ -1,214 +0,0 @@ -= TDM 20100: Project 3 -- Pattern Matching and `grep` - -**Motivation:** Since the release of Google in 1998, it's popularity has exploded to the point where it has become a household name. Even when someone doesn't literally mean "Search on Google.com", the phrase "Google it!" is often used as a universal call to search for a word or phrase. While we've already learned a lot about traversing and manipulating filesystems through a CLI, we still have another big hurdle to cross: finding specific files. In the CLI, we can't simply "Google it". Instead, we use pattern-matching tools like `grep` to search for the things we want. This project will be dedicated to exploring the basics of `grep` and the idea of pattern-matching and regular expressions more generally. - -**Context:** Being able to perform basic filesystem navigation and manipulation, as learned in the previous two projects, will be vital for completing this project - -**Scope:** Pattern-matching, `grep`, Regular Expressions, Bash, CLI - -.Learning Objectives: -**** -- Learn about the basic concept of pattern-matching -- Use `grep` to find specific data in a file -- Learn about the basic idea of RegEx and its usage -- Use `cut` to isolate certain parts of data -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/olympics/athlete_events.csv` -- `/anvil/projects/tdm/data/icecream/combined/products.csv` - -== Questions - -=== Question 1 (2 pts) - -Before we dive into `grep` and regular expressions, let's take a second to understand the concept of pattern-matching more generally. Pattern-matching always involves two central things: a pattern to search for, and some string of data or files to search for that pattern in. - -While this idea seems rather straightforward on its face, it does contain a lot of depth. For example, let's say that we are searching for the pattern "aa" In a string of characters "aabbabbbaaa". There are a few different ways we could interpret this simple pattern, shown below: - -- We only want to find the first instance of "aa", so we find **aa**bbabbbaaa -- We don't care if "aa" has some letters between it, so we can now recognize "abbba" as a match as well -- We don't care if patterns overlap, meaning "aaa" actually contains two different "aa" pairs within it -- We only want to recognize a "match" when it has "aa" in it exactly _n_ times, where _n_ is some arbitrary integer - -As you can see, there is an _enormous_ number of different interpretations for pattern matching, so much so that giving a pattern of simply "aa" is just not specific enough. To solve this issue, regular expressions were developed! Regular expressions, (often referred to simply as regex or regexp) provide us with a concise way to create very specific patterns. - -For example, if we wanted to search for the word "pizza" occurring at least once, we could use the pattern `(pizza)+`. 
If we wanted pizza to occur 5 times, we could instead use the pattern `(pizza){5}`. If we wanted to search for anything that contains "pizza" or "hamburger" at least once, we could use "(pizza|hamburger)+". - -Let's take a look at another brief example, which will also serve to introduce `grep` (although we'll wait on explaining until this next question!) and use a regex to search for the letter "a": - -[source, Python] ----- -%%bash -# create file for example -cd $SCRATCH -echo "Hello amigo!" > ex1.txt -echo "This line doesn't hold the first letter of the english lexicon!" >> ex1.txt -echo "But this one can and does, just more than once!" >> ex1.txt - -grep -E "a" ex1.txt - -# remove example files -rm ex1.txt ----- - -Give https://cheatography.com/davechild/cheat-sheets/regular-expressions/[this excellent Reg Ex reference] a quick glance, and use it to attempt to answer the following question. We will also provide a test case for you to use to verify your answer below. - -Create a pattern to search for "passwordXXX" where "XXX" can be any 3 digits from 0-9. If you do it correctly, the below example should verify your work. - -[NOTE] -==== -https://regex101.com/[This website] is a great resource for building and checking your regular expressions, and we would encourage that your first build and test your expressions here before trying them on the command line. -==== - -[source, Python] ----- -%%bash -# create file for test -cd $SCRATCH -echo "PassSSSssWord" > test1.txt -echo "This line doesn't hold thepassword903first letter of the english lexicon!" >> test1.txt -echo "But this one password 232 can and does!" >> test1.txt -echo "Your answer: " - -grep -E "INSERT YOUR PATTERN HERE" test1.txt - -echo "-----------------------------" -echo "Correct answer: " -echo "This line doesn't hold thepassword903first letter of the english lexicon!" - -# remove example files -rm test1.txt ----- - -[IMPORTANT] -==== -Solely for this question, you may not understand all that's going on (i.e. the presence of the `-E` argument to grep). That is okay, as long as you insert your pattern where specified and get the correct answer. We will go into much greater detail on these arguments later in the project. -==== - -.Deliverables -==== -- The correct RegEx pattern as specified above -==== - -=== Question 2 (2 pts) - -As you're likely realizing, pattern-matching is a ridiculously deep concept and as such will take time to master. Along with all the other concepts we've learned so far in this class, we will continue to explore and apply them to real data and deepen our understanding through repeated usage and development of practical skills. - -As you saw in the last question, our regex patterns are only part of our search, and are actually an argument passed to `grep`. In this question, we'll further explore `grep`, some common flags we pass to it, and how it can be used for efficient searching of the filesystem. - -First off, what is `grep`? `grep` stands for Global Regular Expression Print and searches through a series of characters, lines, files, or directories for a given regex pattern, returning its results in a variety of forms as desired by the user. - -Let's talk about the **E**lephant in the room: The `-E` flag we passed to `grep` in the last question. While regex standards are typically _very_ similar between languages, there are slight differences that lead to what we call different _flavors_ of regex. 
`-E` specifies that we want to use the https://www.techtarget.com/whatis/definition/POSIX-Portable-Operating-System-Interface[POSIX]-specified Extended Regular Expressions standard. The other major option for `grep` regex is `-P`, which specifies the PCRE2 regex flavor (used in languages like Perl and PHP). For the questions in this project, we won't mind if you use `-P` or `-E`, but it is our suggestion that you pick one and stick to it. `-P` is largely considered to be more powerful and flexible than `-E`, and also has full compatibility with https://regex101.com/[This regex tester], but again it is almost entirely a matter of personal preference and either of these should work equally well for the purposes of this project. - -For this question, we'll again have you use grep to search a basic file, this time exploring some of the other arguments you can pass to `grep`. Before attempting this question, please read the man page for `grep` either straight from the terminal or by visiting https://manpages.ubuntu.com/manpages/oracular/en/man1/grep.1posix.html[this website], and take note of some of the arguments for `grep` that may assist you in this question. - -.. In the file `/anvil/projects/tdm/data/olympics/athlete_events.csv`, on what line does the string "Bashir Abdullah Abdul Aziz" occur? -.. In how many lines of the file `/anvil/projects/tdm/data/olympics/athlete_events.csv` does the string "Mahmoud Ahmed Abdin" occur? - -[NOTE] -==== -For some additional help on each of the two sub-problems for this question, look specifically at the `-n` and `-c` options for `grep`, respectively. -==== - -.Deliverables -==== -- A `grep` for the line number for "Bashir Abdullah Abdul Aziz" in `/anvil/projects/tdm/data/olympics/athlete_events.csv` -- A `grep` for the number of occurrences of Mahmoud Ahmed Abdin in `/anvil/projects/tdm/data/olympics/athlete_events.csv` -==== - -=== Question 3 (2 pts) - -With the clear power of `grep` for searching a file realized, let's continue to expand on it by searching entire directories at once! Try running the below example, which uses the `-r` flag to `grep` to tell it to recursively search for our pattern throughout all the files in the given directory, and output the names of those files along with the line number of any matches (using `-n`) and the text that matched our regex (using `-o`). - -[source, Python] ----- -%%bash -grep -Prno "hello world" "/anvil/projects/tdm/data/techcrunch" ----- - -For this question, we want you to perform a very similar `grep`, this time to tell us how many times lines the phrase "SUGAR" appears on in the `/anvil/projects/tdm/data/icecream/combined/products.csv`, no matter what is around it. (Hint: `-r` is not necessary to complete this question.) - -You'll know you've correctly solved the question if your `grep` for `combined/products.csv` outputs `237`. - -.Deliverables -==== -- A `grep` to find the number of lines that contain "SUGAR" in the file `/anvil/projects/tdm/data/icecream/combined/products.csv` -==== - -=== Question 4 (2 pts) - -It's good that we know have the incredible power of regex at our disposal when searching through files. However, even with this ability it can still be difficult to search for specific information in files where each line could be hundreds or even thousands of characters long, which is very common in data files ending with `.csv`. Luckily, we have yet another extremely useful command that can help us with this: `cut`. 
- -`cut` allows us to, well, _cut_ a line into a bunch of pieces and select the piece we want. Try running the below code for a concrete example of this at work, and give https://explainshell.com/explain/1/cut[the man page for `cut`] a read before you attempt the below problems. - -[source, python] ----- -%%bash -# navigate to the appropriate directory -cd $SCRATCH - -# take a look at the first 2 rows of our data file -# (the first row is the column headers) -head /anvil/projects/tdm/data/youtube/USvideos.csv -n2 - -# store the first two rows of our data file to a file in our SCRATCH directory -head -n2 /anvil/projects/tdm/data/youtube/USvideos.csv > USvids_sample.csv - -# cut each line on commas (-d ","), then grab the 3rd field from each line (-f3) -cut -d "," -f3 USvids_sample.csv ----- - -Try changing the `-n2` in `head -n2 /anvil/projects/tdm/data/youtube/USvideos.csv > USvids_sample.csv` to `-n10` or `-n50` to show what happens when you use `cut` on the first 10 and 50 lines of `USvideos.csv`, respectively. Additionally, try changing the `-f3` argument to `cut` to `-f6` or `-f9` to see the `publish_time` or `likes` fields of the lines, respectively. - -For this question, we want you to write a command using `cut` to get the names of the first 50 channel titles in the file `/anvil/projects/tdm/data/youtube/USvideos.csv`. If you're having trouble starting this, we would suggest that you use the provided example above as a starting point. It is _very_ close to the solution already, and the first instance of `head` will print the names of all the comma-separated columns, telling you exactly which field you'll need to get using `cut`. - -.Deliverables -==== -- A `cut` to get the first 50 channel names out of `/anvil/projects/tdm/data/youtube/USvideos.csv` -==== - -=== Question 5 (2 pts) - -For this last question, we're going to combine our `cut` and `grep` skills, along with providing a small tease at a new tool we'll spend the next two projects learning about and using in depth: `piping`. For now, you don't have to fully understand piping and can just imagine it at its simplest: taking the output from the first command and giving it as an input to the second command. - -Let's look at the below example: - -[source, python] ----- -%%bash -cut -d "," -f3 /anvil/projects/tdm/data/youtube/USvideos.csv | grep -Ec "[Aa]pology" ----- - -Now let's deconstruct each of the above actions. First, we cut our line on `,` and select the third field, which we know from the previous question is the title of the Youtube video for each line. Next, we use `grep` to search for the pattern `[Aa]pology` case-insensitive, and return the count of lines with that pattern in the title field. As a result, we can see that there are 16 videos containing "Apology" in the title in our `USvideos.csv` data. That's less apology videos than I thought there would be! - -Your task for this question is rather basic, as we're asking you to combine both of the commands that you learned about in this project using a new tool we haven't yet discussed in-depth (piping). Modify the above example to search for how many times a channel named "HowToBasic" appears in the `channel_title` field of `/anvil/projects/tdm/data/youtube/USvideos.csv`. (Hint: Your code's answer to this should be 16.) - -As an added test, do the same, but this time for the `channel_title` "The Tonight Show Starring Jimmy Fallon". 
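If the shape of that `cut | grep` combination is still hazy, here is a minimal sketch of the same pattern on a few invented comma-separated rows; the field number and channel names below are placeholders, not the ones the question asks about:

[source, bash]
----
# three invented CSV rows in the form id,title,channel
printf "1,Some Title,ChannelA\n2,Other Title,ChannelB\n3,Third Title,ChannelA\n" > pipe_demo.csv

# keep only the 3rd (channel) field, then count the lines that match "ChannelA"
cut -d "," -f3 pipe_demo.csv | grep -c "ChannelA"

# clean up
rm pipe_demo.csv
----

Running this prints `2`, because two of the three invented rows have "ChannelA" in their channel field.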
- -.Deliverables -==== -- A `cut` and `grep` to count the number of times "HowToBasic" appears in `/anvil/projects/tdm/data/youtube/USvideos.csv` as a `channel_title` -- - A `cut` and `grep` to count the number of times "The Tonight Show Starring Jimmy Fallon" appears in `/anvil/projects/tdm/data/youtube/USvideos.csv` as a `channel_title` -==== - -== Submitting your Work - -Congratulations, with regular expressions in your toolset, you can now show your mom and dad a string like `\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b` and explain to them how what looks like complete nonsense is actually how we can search for emails (a famous and notoriously difficult problem solved with regex)! As we move forward in this semester's curriculum, continue to think about how regular expressions and pattern-matching incorporate into data science generally, and feel free to refer back to previous projects from TDM 101-102 and ask questions about how languages like Python and R might be utilizing regex behind-the-scenes for some of their built-in functions! - -.Items to submit -==== -- firstname-lastname-project3.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project3.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project3.adoc deleted file mode 100644 index d1db72450..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project3.adoc +++ /dev/null @@ -1,148 +0,0 @@ -= TDM 20100: Project 3 -- Pipelines - -**Motivation:** In the previous project, at each stage in our analysis, we saved the output to a file. A more efficient method is to take the output from one command and use it as the input to the next command. This is called a pipeline of bash commands. - -**Context:** Once we learn how to write bash commands in a pipeline, we can (more easily) use several bash commands in tandem. - -**Scope:** Pipelines in Bash - -.Learning Objectives: -**** -- Learn about the concept of bash pipelines, to use several bash commands in sequence. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. 
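As a warm-up before the questions, here is a minimal, self-contained sketch of the idea in the motivation above: each command's output feeds the next command's input, with no intermediate files. The three sample lines are invented:

[source, bash]
----
# three invented "origin,destination" lines stand in for a real data file
printf "IND,ORD\nLAX,SFO\nIND,ORD\n" | sort | uniq -c | sort -n
----

The output shows each distinct line preceded by the number of times it occurred, smallest count first, which is exactly the pattern reused throughout this project.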
- -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/` (airplane data) -- `/anvil/projects/tdm/data/election` (election data) -- `/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` (grocery store data) - -== Questions - -=== Question 1 (2 pts) - -In the previous project, on this collection of files: - -`/anvil/projects/tdm/data/flights/subset/[12]*.csv` - -we ran the following commands in bash: - -[source, bash] ----- -`cat` (to print the files; we did this in Project 2, Question 1) -`cut -d, -f17,18` (to extract the 17th and 18th fields, for the Origin and Destination columns) -`sort` (to get all of the same flight paths next to each other in the file) -`uniq -c` (to discover how many times that each flight path occurs) -`sort -n` (to numerically sort the number of times that the flight paths occur) -`tail` (to get the 10 most popular flight paths from the years 1987 to 2008 and the number of times that airplanes flew on each of these flight paths) ----- - -Now we can do all of this together, in one long line: - -[source, bash] ----- -cat /anvil/projects/tdm/data/flights/subset/[12]*.csv | cut -d, -f17,18 | sort | uniq -c | sort -n | tail ----- - -(To simplify things, we are not removing the head of each file.) - -[IMPORTANT] -==== -Please use 3 or 4 cores when working on this question. -==== - -++++ - -++++ - -.Deliverables -==== -- Show the 10 most popular flight paths from the years 1987 to 2008 and the number of times that airplanes flew on each of these flight paths. -==== - -=== Question 2 (2 pts) - -In the previous project, from this file: - -`/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` - -we discovered how many times that each of the `STORE_R` values appear in the file, using the following commands in bash: - -[source, bash] ----- -`cut -d, -f7` (to extract the `STORE_R` values from this file) -`sort` (to get all of the same `STORE_R` values next to each other in the file) -`uniq -c` (to discover how many times that each `STORE_R` value occurs) -`sort -n` (to numerically sort the number of times that the `STORE_R` values occur) ----- - -Now we can do all of this together, in one long line: - -[source, bash] ----- -cut -d, -f7 /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv | sort | uniq -c | sort -n ----- - -++++ - -++++ - -.Deliverables -==== -- List the number of times that each of the `STORE_R` values appear in the file. -==== - -=== Question 3 (2 pts) - -Using a pipeline to discover the 10 states in which the largest number of donations have been made (and the number of donations from each of these states), using the data stored in: - -`/anvil/projects/tdm/data/election/itcont*.txt` - -[HINT] -==== -The data can be extracted from the 10th field of the files. The symbol "|" is the delimiter. So the cut command should look like: `cut -d'|' -f10` -==== - - -.Deliverables -==== -- The 10 states in which the largest number of donations have been made (and the number of donations from each of these states) -==== - -=== Question 4 (2 pts) - -Modify your solution to Question 3 so that you extract both the city and the state (simultaneously) for each donation. In this way, you can discover the 10 city-and-state pairs in which the largest number of donations have been made (and the number of donations from each of these city-and-state pairs). 
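One detail worth noting for this question: `cut -f` accepts a comma-separated list of field numbers, so a single `cut` can pull the city and state out together. Here is a tiny sketch on invented pipe-delimited rows (the election files' real column numbers are not repeated here, so the `-f2,3` below is only a placeholder):

[source, bash]
----
# invented "|"-delimited rows standing in for donation records
printf "a|CHICAGO|IL\nb|DENVER|CO\nc|CHICAGO|IL\n" | cut -d'|' -f2,3 | sort | uniq -c | sort -n
----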
- -.Deliverables -==== -- The 10 city-and-state pairs in which the largest number of donations have been made (and the number of donations from each of these city-and-state pairs). -==== - -=== Question 5 (2 pts) - -Return to the analysis of the airline data. Modify your solution to Question 1 so that, *instead of* extracting the Origin and Destination airports, this time you can extract three columns: the year, month, and day of the flights. In this way, you can discover the 10 most popular days to fly from 1987 to 2008, i.e., the 10 dates on which the most flights occurred (and the number of flights on each of those 10 dates). - -.Deliverables -==== -- The 10 dates on which the most flights occurred (and the number of flights on each of those 10 dates). -==== - -== Submitting your Work - -Congratulations, with your understanding of pipelines, you are ready to leverage the strength of many bash commands in a sequence! Please feel encouraged to refer back to previous projects and ask questions (anytime) about how you can use bash for powerful and easy data analysis! - -.Items to submit -==== -- firstname-lastname-project3.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project4-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project4-teachinglearning-backup.adoc deleted file mode 100644 index 772db1134..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project4-teachinglearning-backup.adoc +++ /dev/null @@ -1,159 +0,0 @@ -= TDM 20100: Project 4 -- Intro to Piping - -**Motivation:** In the past 3 projects, you've built up a truly impressive set of CLI skills that have drastically increased your ability to work with files in the terminal. However, even with all these different tools and commands at your disposal, each step you take to process some data at this point is quite fragmented. If you wanted to `cut` and then use `grep`, your choice up until now (excepting the last question of project 3) was to simply save the results of your `cut` to a file and then use `grep` on that file. So far, that hasn't caused any major issues. If you had a file with million, billions, or even trillions of lines of data, though, you would have a serious problem. This project will begin our investigation into an elegant and efficient method of solving this fragmentation problem: piping. - -**Context:** This project will incorporate commands learned in all the previous ones, so reviewing projects 1-3 may help with completion of this project. 
- -**Scope:** Pipes, pipelines, data processing, Bash, GNU, CLI - -.Learning Objectives: -**** -- Understand the basic concept of piping and pipelines -- Use the pipe operator to connect two basic commands -- Build your first data processing pipeline -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/youtube/USvideos.csv` -- `/anvil/projects/tdm/data/flights/*` - -== Questions - -=== Question 1 (2 pts) - -In TDM 101 and 102, you learned about Python and R, two of the most popular programming languages when it comes to data science. As you go through this project, you will likely see many parallels to things we did in those classes using R functions or `pandas` and wondering why we're bothering to do some of the same things using a different language, bash. Bash gives us easy access to highly optimized tools dedicated to a specific purpose. Some of those specialized tools can run orders of magnitude faster while simultaneously being easier to write than similar code written in R or python. Bash is far more limited in the scope of what it can do than Python and R, however. Python and R provide us a high degree of flexibly when processing complex data like integers, datetimes, and more. Bash is often used to prepare your data for more complex processing with Python and R. It is important that your portfolio of skills as a data scientist is wide and varied so that you will be able to choose the correct tool for a given problem. - -To start this project, let's look at piping in its most basic sense. Run the following 2 code snippets, and observe their outputs: - -[source, python] ----- -%%bash -cd ~/../../anvil/projects/tdm/data/youtube -ls ----- - -[source, python] ----- -%%bash -cd ~/../../anvil/projects/tdm/data/youtube -ls | head -n3 ----- - -Now let's examine in detail what's going on. In the first code snippet, we navigate to our `youtube` directory and use `ls` to list the files in that directory. In the second code snippet, we do the same `ls` as before, but then we use the pipe operator `|` to send the _output_ of `ls` to become the _input_ for `head`, which results in us only getting the first three outputs from `ls`! - -For this question, we just want you to get used to using `|` by doing something almost identical to the above code snippet. Using `ls`, `head`, and `|`, output just the first 7 files/directories in `/anvil/projects/tdm/data/flights`. - -.Deliverables -==== -- An `ls` piped to a `head` to display the first 7 files/directories in `/anvil/projects/tdm/data/flights` -==== - -=== Question 2 (2 pts) - -For these next questions, we'll be having you apply skills you learned in previous projects, then introducing piping to establish a more efficient way to solve the same problems. - -To start, take a look at the first 5 rows of data for this project, which can be found at `/anvil/projects/tdm/data/youtube/USvideos.csv` (Hint: `head`), and store those 5 rows to a file in your `$SCRATCH` directory called `vid_head.csv`. - -Next, use `cut` on `vid_head.csv` to get just the title of each channel (the 4th field in the comma-separated data), and store those titles to a file in your `$SCRATCH` directory called `channels.csv`. - -Finally, use `cat` to print the contents of `channels.csv`. 
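If the `>` redirect step feels unfamiliar, here is a minimal sketch using throwaway file names of our own choosing (not the files the question asks you to create):

[source, bash]
----
cd $SCRATCH

# ">" sends a command's output into a file instead of printing it to the screen
printf "one\ntwo\nthree\n" > redirect_demo.txt
head -n2 redirect_demo.txt > first_two.txt
cat first_two.txt

# clean up
rm redirect_demo.txt first_two.txt
----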
- -[NOTE] -==== -Remember from previous projects, the `>` redirect operator can be used to put the output of a command into a given file. -==== - -.Deliverables -==== -- Command to store the first 5 rows of `/anvil/projects/tdm/data/youtube/USvideos.csv` to `$SCRATCH/vid_head.csv` -- Command to store just the channel titles from `$SCRATCH/vid_head.csv` to `$SCRATCH/channels.csv` -- Command to print the contents of `$SCRATCH/channels.csv` using `cat` -==== - -=== Question 3 (2 pts) - -Let's concisely summarize what we just did: - -- We got the `head` of our data, and stored it to a file -- We cut the data from that file, and stored it to a new file -- We printed the contents of that file - -While this works, it is by no means pretty nor efficient. Let's try to do the same thing, but this time using piping and without storing to any intermediary files along the way. Feel free to attempt this on your own now, or continue reading for some more in-depth instructions. - -First, we want to use `head` to get the first 5 rows of data from our `USvideos.csv` data. Next, we want to use `|` to pipe the output of `head` to the `cut` command we wrote in the last question. Remember, because we're piping the output of `head` to `cut`, we shouldn't include the name of an input file in our `cut` command, as the output from `head` will be used instead. If you did everything correctly, your results should match the output you got in the last problem, but this time without making a whole bunch of temporary files and breaking everything into unnecessarily discrete steps! - -[IMPORTANT] -==== -At this point, you may want to `rm $SCRATCH/vid_head.csv` and `rm $SCRATCH/channels.csv`. While it won't immediately break anything if you don't, it is always good practice to keep your directories tidy and only containing necessary data. -==== - -.Deliverables -==== -- The list of commands from Question 2, collapsed into one line using piping without storing to any files. -==== - -=== Question 4 (2 pts) - -Briefly, we should discuss the idea of _pipelines_ that has been underlying our actions thus far. A pipeline (in this context) is just another name for a series of commands, where the output of each command is piped to the input of another. This concept is one that is mirrored throughout all of data science, computer science, engineering, and just about any STEM-related field. In non-STEM fields, pipelines like manufacturing and product transport are prevalent. The takeaway from all of this is that the logical concept of a pipeline is extremely powerful, and it is an effective idea to try to translate problems that you encounter in data science into a pipeline of smaller steps towards your final solution. - -[NOTE] -==== -For a videogame that heavily emphasizes the importance of planned and efficient pipelines, the author of this project recommends https://factorio.com/[Factorio] as a personal favorite. -==== - -With our semantic understanding now established, let's return to our pipeline from the last problem. Copy your completed pipeline from Question 3, and add a call to `wc` so that it tells us the number of lines in our output. You should note that, for now, it should match the number of lines that `head` is outputting. - -Modify this pipeline such that you are getting the channel title for every line in `USvideos.csv`, then using `uniq` to get rid of any duplicates, and finally using `wc` to count the number of unique channels in our data.
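A common stumbling block with this step: `uniq` only collapses _adjacent_ duplicate lines, which is why a `sort` usually comes right before it (the note linked further below makes the same point). A minimal sketch with invented lines:

[source, bash]
----
# "b" appears twice, but not on adjacent lines
printf "b\na\nb\n" | uniq | wc -l          # prints 3: nothing was collapsed
printf "b\na\nb\n" | sort | uniq | wc -l   # prints 2: duplicates are adjacent after sorting
----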
- -[IMPORTANT] -==== -You may have noticed some strange quotations around certain titles in our data. While it is good to note that these are there, they will not affect your answer to this question and you don't have to deal with them right now. -==== - -Then run your pipeline again in another code cell, this time without using `uniq`. How many duplicates were there in our data? Is this surprising? - -[NOTE] -==== -For more information about how to use `uniq`, we would recommend you view https://explainshell.com/explain/1/uniq[its man page]. As `wc` was covered in Project 2, your work for that project may be helpful when trying to figure out this question. -==== - -.Deliverable -==== -- A pipeline that counts the number of unique channel titles in `/anvil/projects/tdm/data/youtube/USvideos.csv` -- A pipeline that counts the number of total channel titles in `/anvil/projects/tdm/data/youtube/USvideos.csv` -- The number of duplicate channel titles calculated based on the results of your two pipelines -==== - -=== Question 5 (2 pts) - -Using a slight variation on the pipeline you built in the last question, count how many tail numbers (the 11th field when you cut on ",") there are in `/anvil/projects/tdm/data/flights/2023.csv`, and compare it to how many unique ones there are. Notice that both answers are over 2 million. If we tried to do something like this in Python or R, simply loading the data normally takes about 10 seconds. Using bash, we can do the whole thing in less than 5! - -Finally, calculate how many duplicate tail numbers there are in the data, using the outputs of your two individual pipelines. - -.Deliverables -==== -- The number of tail numbers in `2023.csv` -- The number of unique tail numbers in `2023.csv` -- The number of duplicate tail numbers in `2023.csv` -==== - -== Submitting your Work - -With this question complete, you've successfully completed The Data Mine's introduction to piping and pipelines! While this project was syntactically quite simple, the concepts at play are hugely important and complex. As we continue on the next few projects building some more complex pipelines, continue to think about how we're breaking down large problems into groups of smaller ones, making the processing of the data both easier to perform and more readable. - -.Items to submit -==== -- firstname-lastname-project4.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project4.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project4.adoc deleted file mode 100644 index ed809b087..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project4.adoc +++ /dev/null @@ -1,177 +0,0 @@ -= TDM 20100: Project 4 -- Pattern matching in pipelines - -**Motivation:** We have begun to learn how to build pipelines in bash. 
Now we will integrate pattern matching into bash pipelines. - -**Context:** Pattern matching, used as part of a pipeline, is a very powerful technique. - -**Scope:** Pipelines and pattern matching in Bash - -.Learning Objectives: -**** -- Learn about using pattern matching in bash pipelines. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/` (airplane data) -- `/anvil/projects/tdm/data/election` (election data) -- `/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` (grocery store data) -- `/anvil/projects/tdm/data/beer/breweries.csv` (breweries data) -- `/anvil/projects/tdm/data/icecream/combined/reviews.csv` (ice cream review data) - -== Questions - -=== Question 1 (2 pts) - -In Project 1, question 4, we learned how to find lines of the file that contain the pattern "IND" (these are the flights that have origin or destination at the Indianapolis airport). - -In Project 3, question 1, we extracted all of the origin and destination airports, using one pipeline of bash commands. - -Revisit Project 3, question 1, but this time, use `grep IND` early in your bash pipeline, so that your results *only* show flights that either have origin or destination at Indianapolis. The goal is to show the 10 most popular flight paths to-or-from Indianapolis (during the years 1987 to 2008), and the number of times that airplanes flew on each of these flight paths. - -[HINT] -==== -You want the top 10 most popular flight paths that are either to Indianapolis or from Indianapolis. The top 2 most popular flight paths that are either to Indianapolis or from Indianapolis (which should be the last two lines of your ten lines of output) are: - -[source, bash] ----- - 76554 ORD,IND - 77720 IND,ORD ----- - -In this question, you will find all 10 such flight paths. Besides ORD (which is Chicago O'Hare), the other popular flight paths include these airports: ATL,DFW,DTW,MSP,STL. -==== - -.Deliverables -==== -- Show the 10 most popular flight paths to-or-from Indianapolis (during the years 1987 to 2008), and the number of times that airplanes flew on each of these flight paths. (We already gave you 2 of these 10 in the hint!) -==== - -=== Question 2 (2 pts) - -Revisit Project 3, Question 2, to study the grocery store data, from this file: - -`/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` - -This time, limit your analysis to only the `EAST` stores, using the `grep` command. Use the `cut` command to extract the `PURCHASE_` date, and see how many times each `PURCHASE_` date occurs. What are the 10 most "popular days" for shopping at the `EAST` stores? "Popular days" are measured by how many times that each date appears in the file. (You can ignore the `SPEND` and `UNITS` columns in this question; only pay attention to how many times that each date occur at the `EAST` stores.) Be sure to list the number of times that each date appears. - -[HINT] -==== -The two most popular dates at the `EAST` stores are: - -[source, bash] ----- - 8730 23-DEC-16 - 8889 23-DEC-17 ----- -==== - -.Deliverables -==== -- What are the 10 most "popular days" for shopping at the `EAST` stores? 
-==== - -=== Question 3 (2 pts) - -Consider the election files: - -`/anvil/projects/tdm/data/election/itcont*.txt` - -Using the `cut` command with '|' as the delimiter, cut out the 8th field, which are the names of the donors. (Be sure to cautiously use `head` whenever you look at the output from a pipeline, so that you do not print millions of rows of output.) - -Now add to the pipeline another `cut` command with a comma as the delimiter, cut out the 1st field, which will be the family name of the donors. (Again, be sure to carefully use `head` whenever you look at pipeline output.) - -Finally, finish the pipeline, to get the 10 most popular last names of donors (be sure to print how many times each of these 10 most popular last names occur). - -[TIP] -==== -You do *not* need to use `grep` on this question. -==== - - -[HINT] -==== -The two most popular last names of donors are: - -[source, bash] ----- -1225266 JOHNSON -1654933 SMITH ----- -==== - - -.Deliverables -==== -- The 10 most popular last names of donors (and the number of times each of these 10 most popular last names occurs). -==== - -=== Question 4 (2 pts) - -In Project 3, Question 4, you extracted the city and state from the donation data. In this question, *instead of studying the election data*, consider (instead) the breweries data in this file: - -`/anvil/projects/tdm/data/beer/breweries.csv` - -Discover the 10 city-and-state pairs in which the largest number of breweries are located (and the number of breweries in each of these city-and-state pairs). - -[TIP] -==== -You do *not* need to use `grep` on this question. -==== - -[HINT] -==== -The two most popular city-and-state pairs for breweries are: - -[source, bash] ----- - 499 Philadelphia,PA - 512 Chicago,IL ----- -==== - - -.Deliverables -==== -- The 10 city-and-state pairs in which the largest number of breweries are located (and the number of breweries in each of these city-and-state pairs). -==== - -=== Question 5 (2 pts) - -In this ice cream reviews file: - -`/anvil/projects/tdm/data/icecream/combined/reviews.csv` - -We can use `grep salty` to see that 335 lines of this file have the word "salty". By default, `grep` pays attention to the case, so (for instance) these 335 occurrences do not include "Salty" or "SALTY" or "saLTy". If we want to search without paying attention to the case, we will get more occurrences of the pattern. In this case, `grep -i salty` allows us to see that 350 lines of this file have the word "salty" when we do not pay attention to the case. The "-i" stands for a case-insensitive search. - -Similarly, there are 1972 lines that include the exact pattern "sweet", but if use a case-insensitive search, there are 2080 lines that include the pattern "sweet" without paying attention to case. - -How many lines of the file include the exact pattern "chocolate"? - -How many lines of the file include the pattern "chocolate" as a case-insensitive search, in other words, without paying attention to the case? - -.Deliverables -==== -- The number of lines of the file that include the exact pattern "chocolate". -- The number of lines of the file that include the pattern "chocolate" as a case-insensitive search, in other words, without paying attention to the case. -==== - -== Submitting your Work - -You now have some experience using pattern matching inside pipelines of bash commands! Your skills from one project to the next are growing! Please refer back to previous projects, and ask questions anytime that you need advice or help! 
- -.Items to submit -==== -- firstname-lastname-project4.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project5-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project5-teachinglearning-backup.adoc deleted file mode 100644 index 779c82114..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project5-teachinglearning-backup.adoc +++ /dev/null @@ -1,169 +0,0 @@ -= TDM 20100: Project 5 -- Pipelines, Continued - -**Motivation:** In this project, we'll continue to cement the concept of pipelines into your toolset. We will gradually work up to more complex examples of piping and pipelines, culminating in complicated pipelines that provide loads of valuable data processing at once, along with some that begin to perform some data analysis! - -**Context:** This project will incorporate commands learned in all the previous ones, so reviewing projects 1-4 may help with completion of this project. - -**Scope:** Pipes, pipelines, data processing, Bash, GNU, CLI - -.Learning Objectives: -**** -- Build basic pipelines that reformat and analyze data -- Build complex pipelines that process data into appropriate formats for analysis -- Perform basic data analysis using pipelines in `bash` -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/youtube/USvideos.csv` -- `/anvil/projects/tdm/data/flights/*` - -== Questions - -=== Question 1 (2 pts) - -In the last project, we learned about the `|` pipe operator and began building our first pipelines to process data. In this project we will briefly review our work from the previous project before building some more complex pipelines. - -First, let's get familiar with one of the two datasets we'll be working with in this project. Using a bash command, print the first three rows of `/anvil/projects/tdm/data/youtube/USvideos.csv`. - -Then, piping the first 3 lines of the file into a `cut`, isolate just the `publish_time` field from the first 3 lines of the file. 
If done correctly, the output from your `cut` pipeline should look like this: - -[source, bash] ----- -publish_time -2017-11-13T17:13:01.000Z -2017-11-13T07:30:00.000Z ----- - -.Deliverables -==== -- Printed first 3 lines of the `USvideos.csv` file -- Isolated `publish_time` field from the first 3 lines of the file -==== - -=== Question 2 (2 pts) - -Purely from looking at the output of your code from Question 1, you likely have a good idea of what the `publish_time` field means. However, there are two oddities that may be throwing you off: `T` and `Z`. In order to store and retrieve data efficiently, _conventions_ are established that put forward rules about the format for storing data of different types. This datetime data is stored using the _ISO 8601_ standard, where the `T` is meant to mark the end of date data and the beginning of time data and the ending character (`Z`, in our case) denotes the time zone that the time data is being provided in. The `Z` in our case means that our provided time zone is UTC. - -[NOTE] -==== -The `Z` that represents UTC comes from "Zulu" time being the military name for UTC. -==== - -For this question, copy/paste your pipeline from the previous question. Then, using another `cut`, isolate just the dates for the first 10 lines of the file (including the column header line). If done correctly, your code's output should look like this: - -[source, bash] ----- -publish_time -2017-11-13 -2017-11-13 -23 -2017-11-13 -2017-11-12 -2017-11-13 -2017-11-12 -2017-11-12 -2017-11-13 ----- - -[IMPORTANT] -==== -You should take notice of the seemingly random `23` in our code's output. If we take a look at the line that this `23` comes from, we can see that the `title` of the video is `Racist Superman | Rudy Mancuso, King Bach & Lele Pons` **which contains a comma**! As we can see, performing a `cut` using a comma as our delimiter sometimes doesn't get the field we want or expect it to! In languages like Python and R, this is often handled behind-the-scenes, but on the CLI we will have to think about these issues and handle them more intentionally. In the next few questions, we'll do just that. -==== - -.Deliverables -==== -- The first 10 lines of the `publish_time` field from `USvideos.csv` -==== - -=== Question 3 (2 pts) - -Okay, so we've identified an issue with our current approach of isolating the dates from our current pipeline. Should we just steamroll ahead and remove all the errors in post? _Definitely not_!! Instead, let's refer back to a tool we learned about in project 3: `grep`! - -For the purposes of this question, you can assume that the string you are looking for in a given line is of the format `dddd-dd-ddTdd:dd:dd.dddZ` where `d` can be any digit between 0-9. Using `grep` and a regular expression of your own design, isolate the `publish_time` field from the first 10 lines of the file. - -As a short reminder, https://regex101.com[regular expression building website] can be extremely helpful when designing and creating a regular expression, and we encourage you to use them throughout your work. - -[NOTE] -==== -The `-o` flag to `grep` will print only the parts of a line that match a given pattern. Additionally, we recommend you tackle this problem using the Perl flavor of regex `-P`, but any working pattern will be accepted for full credit. 
-==== - -If done correctly, your output should look like: - -[source, bash] ----- -2017-11-13T17:13:01.000Z -2017-11-13T07:30:00.000Z -2017-11-12T19:05:24.000Z -2017-11-13T11:00:04.000Z -2017-11-12T18:01:41.000Z -2017-11-13T19:07:23.000Z -2017-11-12T05:37:17.000Z -2017-11-12T21:50:37.000Z -2017-11-13T14:00:23.000Z ----- - -If we use `wc -l` to count the total lines in the file (40950) and compare that to the total number of lines from running our `grep` on the file as a whole, we should see a total of 40949 matches. The one line with no match is the header line! Using your `grep` on the entire file, pipe the output into a `wc` to count the number of lines ouput. If you get an answer other than 40949, something is amiss. - -.Deliverables -==== -- A `grep` regular expression to retrieve the dates -- A `wc` counting the number of lines that matched our pattern -==== - -=== Question 4 (2 pts) - -We've now established a wicked-fast and tested way to extract every single `publish_time` field from our file. Let's now begin to analyze our isolated data! - -By building on the `grep` pipeline you created in the previous question, isolate just the dates from the `grep` output (Hint: `cut` with a delimiter of "T"). Then, use `cut` again to isolate just the months (the second field in the date). Finally, use `sort` and then `uniq` to count the number of occurrences of each month in our data. Which month saw the most videos published? Which month had the least videos published? Write your answers in a markdown cell. - -[NOTE] -==== -https://stackoverflow.com/questions/6044539/generating-frequency-table-from-file[This stackoverflow post] will help you figure out how `sort` can be used with `uniq` to be sure that you are getting a concise frequency table for each month in the data. -==== - -.Deliverables -==== -- Which months had the most and least videos published, and the code used to find this answer -==== - -=== Question 5 (2 pts) - -This last question serves as a build onto the complexity of the previous one, and should largely be copy-paste of your already-existing pipeline. Instead of the month that had the most videos published, we're interested in the specific day of the year that had the most videos published. This means that now, we need to take both the month and day of the publish date into account when finding our answer (Hint: `cut` can select multiple fields!). - -Building on your pipeline from the previous question, figure out what day of the year had the most videos published, and how many videos were published on that day. Put your final answers in a markdown cell. - -[NOTE] -==== -https://stackoverflow.com/questions/13690461/using-cut-command-to-remove-multiple-columns[This stackoverflow post] may help you understand how `cut` can be used to remove multiple fields from a string at the same time. As you can see, `cut` is an extremely powerful tool! -==== - -.Deliverables -==== -- The day and number of videos published on the day of the year with the most videos published in our data. -==== - -== Submitting your Work - -Before you submit this project, take a look back through your work. You started out this week's work with a simple `cut`, and by the end had built up a pipeline that created a frequency table using nothing but Bash commands that is able to almost instantly run through a file that is almost 50 thousand lines long! 
That is impressive, and it clearly demonstrates one of the main strengths of pipelines: they allow you to systematically break down a problem and solve it in steps, as opposed to approaching it all at once. - -In next week's project, we'll begin talking about one of the most powerful multi-tools in bash: `awk`. We'll recreate similar pipelines as in this project, this time using a single tool. Further, we will show how even with a powerful tool like `awk`, piping is still useful and can further improve our power of analysis. - -As always, your helpful team of TAs will be here to assist you through difficulties you may have. We hope you have a great rest of your week. - -.Items to submit -==== -- firstname-lastname-project5.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project5.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project5.adoc deleted file mode 100644 index cbf194402..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project5.adoc +++ /dev/null @@ -1,160 +0,0 @@ -= TDM 20100: Project 5 -- More practice with pipelines - -**Motivation:** We continue to practice how to use pipelines. - -**Context:** Pipelines enable us to work on data across many files. - -**Scope:** Pipelines and pattern matching in Bash - -.Learning Objectives: -**** -- Learn about using pattern matching in bash pipelines. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/` (airplane data) -- `/anvil/projects/tdm/data/election` (election data) -- `/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` (grocery store data) -- `/anvil/projects/tdm/data/beer/breweries.csv` (breweries data) -- `/anvil/projects/tdm/data/icecream/combined/reviews.csv` (ice cream review data) - -== Questions - -=== Question 1 (2 pts) - -For this question, use the flight files found here: - -`/anvil/projects/tdm/data/flights/subset/[12]*.csv` - -a. Make a list of all of the values that appear in column 9, namely, the UniqueCarrier values. (These are abbreviations for the airlines.) For each such airline, list the number of flights on each airline. The list is short enough that you can display the full list of airlines and the number of flights on each airline. - -b. Make a list of all of the TailNum values that appear in column 11 BUT do not print this list! Just print the number of (distinct) TailNum values. - -[TIP] -==== -You can just use `uniq` instead of `uniq -c` for part b, since you do not need to keep track of the counts. 
-==== - -[WARNING] -==== -There are more than 10,000 TailNum values, so please *do not* print the full list of TailNum values. You only need to print the *total number of (unique) TailNum values* that occur. -==== - -[NOTE] -==== -Note: If you ever looked at an airplane (or a picture of an airplane), you might have noticed that there is one TailNum painted onto the tail of each airplane, which uniquely identifies that airplane. So, in part b, you are printing the number of airplanes that have flown in the United States! -==== - - -.Deliverables -==== -- a. Print a list of all of all UniqueCarrier values and how many times that each UniqueCarrier appears. -- b. Print the number of (distinct) TailNum values that occur (do not print the list itself; there are more than 10,000 values!) -==== - -=== Question 2 (2 pts) - -In the grocery store data: - -`/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` - -how many (distinct) PRODUCT_NUM values appear? - -[WARNING] -==== -You only need to print the number of distinct PRODUCT_NUM values. Please do NOT print the entire list. There are more than 100,000 such values! -==== - -.Deliverables -==== -- Print the number of (distinct) PRODUCT_NUM values that appear. (Do not print the list itself!) -==== - -=== Question 3 (2 pts) - -Consider the election files: - -`/anvil/projects/tdm/data/election/itcont*.txt` - -There are more than 200 million donations altogether (i.e., 200 million lines of data; one donation per line). But the number of committee IDs (which are in column 1) is much smaller. - -a. How many (distinct) committee IDs appear altogether? - -b. Which committee ID received the largest number of donations? How many donations did this committee ID receive? (Do not worry about the monetary amounts. Just keep track of the number of donations. There is one donation per line.) - - -.Deliverables -==== -- a. The number of (distinct) committee IDs that appear. -- b. The committee ID that received the largest number of donations, and how many donations that this committee ID received. -==== - -=== Question 4 (2 pts) - -In the breweries data in this file: - -`/anvil/projects/tdm/data/beer/breweries.csv` - -use the `grep` command to print all 22 lines of data (everything on each line) corresponding to Lafayette or West Lafayette (Indiana). - -.Deliverables -==== -- The full contents of the lines of data corresponding to Lafayette or West Lafayette (Indiana). (Please print the whole line of data each time, so you are printing 22 lines of data!) -==== - -=== Question 5 (2 pts) - -In this ice cream reviews file: - -`/anvil/projects/tdm/data/icecream/combined/reviews.csv` - -change each space into a newline, for instance, using any of the methods here: - -`https://askubuntu.com/questions/461144/how-to-replace-spaces-with-newlines-enter-in-a-text-file` - -It might be easiest, for instance, to embed: - -[source, bash] ----- -tr ' ' '\n' ----- - -into your pipeline, which turns each space into a newline. - -Then sort the resulting lines, which will now have only 1 word per line, and find the 25 words that occur most often in the file, along with the number of occurrences of each such word. Hint: The 5 most popular words are: - -[source, bash] ----- - 18688 ice - 19480 a - 26848 and - 29090 I - 34937 the ----- - - -.Deliverables -==== -- The 25 words that occur most often in the file, along with the number of occurrences of each such word. -==== - -== Submitting your Work - -You are now very familiar with bash pipelines! 
BUT you can still ask us questions anytime, if you need advice or help! - -.Items to submit -==== -- firstname-lastname-project5.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project6-teachinglearning-backup.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project6-teachinglearning-backup.adoc deleted file mode 100644 index 5798c10a9..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project6-teachinglearning-backup.adoc +++ /dev/null @@ -1,139 +0,0 @@ -= TDM 20100: Project 6 -- Awk Everything! 1 - -**Motivation:** As you've seen so far, `bash` has a wide variety of commands that enable us to do different things, and we can use pipes to connect those commands and perform whole loads of data processing in one big step. However, conciseness _is_ a virtue. In this project we'll start learning about `awk`, the bash multi-tool capable of performing the work of tons of commands all by itself. By the end of the next few weeks, you'll be able to do entire pipelines worth of work in just one `awk`! - -**Context:** This project will relate `awk` concepts back to previously learned commands, and at the very least a basic knowledge of filesystem navigation and regex will be needed. - -**Scope:** `awk`, data processing, Bash, GNU, CLI - -.Learning Objectives: -**** -- Learn the general structure of a call to `awk` -- Construct your first basic `awk` -- Use `awk` to print common file information -- Use `awk` to print specific parts of files and accomplish multiple commands worth of work at once -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` - -== Questions - -=== Question 1 (2 pts) - -To begin learning about `awk`, its important to conceptualize the two fundamental units that `awk` operates on: _records_ and _fields_. - -In the context of 101-102 concepts, you can think of a _record_ as one row of data in a file that is "tidy". Alternatively, a record can be thought of as one instance of the things we are tracking in our data. For example, in the YouTube data we've previously used a record corresponds to a single video. - -A _field_ can be thought of as a singular aspect or detail corresponding to a single record. For example, in the YouTube data there are fields like `comment_count`, `publish_time`, `title`, and `channel_name`. 
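To make the two terms concrete, here is a tiny hedged sketch on a single invented comma-separated record; `NF` (the number of fields in the current record) and `$2` (the second field) are standard `awk` built-ins that the next questions introduce properly:

[source, bash]
----
# one invented record with three comma-separated fields
echo "2017-11-13,Some Title,ChannelA" | awk -F, '{print "fields in this record:", NF; print "second field:", $2}'
----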
- -Use `head` to print the first 3 lines of the `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` file. What are the names of the different _fields_ for this data? - -Using `wc`, figure out how many records exist in this data. You can consider that data 'tidy' (meaning each line is a complete record). Remember that the first line is column headers and thus doesn't count as its own record. - -.Deliverables -==== -- The names of the fields for the Iowa liquor sales data -- The number of records in the Iowa liquor sales data -==== - -=== Question 2 (2 pts) - -As a multi-purpose tool for pattern-matching and processing, `awk` needs to have the ability to structure complex processing and pattern-matching logic in a readable, concise way. - -The general structure of a call to `awk` is as follows: - -[source, bash] ----- -# general structure of a call to awk -awk [FLAGS] '(pattern) {action}' INPUTFILE ----- - -We'll continue to elaborate on this throughout the rest of the project, but for now you can think of `[FLAGS]` as the typical command line flags you've been using with different functions already so far. `(pattern)` is the pattern that must be matched in order to perform `{action}`. Multiple patterns and actions can be included in each call to `awk`, as long as they reside between the two single-quotes in the call. Finally, a path to the desired input file must be provided at the end of the statement. - -In addition to basic structure, `awk` has a lot of special syntax to make things easier/shorter to write. https://man7.org/linux/man-pages/man1/awk.1p.html[The manpages] are always a good place to start, but we'll also walk you through the basics in the following few questions. - -First, try running the below code and observing its output: - -[source, bash] ----- -cd /anvil/projects/tdm/data/iowa_liquor_sales -awk -F ';' '{print $1" | "$5" | "$6} NR==5{exit}' iowa_liquor_sales_cleaner.txt ----- - -There's a lot here, so let's break it down more simply (Remember: https://explainshell.com/[explainshell.com] is a great resource for understanding terminal commands!). - -First, we call `awk` and then use `-F ";"` to specify that the separator for different fields is a semicolon and not whitespace (the default field delimiter for `awk`). - -For our first _action_ block, we don't provide any condition, meaning that `awk` will perform the stuff in the first braces on every record/line of our file. In this case, we tell it `'{print $1" | "$5" | "$6}'`, which as you likely could tell prints the first, fifth, and sixth field for each record. It also adds " | " between each record for readability. - -We then add another condition-command pair, this time with the condition `NR==5` where NR stand for _number of records_. When the number of records we have processed reaches 5 (the fifth line of the file in this case), we perform the action `{exit}`, which ends our `awk` program. - -Using the above code as a guide, write a similar `awk` statement that prints the 2nd, 4th, and 21st fields for the first 10 records of our file. If done correctly, your output should resemble the below text. 
- -[source, txt] ----- -Date, Store Name, Bottles Sold -08/16/2012, CVS PHARMACY #8443 / CEDAR RAPIDS, 3 -09/10/2014, SMOKIN' JOE'S #6 TOBACCO AND LIQUOR, 12 -04/10/2013, HY-VEE FOOD STORE / MOUNT PLEASANT, 2 -08/30/2012, AFAL FOOD & LIQUOR / DES MOINES, 3 -01/26/2015, HY-VEE FOOD STORE #5 / CEDAR RAPIDS, 4 -07/19/2012, SAM'S MAINSTREET MARKET / SOLON, 12 -11/20/2013, BIG G FOOD STORE, 6 -10/23/2013, DECORAH MART, 12 -05/02/2012, ECON-O-MART / COLUMBUS JUNCTION, 3 ----- - -.Deliverables -==== -- The first, second, and twenty-first fields for the first 10 records in our data -==== - -=== Question 3 (2 pts) - -Show BEGIN and END usage, have them do an example of each - -.Deliverables -==== -- Ipsum lorem -==== - -=== Question 4 (2 pts) - -Show compound usage of BEGIN, END, and potentially also during to put together a full command, then have students slightly modify - -.Deliverables -==== -- Ipsum lorem -==== - -=== Question 5 (2 pts) - -Demonstrate basic conditionals in awk, then apply - -.Deliverables -==== -- Ipsum lorem -==== - -== Submitting your Work - -This is where we're going to say how to submit your work. Probably a bit of copypasta. - -.Items to submit -==== -- firstname-lastname-project6.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project6.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project6.adoc deleted file mode 100644 index 5c4f8c834..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project6.adoc +++ /dev/null @@ -1,212 +0,0 @@ -= TDM 20100: Project 6 -- Awk - -**Motivation:** As you've seen so far, `bash` has a wide variety of commands that enable us to do different things, and we can use pipes to connect those commands and perform whole loads of data processing in one big step. However, conciseness _is_ a virtue. In this project we'll start learning about `awk`. By the end of the next few weeks, you will be able to do entire pipelines worth of work with just one `awk`! - -**Context:** This project will relate `awk` concepts back to previously learned commands, and at the very least a basic knowledge of filesystem navigation and regex will be needed. - -**Scope:** `awk`, data processing, Bash, GNU, CLI - -.Learning Objectives: -**** -- Learn the general structure of a call to `awk` -- Construct your first basic `awk` -- Use `awk` to print common file information -- Use `awk` to print specific parts of files and accomplish multiple commands worth of work at once -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. 
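Before diving in, here is a small, self-contained sketch of the `BEGIN`/main/`END` structure of an `awk` program, run on a few invented comma-separated rows rather than on the project datasets:

[source, bash]
----
# invented data in the form name,amount
printf "a,10\nb,2.5\na,4\n" > awk_demo.csv

awk -F, '
    BEGIN { print "summing column 2..." }   # runs once, before any data is read
    { total = total + $2 }                  # runs on every record (row)
    END   { print "total:", total }         # runs once, after the last record
' awk_demo.csv

# clean up
rm awk_demo.csv
----

On the invented rows above this prints a total of 16.5; the same three-part structure carries through every question in this project.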
- -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` (Iowa liquor sales data) -- `/anvil/projects/tdm/data/election` (election data) -- `/anvil/projects/tdm/data/beer/reviews.csv` (beer reviews data) -- `/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` (grocery store data) -- `/anvil/projects/tdm/data/flights/subset/` (airplane data) - -Example 1: - -++++ - -++++ - -Example 2: - -++++ - -++++ - -Example 3: - -++++ - -++++ - -Example 4: - -++++ - -++++ - -Example 5: - -++++ - -++++ - -Example 6: - -++++ - -++++ - - -== Questions - -=== Question 1 (2 pts) - -To begin learning about `awk`, it is important to conceptualize the two fundamental units that `awk` operates on: _records_ and _fields_. - -You can think of a _record_ as one row of data in a file. - -A _field_ can be thought of as a singular element of data (within a row of data). - -Simple awk files that run on comma-separated data look this this: - -[source, bash] ----- -awk -F, 'BEGIN{action to run before processing the data} {action to perform on each row of data} END{action to run after processing the data}' mydatafile.txt ----- - -but many awk programs do not have the beginning or ending section, i.e., they just look like this: - -[source, bash] ----- -awk -F, '{action to perform on each row of data}' mydatafile.txt ----- - -The action to run before processing the data only runs once, before `awk` ever touches the data. It is helpful, for example, if you want to print a header before running your output. (The BEGIN section is often omitted.) - -The action to run after processing the data will print after the data is all processed. It is helpful if you want to print out some calculations that you ran on the data set. (The END section is often omitted.) - -The main action of an awk program will run on each and every line of the data. - -If the data is not comma separated, but (instead) is tab-separated, then we use `-F\t` instead of `-F,` (as an example). Or if the data has `|` between the pieces of data, then we use `-F'|'` instead. Or if the data has `;` between the pieces of data, then we use `-F';'` instead. - -The Sales values (given in Dollars) are available in field 22 of this data set: - -`head /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | cut -d';' -f22` - -Use `awk` (without using `cut`!) to add the total Sales values from the entire file. - -[NOTE] -==== -The total Sales values are almost 4 billion dollars. The `e+09` that shows up in the answer means scientific notation with nine zeros, i.e., multiply the answer by 1,000,000,000, i.e., by 1 billion. -==== - -[WARNING] -==== -This question might take 2 or 3 minutes to run. -==== - -.Deliverables -==== -- Print the total Sales values from the entire file. -==== - -=== Question 2 (2 pts) - -2a. Very similarly to question 1, use `awk` (without using `cut`!) to sum the total dollar amounts of the donations (altogether) given in column 15 of this file: `/anvil/projects/tdm/data/election/itcont1980.txt` - -2b. Now sum the total dollar amounts of the donations (altogether) given in column 15 of all of the `itcont*.txt` files (altogether). - -[NOTE] -==== -2a. The total for your answer should be a little more than 200 million dollars. - -2b. The total for your answer should be a little more than 62 billion dollars. -==== - -[WARNING] -==== -Question 2b might take 30 to 60 minutes to run. -==== - -.Deliverables -==== -- a. 
Print the sum of the total dollar amounts of the donations in the 1980 election data. -- b. Print the sum of the total dollar amounts of the donations in all of the election data files of the form `itcont*.txt`. -==== - -=== Question 3 (2 pts) - -Consider the data in the file `/anvil/projects/tdm/data/beer/reviews.csv` - -Notice that the number of columns on each line varies, because each line has a varied number of commas. Also note that the number of fields on each line is `NF` and therefore the *last* field on each line is `$NF`. Use this information to add all of the values in the `score` column, in a variable called `totalscores`. Also, at the same time, add the number of lines, in a variable called `totallines`. Finally, at the end, print the ratio of `totalscores` and `totallines`, so that we have the overall average score across the entire data set. - - -.Deliverables -==== -- Print the overall average score across the entire data set. -==== - - -=== Question 4 (2 pts) - -Consider the data in the file `/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` - -Use `grep SOUTH` and also `awk` (not `cut`) to sum the total amount of values in the `SPEND` column (corresponding to lines with `SOUTH` for the `STORE_R` value). Then do this again (in a separate bash pipeline) for the `EAST` stores, and then do it again (in a third bash pipeline) for the `WEST` stores, and finally in a fourth bash pipeline for the `CENTRAL` stores. - -[NOTE] -==== -In the future, we will learn how to do all of this with one line of `awk` but for now it is OK to do this in four separate bash pipelines. -==== - - -.Deliverables -==== -- Print the sum of the `SPEND` column values corresponding to each of the four store regions. This will take four separate bash pipelines, one Jupyter Lab cell each. -==== - - -=== Question 5 (2 pts) - -Consider the data in the file `/anvil/projects/tdm/data/flights/subset/1990.csv` - -Use `awk` for formatted output, like this: - -`awk -F, '{print "flights from "$17" to "$18;}'` - -incorporated into a pipeline (with `sort | uniq -c | sort -n | tail`) from the previous projects, to find the 10 most popular flight paths in 1990 and the number of flights on those paths. Hint: The top two flight paths should be: - -[source, bash] ----- - 25779 flights from LAX to SFO - 26134 flights from SFO to LAX ----- - - -.Deliverables -==== -- Print the 10 most popular flight paths in 1990 and the number of flights on those paths, with the nice formatting described above. -==== - - -== Submitting your Work - -We are just starting to get familiar with `awk` so please feel welcome to ask for clarifications and help anytime. This is a powerful tool that will enable you to (pre-)process data and to analyze data very, very quickly. It is also a wonderful tool to incorporate in `bash` pipelines. - - -.Items to submit -==== -- firstname-lastname-project6.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. 
- -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project7.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project7.adoc deleted file mode 100644 index 53d73b175..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project7.adoc +++ /dev/null @@ -1,190 +0,0 @@ -= TDM 20100: Project 7 -- Awk - -**Motivation:** Now we will learn about associative arrays in awk, which allow you to (for instance) add entries in one column, grouped according to the entries in another column. - -**Context:** We learn how to use associative arrays in awk. - -**Scope:** associative arrays in `awk` - -.Learning Objectives: -**** -- Use associative arrays in `awk` -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` (Iowa liquor sales data) -- `/anvil/projects/tdm/data/election/itcont1980.txt` (election data) -- `/anvil/projects/tdm/data/beer/reviews_sample.csv` (beer reviews data) -- `/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` (grocery store data) -- `/anvil/projects/tdm/data/flights/subset/` (airplane data) - -Example 1: - -++++ - -++++ - -Example 2: - -++++ - -++++ - -Example 3: - -++++ - -++++ - - - -== Questions - -=== Question 1 (2 pts) - -In Project 6, Question 1, we added all of the sales amounts in dollars from column 22 of this file of liquor sales in Iowa: - -`/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` - -Now, we can (instead) use associative arrays in awk to add the sales amounts in dollars for each store (from column 4): Grouping the sales amounts according to the store, add the sale amounts for each store. For your output, on each line, print the dollar amount for the sales of each store, and the store too. Sort the output in numerical order. It is OK to just print the tail of the result, so that you will print 10 lines of output. - -[NOTE] -==== -The bottom 4 lines are given below. -==== - -[NOTE] -==== -Because these numbers are really large, instead of using `sort -n` you need to use `sort -g` which can handle numbers in scientific notation. These numbers are tens and hundreds of millions. -==== - -[source,text] ----- -4.13889e+07 at SAM'S CLUB 8162 / CEDAR RAPIDS -4.87686e+07 at HY-VEE WINE AND SPIRITS / IOWA CITY -1.08328e+08 at CENTRAL CITY 2 -1.23313e+08 at HY-VEE #3 / BDI / DES MOINES ----- - -.Deliverables -==== -- Print the 10 largest sales amounts according to the stores. (The 4 largest are shown above.) -==== - - -=== Question 2 (2 pts) - -Back in Project 6, Question 2, we added all of the donation dollar amounts from the 1980 election data: - -`/anvil/projects/tdm/data/election/itcont1980.txt` - -Now, instead, let's add the donation dollar amounts in each state: Grouping the transaction amounts according to the state, add the donation dollar amounts for each state. 
For your output, on each line, print the dollar amount and the state, and sort the output in numerical order. It is OK to just print the tail of the result, so that you will print 10 lines of output. - -[NOTE] -==== -The bottom 4 lines are given below. -==== - -[source, bash] ----- -17916669 -18673301 NY -24085171 CA -24472610 TX ----- - -[NOTE] -==== -The line 17916669 without a state corresponds to the sum of the donation amounts where the state was blank! -==== - -.Deliverables -==== -- Print the 10 largest total dollar amounts of donations, and the analogous states where those donations were made. (The 4th largest one is a blank state and that is OK.) -==== - - - -=== Question 3 (2 pts) - -In this *sample* file of beer reviews `/anvil/projects/tdm/data/beer/reviews_sample.csv` - -Consider the mean beer scores on each date. - -Find the three dates on which the mean score is a 5. - -[NOTE] -==== -This week, we are using only the sample file: `/anvil/projects/tdm/data/beer/reviews_sample.csv` - -You *DO NOT* need to use the (much larger) file from last week: `/anvil/projects/tdm/data/beer/reviews.csv` -==== - -[NOTE] -==== -A mean `score` of "5" is a perfect score, so you can add up the scores on each date, and the number of reviews on each date, and print the ratio of the sum of scores and the number of reviews, and then sort the results numerically, and print only the tail. -==== - -[NOTE] -==== -The date is in the 3rd field and the score is the last field on each line, i.e., the score is stored in `$NF`. -==== - -.Deliverables -==== -- In the reviews sample file, show the three dates on which the mean `score` is a 5. -==== - - -=== Question 4 (2 pts) - -Consider the data in the file `/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv` - -Solve the same question from Project 6, Question 4, again, but this time use associative arrays. By using associative arrays, you can solve this question with just 1 line of awk. You should just use awk one time (not 4 times). By using associative arrays, you can add the total amounts of the values in the `SPEND` column, grouping the values according to the `STORE_R` column, and print the results for all 4 regions using awk just one time. - - -.Deliverables -==== -- Print the sum of the `SPEND` column values corresponding to each of the four store regions. Use `awk` only one time (by using associative arrays). -==== - - -=== Question 5 (2 pts) - -Find the average `DepDelay` from each `Origin` airport in 1990, i.e., using the data in the file `/anvil/projects/tdm/data/flights/subset/1990.csv` - -You *do not need* to print the output for all of the `Origin` airports. Instead, it is OK to include: - -`grep 'EWR\|JFK\|LGA'` - -at the end of your pipeline, so that you are only displaying the average departure delays for the three huge `Origin` airports in New York City. - -.Deliverables -==== -- Print the average departure delays for the three biggest airports located in New York City, namely, the average departure delay from EWR, the average departure delay from JFK, and the average departure delay from LGA. -==== - - -== Submitting your Work - -Please let us know (anytime!) if you need help as you are learning about associative arrays in awk. - - - -.Items to submit -==== -- firstname-lastname-project7.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. 
A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project8.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project8.adoc deleted file mode 100644 index eebf05c84..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project8.adoc +++ /dev/null @@ -1,191 +0,0 @@ -= TDM 20100: Project 8 -- SQL - -**Motivation:** Starting with this project, we will learn about databases, which allow you to write queries about data. - -**Context:** We will focus on SQLite. Once you understand how to make queries in SQLite, you will have a strong foundation that can be used to learn other database resources, such as MySQL. - -**Scope:** SQLite queries that do not need `JOIN` - -.Learning Objectives: -**** -- We will learn how to make SQLite queries on one table at a time (without using the `JOIN`) -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset: - -- `/anvil/projects/tdm/data/lahman/lahman.db` (Lahman baseball database) - -Our page in The Examples Book about SQL (in general) is given here: https://the-examples-book.com/tools/SQL/ - -[IMPORTANT] -==== -Before you begin the project, try the examples from the Lahman baseball database found on this webpage of The Examples Book: https://the-examples-book.com/tools/SQL/lahman-examples-no-joins All of these examples are (relatively) simple and do not need to `JOIN` two tables. They just rely on one table of the database at a time. -==== - -== Questions - -Using the `seminar` kernel, if you run this line in a cell by itself: - -`%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db` - -[TIP] -==== -If your kernel dies, then you need to re-run the line above. You also need to re-run this line at the start of any new Jupyter Lab session. -==== - - -After running the line above (in your session, just once), then you can make SQL queries in subsequent cells in Jupyter Lab, like this, for example, which shows all of the information on 5 lines of the `Pitching` table: - -[source,bash] ----- -%%sql -SELECT * FROM Pitching LIMIT 5; ----- - -or like this, which shows 5 lines corresponding to the `Teams` table for which the number of wins is 110 or larger. - -[source,bash] ----- -%%sql -SELECT * FROM Teams WHERE W >= 110 LIMIT 5; ----- - -[WARNING] -==== -It is really important to include `LIMIT 5` or something similar, for instance, `LIMIT 20`, so that you do not try to print all of the results from a SQL table in your Jupyter Lab notebook. 
-==== - -The list of all of the tables in this database are: - -[source,bash] ----- -AllstarFull -Appearances -AwardsManagers -AwardsPlayers -AwardsShareManagers -AwardsSharePlayers -Batting -BattingPost -CollegePlaying -Fielding -FieldingOF -FieldingOFsplit -FieldingPost -HallOfFame -HomeGames -Managers -ManagersHalf -Parks -People -Pitching -PitchingPost -Salaries -Schools -SeriesPost -Teams -TeamsFranchises -TeamsHalf ----- - -Please read the examples given here: https://the-examples-book.com/tools/SQL/lahman-examples-no-joins and then you are ready to start the questions for this project! - -=== Question 1 (2 pts) - -a. From the `Teams` table, print the row corresponding to the 2023 data for the team with `name = 'Chicago Cubs'`. Your output will be just one row, showing the Cubs overall information for 2023. - -b. From the `Batting` table, print the 48 rows corresponding to the 2023 data for the players from the team with `teamID = 'CHN'` (this is the Chicago Cubs `teamID`). - -[TIP] -==== -For both 1a and 1b, since we only want to see the 2023 results, you need to use `yearID = 2023` as a condition in your query. -==== - - -.Deliverables -==== -- Print the 2023 summary data from the `Teams` table for the team with `name = 'Chicago Cubs'` (just one row of output). - -- Print the 48 rows of table from the `Batting` table for the 2023 Chicago Cubs players. -==== - - -=== Question 2 (2 pts) - -Print the rows of the `Teams` table corresponding to the 4 rows for the winners of the 2020, 2021, 2022, 2023 World Series winning teams. - -.Deliverables -==== -- Print the rows of the `Teams` table corresponding to the 4 rows for the winners of the 2020, 2021, 2022, 2023 World Series winning teams. -==== - - - -=== Question 3 (2 pts) - -a. Considering the `People` table, find the `playerID` for Rickey Henderson. - -b. Using the `playerID` that you found in question 3a, now use the `Batting` table to print all of the rows corresponding to Rickey Henderson's `playerID`. - -c. Finally, again using the `Batting` table, print *only* the `SUM\(R)` and `SUM(SB)` for Rickey Henderson, which are his total number of runs in his career and his total number of stolen bases in his career. - -[TIP] -==== -He had 2295 runs scored altogether and 1406 stolen bases. -==== - - -.Deliverables -==== -- Use the `People` table to find Rickey Henderson's `playerID` -- Print all of the rows of the `Batting` table corresponding to Rickey Henderson. -- Print only the sum of his number of runs in his career and also the sum of his number of stolen bases in his career. -==== - - -=== Question 4 (2 pts) - -a. Use the `Batting` table to find the top 5 players of all time, in terms of their total number of hits, in other words, according to `SUM(H)`. Please print only the top 5 players (their `playerID`) and the number of hits in each of their careers. - -b. Same question as 4a, but this time use home runs (according to `SUM(HR)`) instead of hits. - -.Deliverables -==== -- Print *only* the top 5 players' IDs and the number of hits in each of their careers. -- Print *only* the top 5 players' IDs and the number of home runs in each of their careers. -==== - - -=== Question 5 (2 pts) - -Consider the `Schools` table, group together the schools in each state. Print the number of schools in each group, using `COUNT(*) as mycounts, state` so that you see how many schools are in each state, and the state abbreviation too. 
Order your results according to the values of `mycounts` in descending order (which is denoted by `DESC`), in other words, the states with the most schools are printed first in your list. In this way, by using `LIMIT 5`, you will display the states with the most schools. - - -.Deliverables -==== -- Print a list of the top 5 states according to how many schools are located there, and the number of schools in each of those top 5 states. -==== - - -== Submitting your Work - -We hope that you enjoyed learning about databases this week! Please let us know if we can assist, as you are learning these new ideas! - - - -.Items to submit -==== -- firstname-lastname-project8.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project9.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project9.adoc deleted file mode 100644 index 8777f8d75..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-project9.adoc +++ /dev/null @@ -1,162 +0,0 @@ -= TDM 20100: Project 9 -- SQL - -**Motivation:** Now we learn how to write SQL queries that rely on more than one table. - -**Context:** The `JOIN` in SQL enables us to make queries that rely on information from multiple SQL tables. It is absolutely important to tell SQL which rows need to agree, by including the `ON` portion of the `JOIN` statement. - -**Scope:** SQLite queries use a `JOIN` to gather information from more than one table. - -.Learning Objectives: -**** -- We will learn how to make SQLite queries on multiple tables at a time (using the `JOIN`) -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset: - -- `/anvil/projects/tdm/data/lahman/lahman.db` (Lahman baseball database) - -Our page in The Examples Book about SQL (in general) is given here: https://the-examples-book.com/tools/SQL/ - -[IMPORTANT] -==== -Before you begin the project, try the examples from the Lahman baseball database found on this webpage of The Examples Book: https://the-examples-book.com/tools/SQL/lahman-examples-one-join All of these examples rely on one `JOIN` statement, to extract information from two tables. -==== - -== Questions - -Using the `seminar` kernel, if you run this line in a cell by itself: - -`%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db` - -[TIP] -==== -If your kernel dies, then you need to re-run the line above. You also need to re-run this line at the start of any new Jupyter Lab session. 
-==== - - -Again, we remind students that the list of all of the tables in this database are: - -[source,bash] ----- -AllstarFull -Appearances -AwardsManagers -AwardsPlayers -AwardsShareManagers -AwardsSharePlayers -Batting -BattingPost -CollegePlaying -Fielding -FieldingOF -FieldingOFsplit -FieldingPost -HallOfFame -HomeGames -Managers -ManagersHalf -Parks -People -Pitching -PitchingPost -Salaries -Schools -SeriesPost -Teams -TeamsFranchises -TeamsHalf ----- - -Please read the examples given here: https://the-examples-book.com/tools/SQL/lahman-examples-one-join and then you are ready to start the questions for this project! - -[IMPORTANT] -==== -In the page of examples, sometimes we write `JOIN` and sometimes we write `INNER JOIN`. These are interchangeable; in other words, `JOIN` and `INNER JOIN` mean the same thing. (There are other types of statements such as `LEFT JOIN` and `RIGHT JOIN` but we will not use either of these, in this project.) -==== - -=== Question 1 (2 pts) - -Join the `Batting` table to the `People` by matching the `playerID` values in these two tables. For all 48 players on the 2023 Chicago Cubs team, print their `PlayerID` (from either table), as well as their hits (`H`) and home runs (`HR`) from the `Batting` table, and also their `nameFirst` and `nameLast` from the `People` table. - -.Deliverables -==== -- Print the `playerID`, `H`, `HR`, `nameFirst`, and `nameLast` values for all 48 of the players on the 2023 Chicago Cubs team. -==== - - -=== Question 2 (2 pts) - -Join the `Batting` table to the `Pitching` table by matching the `playerID`, `yearID`, and `stint` columns. There is only one person from 2023 appearing in both of these tables that hit more than 30 home runs. Print this person's `playerID` and the number of home runs (`HR`) that they attained (from the `Batting` table). - - -.Deliverables -==== -- Print the `PlayerID` and the number of home runs (`HR`) from the `Batting` table for the only person who is in both the `Batting` and `Pitching` table in 2023 who had more than 30 home runs (`HR`) in the `Batting` table. -==== - - - -=== Question 3 (2 pts) - -In this question, we will accomplish everything from Project 8, Question 3abc in just one query. - -Join the `People` and `Batting` table by matching the `playerID` values in these two tables. Print only 1 row, corresponding to Rickey Henderson, displaying his `playerID`, `nameFirst`, `nameLast`, `SUM\(R)`, and `SUM(SB)` values. - -[TIP] -==== -He had 2295 runs scored altogether and 1406 stolen bases. -==== - - -.Deliverables -==== -- Print only 1 row, corresponding to Rickey Henderson, displaying his `playerID`, `nameFirst`, `nameLast`, `SUM\(R)`, and `SUM(SB)` values. -==== - - -=== Question 4 (2 pts) - -a. As in Project 8, Question 4a, use the `Batting` table but now also `JOIN` the `People` table (by matching the `playerID` values), to find the top 5 players of all time, in terms of their total number of hits, in other words, according to `SUM(H)`. For the top 5 players (in terms of the total number of hits), print their `playerID`, the `SUM(H)` (in other words, their total number of hits in their careers), and their `nameFirst` and `nameLast` values. - -b. Same question as 4b, but this time use home runs (according to `SUM(HR)`) instead of hits. - - -.Deliverables -==== -- For the top 5 players (in terms of the total number of hits), print their `playerID`, the `SUM(H)` (in other words, their total number of hits in their careers), and their `nameFirst` and `nameLast` values. 
-- For the top 5 players (in terms of the total number of home runs), print their `playerID`, the `SUM(HR)` (in other words, their total number of home runs in their careers), and their `nameFirst` and `nameLast` values. -==== - - -=== Question 5 (2 pts) - -Join the `CollegePlaying` and `People` tables on the `playerID` values. Print the `DISTINCT(playerID)` and `nameFirst` and `nameLast` values from the `People` table for each of the 15 distinct players that have `schoolID = 'purdue'` in the `CollegePlaying` table. - -.Deliverables -==== -- Print the `DISTINCT(playerID)` and `nameFirst` and `nameLast` values from the `People` table for each of the 15 distinct players that have `schoolID = 'purdue'` in the `CollegePlaying` table. -==== - - -== Submitting your Work - -We hope that you enjoyed learning about databases this week! Please let us know if we can assist, as you are learning these new ideas! - - - -.Items to submit -==== -- firstname-lastname-project9.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== - diff --git a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-projects.adoc deleted file mode 100644 index 3a06ee813..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/20100/20100-2024-projects.adoc +++ /dev/null @@ -1,51 +0,0 @@ -= TDM 20100 - -== Important Links - -xref:fall2024/logistics/office_hours.adoc[[.custom_button]#Office Hours#] -xref:fall2024/logistics/syllabus.adoc[[.custom_button]#Syllabus#] -https://piazza.com/purdue/fall2024/tdm1010010200202425[[.custom_button]#Piazza#] - -== Assignment Schedule - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -|=== -| Assignment | Release Date | Due Date -| Syllabus Quiz | Aug 19, 2024 | Aug 30, 2024 -| Academic Integrity Quiz | Aug 19, 2024 | Aug 30, 2024 -| Project 1 - Welcome to the CLI! 
| Aug 19, 2024 | Aug 30, 2024 -| Project 2 - Working with the bash shell | Aug 22, 2024 | Aug 30, 2024 -| Project 3 - Pipelines | Aug 29, 2024 | Sep 06, 2024 -| Project 4 - Pattern matching in pipelines | Sep 05, 2024 | Sep 13, 2024 -| Outside Event 1 | Aug 19, 2024 | Sep 13, 2024 -| Project 5 - More practice with pipelines | Sep 12, 2024 | Sep 20, 2024 -| Project 6 - Awk | Sep 19, 2024 | Sep 27, 2024 -| Project 7 - Awk | Sep 26, 2024 | Oct 04, 2024 -| Outside Event 2 | Aug 19, 2024 | Oct 04, 2024 -| Project 8 - SQL | Oct 03, 2024 | Oct 18, 2024 -| Project 9 - SQL | Oct 17, 2024 | Oct 25, 2024 -| Project 10 - SQL | Oct 24, 2024 | Nov 01, 2024 -| Project 11 - SQL | Oct 31, 2024 | Nov 08, 2024 -| Outside Event 3 | Aug 19, 2024 | Nov 08, 2024 -| Project 12 - SQL | Nov 7, 2024 | Nov 15, 2024 -| Project 13 - SQL | Nov 14, 2024 | Nov 29, 2024 -| Project 14 - Class Survey | Nov 21, 2024 | Dec 06, 2024 -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -// **Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project1.adoc b/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project1.adoc deleted file mode 100644 index 58fe4f527..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project1.adoc +++ /dev/null @@ -1,183 +0,0 @@ -= TDM 30100: Project 01 - Intro to ML - Using Anvil - -== Project Objectives - -We remind ourselves how to use the Anvil platform and how to run Python code in Jupyter Lab. We also remind ourselves about using the Pandas library. This project is intended to be a light start to the fall semester. - -.Learning Objectives -**** -- Create and use Anvil sessions -- Create Jupyter notebooks -- Load dataset with pandas -- Basic data manipulation with pandas -**** - -== Dataset - -This project will use the following dataset: -- `/anvil/projects/tdm/data/iris/Iris.csv` - -== Questions - -=== Question 1 (2 points) - -Let's start out by starting a new Anvil session. If you do not remember how to do this, please read through https://the-examples-book.com/projects/fall2024/10100/10100-2024-project1[Project 1 at the introduction TDM 10100 level]. - -Once you have started a new Anvil session, download https://the-examples-book.com/projects/_attachments/project_template.ipynb[the project template] and upload it. Then, open this template in Jupyter notebook. Save it as a new file with the following naming convention: `lastname_firstname_project#.ipynb`. For example, `doe_jane_project1.ipynb`. - -[NOTE] -==== -You may be prompted to select a kernel when opening the notebook. We will use the `seminar` kernel (not the `seminar-r` kernel) for TDM 30100 projects. 
You are able to change the kernel by clicking on the kernel dropdown menu and selecting the appropriate kernel if needed. -==== - -To make sure everything is working, run the following code cell: -[source,python] ----- -print("Hello, world!") ----- - -Your output should be `Hello, world!`. If you see this, you are ready to move on to the next question. - -Although question 1 is trivially easy, we still want you to (please) get into the habit of commenting on the work in each question. So (please) it would be helpful to write (in a separate cell) something like, "We are reminding ourselves how to use Anvil and how to print a line of output." - -.Deliverables -==== -- Output of running the code cell -- Be sure to document your work from Question 1, using some comments and insights about your work. -==== - -=== Question 2 (2 points) - -Now that we have our Jupyter Lab notebook set up, let's begin working with the pandas library. - -Pandas is a Python library that allows us to work with datasets in tabular form. There are functions for loading datasets, manipulating data, etc. - -To start out with, let's load the Iris dataset that is located at `/anvil/projects/tdm/data/iris/Iris.csv`. - -To do this, you will need to import the pandas library and use the `read_csv` function to load the dataset. - -Run the following code cell to load the dataset: -[source,python] ----- -import pandas as pd - -myDF = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv') ----- - -[NOTE] -==== -In the provided code, pandas is imported as `pd` for brevity. This is a common convention in the Python community. Similarly, `myDF` (short for "my dataframe") is often used as a variable for pandas dataframes. It is not required for you to follow either of these conventions, but it is good practice to do so. -==== - -Now that our dataset is loaded, let's take a look at the first 5 rows of the dataset. To do this, run the following code cell: -[source,python] ----- -myDF.head() ----- - -[NOTE] -==== -The head function is used to display the first n rows of the dataset. By default, n is set to 5. You can change this by passing an integer to the function. For example, `myDF.head(10)` will display the first 10 rows of the dataset. This function is useful for quickly inspecting the dataframe to see what the data looks like. -==== - -.Deliverables -==== -- Output of running the code cell -- Be sure to document your work from Question 2, using some comments and insights about your work. -==== - -=== Question 3 (2 points) - -An important aspect of our dataframe for machine learning is the shape (rows, columns). As you will learn later, the shape will help us determine what kind of machine learning model will be the best fit, as well as how complex it may be. - -To get the shape of the dataframe, run the following code cell: -[source,python] ----- -myDF.shape ----- - -[NOTE] -==== -There are multiple ways to get the number of rows and columns in a DataFrame. `len(myDF.index)` gives the number of rows, and `len(myDF.columns)` gives the number of columns in a DataFrame. The `shape` attribute is commonly preferred because it’s more concise and returns both the number of rows and columns in a single call. -==== - -This returns a tuple in the form (rows, columns). - -.Deliverables -==== -- How many rows are in the dataframe? -- How many columns are in the dataframe? -- Be sure to document your work from Question 3, using some comments and insights about your work. 
-====
-
-=== Question 4 (2 points)
-
-Now that we have loaded the dataset, let's investigate how we can manipulate the data.
-
-One common operation is to select a subset of the data. This is done using the `iloc` function, which allows us to index the dataframe by row and column numbers.
-[NOTE]
-====
-The `iloc` function is extremely powerful. It can be used in way too many ways to list here. For a more comprehensive list of how to use `iloc`, please refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html[the official pandas iloc documentation].
-====
-
-To select the first n rows of the dataframe, we can use the `iloc` function with a slice: `myDF.iloc[:n]`.
-
-Write code to select the first 10 rows of the dataframe from Question 3 into a new dataframe called `myDF_subset`. Print the shape of `myDF_subset` to verify that you have selected the correct number of rows.
-
-We can also use the `iloc` function to select specific columns. To select specific columns, we can also use a slice; however, we must specify the rows we want first. To select all rows, we simply pass a colon `:`. For example, to select the first 10 rows and the first 3 columns, we could use the following code: `myDF.iloc[:10, :3]`.
-
-Write code to select the 40th through 50th rows (inclusive) and the 2nd and 4th columns of the dataframe from Question 3 into a new dataframe called `myDF_subset2`. Print the shape of `myDF_subset2` to verify that you have selected the correct number of rows and columns.
-
-Rows can also be filtered based on a condition, using the closely related `loc` function (which indexes by label or boolean mask rather than by position). For example, if we wanted all rows where `PetalWidthCm` is greater than 1.5, we could use the following code: `myDF.loc[myDF['PetalWidthCm'] > 1.5, :]`.
-
-Write code to select all rows where `SepalLengthCm` is less than 5.0 into a new dataframe called `myDF_subset3`. How many rows are in this dataframe?
-
-.Deliverables
-====
-- Output of printing the shape of `myDF_subset`
-- Output of printing the shape of `myDF_subset2`
-- How many rows are in the `myDF_subset3` dataframe?
-- Be sure to document your work from Question 4, using some comments and insights about your work.
-====
-
-=== Question 5 (2 points)
-
-Another common operation is to remove column(s) from the dataframe. This is done using the `drop` function.
-
-[NOTE]
-====
-Similarly to the `iloc` function, the `drop` function is extremely powerful. For a more comprehensive list of how to use `drop`, please refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html[the official pandas drop documentation].
-====
-
-The most readable way to drop a column is by dropping it by name. To drop column(s) by name, you can use the following syntax: `myDF.drop(['column1_name', 'column2_name', ...], axis=1)`. The `axis=1` argument tells pandas to drop columns, not rows.
-
-Write code to drop the `Id` column from the `myDF_subset` dataframe into a new dataframe called `myDF_without_id`. Print the shape of the dataframe to verify that the column has been removed.
-
-Additionally, we can extract columns from a dataframe into a new dataframe. Extracting a column is very simple: `myDF['column_name']` will return a pandas series containing the values of the column. To extract multiple columns, you can pass a list of column names: `myDF[['column1_name', 'column2_name', ...]]`.
-To then store these series into a new dataframe, we can simply cast the series into a dataframe: `pd.DataFrame(myDF['column_name'])`.
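-
-As a quick illustration of this syntax (using other Iris columns so that it does not give away the answer asked for below), the pattern looks like this:
-
-[source,python]
------
-# a single column comes back as a Series, so we cast it to a DataFrame
-one_col_df = pd.DataFrame(myDF['SepalLengthCm'])
-
-# passing a list of column names already returns a DataFrame
-two_col_df = myDF[['SepalLengthCm', 'PetalWidthCm']]
-
-print(one_col_df.shape)
-print(two_col_df.shape)
------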
- -Write code to extract the `Species` and `SepalWidthCm` columns from the `myDF_without_id` dataframe into a new dataframe called `myDF_species`. Print the shape of the dataframe to verify that the column has been extracted. Print the first 5 rows of the dataframe to verify that the columns have been extracted correctly. - -.Deliverables -==== -- Output of printing the shape of the dataframe after dropping the `Id` column -- Output of printing the first 5 rows of the dataframe after extracting the `Species` and `SepalWidthCm` columns -- Be sure to document your work from Question 5, using some comments and insights about your work. -==== - - -== Submitting your Work - -Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope. - -.Items to submit -==== -- firstname_lastname_project1.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project10.adoc deleted file mode 100644 index c2fa8286f..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project10.adoc +++ /dev/null @@ -1,305 +0,0 @@ -= 301 Project 10 - Regression: Perceptrons - -== Project Objectives - -In this project, we will be learning about perceptrons and how they can be used for regression. We will be using the Boston Housing dataset as it has many different potential features and target variables. - -.Learning Objectives -**** -- Understand the basic concepts behind a perceptron -- Implement activation functions and their derivatives -- Implement a perceptron class for regression -**** - -== Supplemental Reading and Resources - -== Dataset - -- `/anvil/projects/tdm/data/boston_housing/boston.csv` - -== Questions - -=== Question 1 (2 points) - -A perceptron is a simple model that can be used for regression. These perceptrons can be combined together to create neural networks. In this project, we will be creating a perceptron from scratch. - -To start, let's load in the Boston Housing dataset with the below code: -[source,python] ----- -import pandas as pd -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import train_test_split -df = pd.read_csv('/anvil/projects/tdm/data/boston_housing/boston.csv') - -X = df.drop(columns=['MEDV']) -y = df[['MEDV']] - -scaler = StandardScaler() -X = scaler.fit_transform(X) - -X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7) - -y_train = y_train.to_numpy() -y_test = y_test.to_numpy() ----- - -Now, we can begin discussing what a perceptron is. A perceptron is a simple model that takes in a set of inputs and produces an output. 
The perceptron is defined by a set of weights and a bias term (similar to our linear regression model having coefficients and an y intercept term). The perceptron then takes the dot product of the input features and the weights and adds the bias term. - -Then, the perceptron will apply some activation function before outputting the final value. This activation function is some non-linear function that allows the perceptron to learn complex data, instead of behaving as a linear model. - -There are many different activation functions, some of the most common are listed below: - -[cols="2,2,2,2",options="header"] -|=== -|Activation Function | Formula | Derivative | Usage -|Linear | x | 1 | Final layer of regression to output continuous values -|ReLU | max(0, x) | 1 if x > 0, 0 otherwise | Hidden layers of neural networks -|Sigmoid | 1 / (1 + exp(-x)) | sigmoid(x) * (1 - sigmoid(x)) | Final layer of binary classification, or hidden layers of neural networks -|Tanh | (exp(x) - exp(-x)) / (exp(x) + exp(-x)) | 1 - tanh(x)^2 | Hidden layers of neural networks -|=== - -For this project, we will be creating a perceptron class that can be used for regression. There are many different parameters that can be set when creating a perceptron, such as the learning rate, number of epochs, and activation function. - -For this question, please implement functions for the Linear, ReLU, Sigmoid, and Tanh activation functions. Additionally, implement the derivative of each of these functions. These functions should be able to take in a numpy array and return the transformed array. - -[source,python] ----- -import numpy as np -def linear(x): - pass -def linear_d(x): - pass -def relu(x): - pass -def relu_d(x): - pass -def sigmoid(x): - pass -def sigmoid_d(x): - pass -def tanh(x): - pass -def tanh_d(x): - pass ----- - -To test your functions, you can use the below code: -[source,python] ----- -x = np.array([-1, 0, 1]) -print(linear(x)) # should return [-1, 0, 1] -print(linear_d(x)) # should return [1, 1, 1] -print(relu(x)) # should return [0, 0, 1] -print(relu_d(x)) # should return [0, 0, 1] -print(sigmoid(x)) # should return [0.26894142, 0.5, 0.73105858] -print(sigmoid_d(x)) # should return [0.19661193, 0.25, 0.19661193] -print(tanh(x)) # should return [-0.76159416, 0, 0.76159416] -print(tanh_d(x)) # should return [0.41997434, 1, 0.41997434] ----- - -.Deliverables -==== -- Completed activation and derivative functions -- Test the functions with the provided code -==== - -=== Question 2 (2 points) - -Now that we have our activation functions, let's start working on our Perceptron class. This class will create a perceptron that can be used for regression problems. 
Below is a skeleton of our Perceptron class: - -[source,python] ----- -class Perceptron: - def __init__(self, learning_rate=0.01, n_epochs=1000, activation='relu'): - # this will initialize the perceptron with the given parameters - pass - - def activate(self, x): - # this will apply the activation function to the input - pass - - def activate_derivative(self, x): - # this will apply the derivative of the activation function to the input - pass - - def compute_linear(self, X): - # this will calculate the linear combination of the input and weights - pass - - def error(self, y_pred, y_true): - # this will calculate the error between the predicted and true values - pass - - def backward_gradient(self, error, linear): - # this will update the weights and bias of the perceptron - pass - - def predict(self, X): - # this will predict the output of the perceptron given the input - pass - - def train(self, X, y, reset_weights = True): - # this will train the perceptron on the given input and target values - pass - - def test(self, X, y): - # this will test the perceptron on the given input and target values - pass ----- - -Now, it may seem daunting to implement all of these functions. However, most of these functions are as simple as one mathematical operation. - -*For this question, please implement the `__init__`, `activate`, and `activate_derivative` functions.* -The `__init__` function should initialize the perceptron with the given parameters, as well as setting weights and bias terms to None. - -The `activate` function should apply the activation function to the input, and the `activate_derivative` function should apply the derivative of the activation function to the input. It is important that these functions use the proper function based on the value of `self.activation`. Additionally, if the activation function is not set to one of the three functions we implemented earlier, the default should be the ReLU function. - -To test your functions, you can use the below code: -[source,python] ----- -test_x = np.array([-2, 0, 1.5]) -p = Perceptron(learning_rate=0.01, n_epochs=1000, activation='linear') -print(p.activate(test_x)) # should return [-2, 0, 1.5] -print(p.activate_derivative(test_x)) # should return [1, 1, 1] -p.activation = 'relu' -print(p.activate(test_x)) # should return [0, 0, 1.5] -print(p.activate_derivative(test_x)) # should return [0, 0, 1] -p.activation = 'sigmoid' -print(p.activate(test_x)) # should return [0.11920292, 0.5, 0.81757448] -print(p.activate_derivative(test_x)) # should return [0.10499359, 0.25, 0.14914645] -p.activation = 'tanh' -print(p.activate(test_x)) # should return [-0.96402758, 0, 0.90514825] -print(p.activate_derivative(test_x)) # should return [0.07065082, 1, 0.18070664] -p.activation = 'invalid' -print(p.activate(test_x)) # should return [0, 0, 1.5] -print(p.activate_derivative(test_x)) # should return [0, 0, 1] ----- -.Deliverables -==== -- Implement the `__init__`, `activate`, and `activate_derivative` functions -- Test the functions with the provided code -==== - -=== Question 3 (2 points) - -Now, let's move onto the harder topics. The basic concept behind how this perceptron works is that it will take in an input, calculate the predicted value, find the error between the predicted and true value, and then update the weights and bias based on this error and it's learning rate. This process is then repeated for the set number of epochs. 
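-
-To make the idea concrete before you implement it, here is a minimal sketch of a single update on one sample, assuming a plain linear activation (whose derivative is 1) and assuming `numpy` has already been imported as `np` as in the earlier code. It only illustrates the concept and is not the class you are asked to write:
-
-[source,python]
------
-# one SGD-style update for a single sample (linear activation assumed)
-learning_rate = 0.01
-w = np.zeros(3)                       # weights for 3 features
-b = 0.0                               # bias
-x = np.array([1.0, 2.0, 3.0])         # one sample's features
-y_true = 20.0                         # its target value
-
-y_pred = np.dot(x, w) + b             # forward pass
-error = y_true - y_pred               # error (true minus predicted)
-gradient = -error * 1.0               # backward gradient; linear derivative is 1
-w = w - learning_rate * gradient * x  # weight update
-b = b - learning_rate * gradient      # bias update
------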
-
-In this sense, there are two main portions of the perceptron that need to be implemented: the forward and backward passes. The forward pass is the process of calculating the predicted value, and the backward pass is the process of updating the weights and bias based on the calculated error.
-
-*For this question, we will implement the `compute_linear`, `predict`, `error`, and `backward_gradient` functions.*
-
-The `compute_linear` function should calculate the linear combination of the input, weights, and bias, by computing the dot product of the input and weights and adding the bias term.
-
-The `predict` function should compute the linear combination of the input and then apply the activation function to the result.
-
-The `error` function should calculate the error between the predicted (y_pred) and true (y_true) values, i.e., true - predicted.
-
-The `backward_gradient` function should calculate the gradient of the error, which is simply the negative of the error multiplied by the activation derivative of the linear combination.
-
-To test your functions, you can use the below code:
-[source,python]
------
-p = Perceptron(learning_rate=0.01, n_epochs=1000, activation='sigmoid')
-p.weights = np.array([1, 2, 3])
-p.bias = 4
-
-test_X = np.array([1,2,3])
-test_y = np.array([20])
-
-l = p.compute_linear(test_X)
-print(l) # should return 18
-error = p.error(l, test_y)
-print(error) # should return 2
-gradient = p.backward_gradient(error, l)
-print(gradient) # should return -3.04599585e-08
-pred = p.predict(test_X) # should return 0.9999999847700205
-print(pred)
------
-
-.Deliverables
-====
-- Implement the `compute_linear`, `predict`, `error`, and `backward_gradient` functions
-- Test the functions with the provided code
-====
-
-=== Question 4 (2 points)
-
-Now that we have implemented all of our helper functions, we can implement our `train` function.
-
-First, if the argument `reset_weights` is true, or if `reset_weights` is false but the weights and bias are not set, we will initialize our weights to a NumPy array of zeros with the same length as the number of features in our input data. We will also initialize our bias to 0. In any other case, we will not modify the weights and bias.
-
-Then, this function will train the perceptron on the given training data. For each datapoint in the training data, we will get the linear combination of the input and the predicted value through our activation function. Next, we will compute the error and get the backward gradient. Then, we will calculate the gradient for our weights (simply the input times the backward gradient) and the gradient for our bias (simply the backward gradient). Finally, we will update the weights and bias by multiplying the gradients by the learning rate, and subtracting them from the current weights and bias. This process will be repeated for the set number of epochs.
-
-[NOTE]
-====
-In this case, we are updating the weights and bias after every datapoint. This is commonly known as Stochastic Gradient Descent (SGD). Another common method is to calculate our error for every datapoint in the epoch, and then update the weights and bias based on the average error at the end of each epoch. This method is known as Batch Gradient Descent (BGD). A more sophisticated method, called Mini-Batch Gradient Descent (MBGD), is a combination of the two philosophies, where we group our data into small batches and update our weights and bias after each batch. This results in more weight/bias updates than BGD, but less than SGD.
-==== - -In order to test your function, we will create a perceptron and train it on the Boston Housing dataset. We will then print the weights and bias of the perceptron. - -[source,python] ----- -np.random.seed(3) -p = Perceptron(learning_rate=0.01, n_epochs=1000, activation='linear') -p.train(X_train, y_train) -print(p.weights) -print(p.bias) ----- - -If you implemented the functions correctly, you should see the following output: - -[text] ----- -[-1.08035188 0.47131981 0.09222406 0.46998928 -1.90914324 3.14497775 - -0.01770744 -3.04430895 2.62947786 -1.84244828 -2.03654589 0.79672007 - -2.79553875] -[22.44124231] ----- - -.Deliverables -==== -- Implement the `train` function -- Test the function with the provided code -==== - -=== Question 5 (2 points) - -Finally, let's implement the `test` function. This function will test the perceptron on the given test data. This function should return our summary statistics from the previous project (mean squared error, root mean squared error, mean absolute error, and r squared) in a dictionary. - -To test your function, you can use the below code: -[source,python] ----- -p.test(X_test, y_test) ----- - -If you implemented the function correctly, you should see the following output: - -[text] ----- -{'mse': 19.28836923644003, - 'rmse': 4.391852597303333, - 'mae': 3.2841026903192283, - 'r_squared': 0.6875969898568428} ----- - -.Deliverables -==== -- Implement the `test` function -- Test the function with the provided code -- Reflect on what you learned from this project. Please write a short paragraph explaining the general concepts behind the logic and math behind our regression perceptron model. -- Can you think of how these perceptrons can be used in more complex models, such as neural networks? Write a few sentences explaining your thoughts. -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project10.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project11.adoc deleted file mode 100644 index 1d200c325..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project11.adoc +++ /dev/null @@ -1,385 +0,0 @@ -= 301 Project 11 - Regression: Artificial Neural Networks (ANN) - Multilayer Perceptron (MLP) -:page-mathjax: true - -== Project Objectives - -In this project, we will be taking some of what we learned from our Perceptron model and expand upon it to create a functional Artificial Neural Network (ANN) model, specifically a Multi Layer Perceptron (MLP). 
We will use the same dataset (Boston Housing) to compare the performance of our original perceptron with the new ANN.

.Learning Objectives
****
- Understand the basics of artificial neural networks
- Implement a simple artificial neural network
- Train and evaluate an artificial neural network
****

== Supplemental Reading and Resources

== Dataset

- `/anvil/projects/tdm/data/boston_housing/boston.csv`

== Questions

=== Question 1 (2 points)

Across this project and the next one, we will be learning about and implementing neural networks. In this project, we will expand upon the perceptron model we implemented in the previous project to create a more complex model known as a Multilayer Perceptron (MLP). This MLP is a form of ANN that consists of multiple layers, each made up of multiple perceptrons. In the next project, we will be implementing a convolutional neural network (CNN), which is a type of ANN that is particularly suited for data that has spatial relationships, such as images or time series.

In this project, we will use the same dataset as the previous project to compare the performance of our original perceptron model with the new MLP model. We will use the same features and target variable as before. Please run the following code to load the dataset and split it into training and testing sets:
[source,python]
----
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
df = pd.read_csv('/anvil/projects/tdm/data/boston_housing/boston.csv')

X = df.drop(columns=['MEDV'])
y = df[['MEDV']]

scaler = StandardScaler()
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7)

y_train = y_train.to_numpy()
y_test = y_test.to_numpy()
----

Additionally, please copy these solutions for activation functions, their derivatives, and evaluation metrics into your notebook:
[source,python]
----
import numpy as np

def linear(x):
    return x

def linear_d(x):
    return np.ones_like(np.atleast_1d(x))

def relu(x):
    return np.maximum(x, 0)

def relu_d(x):
    return np.where(x>0, 1, 0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_d(x):
    return sigmoid(x) * (1 - sigmoid(x))

def tanh(x):
    return np.tanh(x)

def tanh_d(x):
    return 1 - (tanh(x)**2)

def get_mse(y, y_pred):
    return np.mean((y - y_pred) ** 2)

def get_rmse(y, y_pred):
    return np.sqrt(get_mse(y,y_pred))

def get_mae(y, y_pred):
    return np.mean(np.abs(y - y_pred))

def get_r_squared(y, y_pred):
    return 1 - np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2)

derivative_functions = {
    'relu': relu_d,
    'sigmoid': sigmoid_d,
    'linear': linear_d,
    'tanh': tanh_d
}

activation_functions = {
    'relu': relu,
    'sigmoid': sigmoid,
    'linear': linear,
    'tanh': tanh
}
----


Firstly, let's discuss what the structure of a Multilayer Perceptron looks like. An MLP typically consists of an input layer, some number of hidden layers, and an output layer. Each layer consists of multiple nodes or perceptrons, each with its own weights and bias. Each node passes its output to every node in the next layer, creating a fully connected network. The diagram below shows a simple MLP with an input layer consisting of 3 input nodes, 2 hidden layers with 6 and 4 nodes respectively, and an output layer with 1 node.
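To make the fully connected structure more concrete, here is a minimal NumPy sketch of a single forward step from a 3-node input into a 6-node hidden layer (illustrative only -- the array names, the random weights, and the use of ReLU are assumptions for this sketch, not part of the classes you will build below; it is simply the vectorized equivalent of calling a per-node forward function six times):

[source,python]
----
import numpy as np

rng = np.random.default_rng(0)

inputs = np.array([0.5, -1.2, 2.0])   # 3 input features (the input layer)
W = rng.normal(size=(6, 3))           # one 3-element weight vector per hidden node
b = np.zeros(6)                       # one bias per hidden node

# every hidden node sees every input: one dot product per node, plus that node's bias
hidden = np.maximum(W @ inputs + b, 0)   # ReLU activation
print(hidden.shape)                      # (6,) -- one output per hidden node, all passed to the next layer
----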
- -[NOTE] -==== -In our MLP, the input layer will simply be the features of our dataset, so the features will be passed directly to the first hidden layer. -==== -image::f24-301-p11-1.PNG[Example MLP, width=792, height=500, loading=lazy, title="MLP Diagram"] - -Throughout this project, we will be implementing 3 main classes: 'Node', 'Layer', and 'MLP'. The 'Node' class represents a single neuron in our network, and will store its weights, biases, and a forward function to calculate its output. The 'Layer' class represents one of the layers in our network, and stores a list of nodes, an activation method and its derivative, and a forward function to calculate the output of all nodes in the layer. The 'MLP' class represents the entire network, and stores a list of layers, a forward function to calculate the output of the entire network, a train function to train the model, and a test function to evaluate the model using our evaluation metrics. - -In this question, we will implement the 'Node' class. Please complete the following code to implement the 'Node' class: - -[source,python] ----- -class Node: - def __init__(self, input_size): - # given input size (number of features for the node): - # initialize self.weights to random values with np.random.randn - # initialize self.bias to 0 - pass - - def forward(self, inputs): - # calculate the dot product of the inputs and weights, add the bias, and return the result. Same as the perceptron model. - pass ----- - -You can test your implementation by running the following code: -[source,python] ----- -np.random.seed(11) -node = Node(3) -inputs = np.array([1, 2, 3]) -output = node.forward(inputs) -print(output) # should print -0.276386648990842 ----- - -.Deliverables -==== -- Completed Node class -- Output of the testing code -==== - -=== Question 2 (2 points) - -Next, we will implement our 'Layer' class. The 'Layer' class is slightly more complex, as it will store a list of nodes, an activation function and its derivative, and a forward function to calculate the output of all nodes in the layer and apply the activation_function. Please complete the following code to implement the 'Layer' class: - -[source,python] ----- -class Layer: - def __init__(self, num_nodes, input_size, activation='relu'): - # set self.nodes to be a list of Node objects, with length num_nodes - - # check if the activation function is supported (a key in one of the provided dictionaries). if not, raise a ValueError - - # set self.activation_func and self.activation_derivative to the correct functions from the dictionaries - pass - def forward(self, inputs): - # Create an list of the forward pass output of each node in the layer - - # Apply the activation function to the list of outputs and return the result - pass ----- - -You can test your implementation by running the following code: -[source,python] ----- -np.random.seed(11) -layer = Layer(3, 3, activation='linear') -inputs = np.array([1, 2, 3]) -output = layer.forward(inputs) -print(output) # should print [-0.27638665 -3.62878191 1.35732812] ----- - -.Deliverables -==== -- Completed Layer class -- Output of the testing code -==== - -=== Question 3 (2 points) - -Now that our Node and Layer class are correct, we can move on to implementing the 'MLP' class. This class will store our list of layers, a forward function to calculate output of the model, a train function to train the model, and a test function to evaluate the model using our evaluation metrics. 
In this question, we will implement just the initialization, forward, and test functions. Please begin completing the following 'MLP' class outline: - - -[source,python] ----- - -class MLP: - def __init__(self, layer_sizes, activations): - # we are given 'layer_sizes', a list of numbers, where each number is the number of nodes in the layer. - # The first layer should be the number of features in the input data - # We only need to create the hidden and output layers, as the input layer is simply our input data - # For example, if layer_sizes = [4, 5, 2], we should set self.layers = [Layer(5, 4), Layer(2, 5)] - # Additionally, we are given 'activations', a list of strings, where each string is the name of the activation function for the corresponding layer - # len(activations) will always be len(layer_sizes) - 1, as the input layer does not have an activation function - - # Please set self.layers to be a list of Layer objects, with the correct number of nodes, input size, and activation function. - pass - - def forward(self, inputs): - # for each layer in the MLP, call the forward method with the output of the previous layer - # then, return the final output - pass - - def train(self, X, y, epochs=100, learning_rate=0.0001): - for epoch in range(epochs): - for i in range(len(X)): - # Store the output of each layer in a list, starting with the input data - # You should have a list that looks like [X[i], layer1_output, layer2_output, ..., outputlayer_output] - - # find the error, target value - output value - - - # Now, we can perform our backpropagation to update the weights and biases of our model - # We need to start at the last layer and work our way back to the first layer - for j in reversed(range(len(self.layers))): - # get the layer object at index j - - # get the layer_input and layer_output corresponding to the layer. Remember, self.layers does not contain the input, but outputs list above does - - # calculate the gradient for each node in the layer - # same as the perceptron model, -error * activation_derivative(layer_output). - # However, this time it is a vector, as we are calculating the activation_derivative for everything in the layer at once - - - # Now, we must update the error for the next layer. - # This is so that we can calculate the gradient for the next layer - # This is done by taking the dot product of our gradients by the weights of each node in the current layer - - # Finally, we can update the weights and biases of each node in the current layer - # Remember, our gradient is a list, so each node in the layer will have its own corresponding gradient - # Otherwise, the process is the same as the perceptron model. 
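                    # Illustrative only (hypothetical names -- adapt them to whatever you called your
                    # per-node gradients and this layer's input above). One possible shape of the update,
                    # mirroring the perceptron update rule from the previous project:
                    #     node.weights = node.weights - learning_rate * gradients[k] * layer_input
                    #     node.bias = node.bias - learning_rate * gradients[k]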
- for k, node in enumerate(layer.nodes): - # update the weights and bias of the node - pass - - def test(self, X, y, methods=['mse', 'rmse', 'mae', 'r_squared']): - # Calculate metrics for each method - # First, get the predictions for each input in X - - # Then, for each method the user wants, call the corresponding function with input y and predictions - - # Finally return a dictionary with the metric as key and the result as value - - pass ----- - -To test your implementation of the initialization, forward, and test functions, you can run the following code: -[source,python] ----- -np.random.seed(11) -mlp = MLP([3, 4, 2], ['relu', 'linear']) -inputs = np.array([1, 2, 3]) -output = mlp.forward(inputs) -print(output) # should print [-1.77205771 -0.04217909] - -X = np.array([[1, 2, 3], [4, 5, 6]]) -y = np.array([[0, 1], [1, 0]]) - -metrics = mlp.test(X, y) -print(metrics) # should print {'mse': 2.698031745867772, 'rmse': 1.6425686426654358, 'mae': 1.6083905323341714, 'r_squared': -9.792126983471087} ----- -.Deliverables -==== -- Implementation of the MLP class '__init__', 'forward', and 'test' methods -- Output of the testing code -==== - -=== Question 4 (2 points) - -Now that we have all of our helper functions, we can work on training our model. This process will be very similar to the perceptron model we implemented in the previous project, but with a few key differences. Please read the helping comments in the 'train' method of the 'MLP' class and complete the code to train the model. - -To test your implementation, we will do 2 things: - -Firstly, we will test our MLP model as just a single perceptron, with the same parameters and starting weights as Questions 4 and 5 in the previous project. If everything is implemented correctly, the output of the perceptron last project and the single perceptron MLP here should be the same. -[source,python] ----- -np.random.seed(3) -mlp = MLP([X_train.shape[1], 1], ['linear']) -mlp.layers[0].nodes[0].weights = np.zeros(X_train.shape[1]) -mlp.train(X_train, y_train, epochs=100, learning_rate=0.01) -print(mlp.layers[0].nodes[0].weights) # should print the same weights as the perceptron model -print(mlp.layers[0].nodes[0].bias) # should print the same bias as the perceptron model -mlp.test(X_test, y_test) # should print the same metrics as the perceptron model ----- - - -Next, we can test our MLP model with multiple nodes and layers. - -[NOTE] -==== -Now that we have multiple nodes and layers, these code cells may take a while to run. Please be patient and give yourself enough time to run these tests. -==== - -[source,python] ----- -np.random.seed(3) -mlp = MLP([X_train.shape[1], 2, 3, 1], ['linear','linear','linear']) -mlp.train(X_train, y_train, epochs=1000, learning_rate=0.0001) -mlp.test(X_test, y_test) # should output {'mse': 17.78775654565155, 'rmse': 4.217553383853197, 'mae': 3.2032070058415836, 'r_squared': 0.7119015806656752} ----- - -.Deliverables -==== -- Implementation of the 'train' method in the 'MLP' class -- Output of perceptron model testing code -- Output of MLP model testing code -==== - -=== Question 5 (2 points) - -If you remember from the previous project, with only a single perceptron there is a limit to the how we can try to improve the model. We can train it for more epochs, or adjust its learning rate, but there isn't much beyond that. 
However, now that we have an MLP model, we can experiment with different numbers of layers, nodes in each layer, the activation functions of those layers, as well as the learning rate and number of epochs. - -Please experiment with different numbers of layers, number of nodes in each layer, activation functions, learning rates, and/or number of epochs. For this question, please provide a brief summary of what you tried, and what you noticed. You are not required to try and improve the metrics of the model, but you are welcome to try if you would like. - -[IMPORTANT] -==== -This model is VERY sensitive to the learning rate, number of epochs, and the number of nodes in each layer. If you are not seeing any improvement in your metrics, try adjusting these parameters. Additionally, the model may take a long time to train, so please give yourself enough time to experiment with different parameters. It is recommended to have a maximum of 3 hidden layers (not including the input and output layers) and a maximum of 10 nodes in each layer to ensure your model trains in a reasonable amount of time. -A common problem you may face is the vanishing gradient and exploding gradient problem. This is when the gradients of the weights become very small or large, respectively, and the model is unable to learn. You will know you have exploding gradients if your outputs become nan, inf, or some extremely large number. You may have vanishing gradients if your model seems to not be learning at all. Learning rate and number of epochs are the most common ways to combat these problems, but you may also need to experiment with different activation functions and the number of nodes and layers. -==== - -.Deliverables -==== -- Student has some code that shows them adjusting parameters and experimenting with different configurations -- Student has a brief summary of what they tried and what they noticed -==== - -=== Question 6 (2 points) - -Currently, we are simply filling the weights of our nodes with random values. However, depending on the activation function of the layer, we may want to initialize our weights differently to help promote model convergence and avoid potential gradient problems. There are many different weight initialization methods depending on the activation function, however there are 2 extremely popular choices: Xavier Initialization and He Initialization. These methods are described below: - -[cols="4,4,4", options="header"] -|=== -| Initialization Method | Description | Formula -| Xavier | Commonly used for tanh and sigmoid activation functions to help ensure that the variance is maintained throughout the model | $W =np.random.normal(0, np.sqrt(2/(input\_size+output\_size)), input\_size)$ -| He | Used for ReLU based activation functions to ensure that they do not vanish | $W = np.random.normal(0, np.sqrt(2/inputs), inputs)$ -|=== - -[NOTE] -The form of Xavier depicted above is for a normal distribution. However, there also exists a uniform distribution version of Xavier Initialization, with the formula $W = np.random.uniform(-\sqrt{6/(input\_size+output\_size)}, \sqrt{6/(input\_size+output\_size)}, input\_size)$. You are not required to implement this version, but you are welcome to if you would like. - -Please modify the 3 main classes to be able to change the initialization function of the weights. The MLP class will now take 3 lists as input: 'layer_sizes', 'activations', and 'initializations'. 
'initializations' will be a list of strings, where each string is the name of the initialization function for the corresponding layer. The valid values for this list should be 'random', 'xavier', and 'he'. You will need to modify the 'Node' class to accept an initialization method, and modify the 'Layer' class to pass this method to the 'Node' class. You will also need to modify the 'MLP' class to pass the initialization method to the 'Layer' class. - -After you have implemented this, run the below code to visualize the distributions of the weights to confirm that they are being initialized correctly. -[source,python] ----- -np.random.seed(1) -initialized_mlp = MLP([80,80,80,80], ['relu','relu','relu'], ['random','xavier','he']) - -original_random = initialized_mlp.layers[0].nodes[0].weights -xavier = initialized_mlp.layers[1].nodes[0].weights -he = initialized_mlp.layers[2].nodes[0].weights - -import matplotlib.pyplot as plt - -plt.hist(original_random, bins=50, alpha=0.5, label='Random') -plt.hist(xavier, bins=50, alpha=0.5, label='Xavier') -plt.hist(he, bins=50, alpha=0.5, label='He') -plt.legend(loc='upper right') -plt.show() ----- - -.Deliverables -==== -- Implementation of the 'initializations' parameter in the 'MLP' class -- Modification of the 'Node' and 'Layer' classes to accept and pass the initialization method -- Output of the testing code -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project11.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project12.adoc deleted file mode 100644 index 7a32fa43b..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project12.adoc +++ /dev/null @@ -1,188 +0,0 @@ -= 301 Project 12 - Regression: Bayesian Ridge Regression -:page-mathjax: true - -== Project Objectives - -In this project, we will be exploring Bayesian Ridge Regression using the scikit-learn library. We will use the beer review dataset to implement Bayesian Ridge Regression and evaluate the performance of the model using various metrics. 
- -.Learning Objectives -**** -- Understand the concept of Bayesian Ridge Regression -- Implement Bayesian Ridge Regression using scikit-learn -- Evaluate the performance of a Bayesian Ridge Regression model on the beer review dataset -**** - -== Supplemental Reading and Resources - -- https://medium.com/intuition/gentle-introduction-of-bayesian-linear-regression-c83da6b0d1f7[Medium Article on Bayesian Linear Regression] - -== Dataset - -- `/anvil/projects/tdm/data/beer/reviews_sample.csv` - -== Questions - -=== Question 1 (2 points) - -Bayes Theorem is a fundamental theorem in probability theory. Bayes Theorem allows us to invert conditional probabilities, e.g., if we know the probability of event A given event B occured, we can calculate the probability of event B given event A occured. This theorem can be used in machine learning to estimate the probability of a model parameter given the data. Traditionally, our model parameters are estimated by minimizing our loss function. However, with Bayesian Ridge Regression, the model parameters are treated as random variables, and the posterior distribution of the model parameters is estimated. This allows the model to not only make predictions, but also provide a measure of uncertainty in its predictions. Due to the heavy mathematical nature of Bayesian Ridge Regression, we will not be writing it from scratch in this project. Instead, we will be using the scikit-learn library to implement it. If you would like to learn about the mathematical details of this model, please read the supplemental reading. - -Firstly, let's load the beer reviews sample data. - -[source,python] ----- -import pandas as pd -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import train_test_split -df = pd.read_csv('/anvil/projects/tdm/data/beer/reviews_sample.csv') - -df.dropna(subset=['look','smell','taste','feel','overall', 'score'], inplace=True) -X = df[['look','smell','taste','feel', 'overall']] -y = df[['score']] - - -scaler = StandardScaler() -X = scaler.fit_transform(X) - -X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7) - -y_train = y_train.to_numpy().ravel() -y_test = y_test.to_numpy().ravel() ----- - -Additionally, load our metric functions by running the code below. -[source,python] ----- -import numpy as np -def get_mse(y, y_pred): - return np.mean((y - y_pred) ** 2) - -def get_rmse(y, y_pred): - return np.sqrt(get_mse(y,y_pred)) - -def get_mae(y, y_pred): - return np.mean(np.abs(y - y_pred)) - -def get_r_squared(y, y_pred): - return 1 - np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2) ----- - -Next, we will create an instance of scikit-learn's `BayesianRidge` class and fit the model to the training data. We will then use the model to make predictions on the test data. -[source,python] ----- -from sklearn.linear_model import BayesianRidge - -model = BayesianRidge() -model.fit(X_train, y_train) - -y_pred = model.predict(X_test) ----- - -Please calculate and print the mean squared error of the Bayesian Ridge Regression model on the test data, and also the output of the RMSE of the model. - -.Deliverables -==== -- Mean Squared Error of the Bayesian Ridge Regression model on the test data -- Output of the RMSE of the model -==== - -=== Question 2 (2 points) - -A powerful ability of the Bayesian Ridge Regression model, as mentioned earlier, is its ability to provide uncertainty estimates in its predictions. 
Through scikit-learn, we can access the standard deviation of the posterior distribution of the model parameters. From this, we know the uncertainty in our prediction, allowing us to graph confidence intervals around our predictions. - -Firstly, train the Bayesian Ridge Regression model with only the 'smell', 'taste', and 'feel' columns as our features, using the below code: -[source,python] ----- -X_train_3 = X_train[:,[1,2,3]] -X_test_3 = X_test[:,[1,2,3]] - -model = BayesianRidge() -model.fit(X_train_3, y_train) ----- - -Now, write code to get the y_predictions on the test set, and graph the y_test values and y_pred values on a single graph using matplotlib (you should be familiar with this syntax from previous projects, look back to those if you need a refresher). -[IMPORTANT] -==== -If we leave these unsorted and try to graph it, it will be a complete mess due to the points be randomly selected by the train_test_split function. By sorting one of the arrays, we can graph the points in a more orderly fashion. You can run the below code to sort both the y_test and y_pred arrays based on the y_test values from smallest to largest. -==== -[source,python] ----- -# sort the y_test array from smallest to largest, and use that order to sort the y_pred array -y_test_sorted = y_test[np.argsort(y_test)] -y_pred_sorted = y_pred[np.argsort(y_test)] ----- - -You may notice that the graph is a bit messy as the predictions are not perfect. To get a better visualization, we can overlay our confidence intervals on the graph. A confidence interval is a range of values that is some percentage likely to contain the true value. For example, a 95% confidence interval around a predicted value means that we are 95% confident that the true value lies within that range. The number of standard deviations away from the mean (or predicted value) determines the confidence level. Below is a table of the number of standard deviations and their corresponding confidence levels. Additionally, you can use the following formula to calculate the number of standard deviations away from the mean for a given confidence level: -[cols="2,2", options="header"] -|==== -|Number of Standard Deviations | Confidence Level -|1 | 68.27% -|2 | 95.45% -|3 | 99.73% -|4 | 99.994% -|==== - -How do we get these confidence levels from the model? scikit_learn makes it very easy, by providing an optional argument to the `predict` method. By setting the `return_std` argument to True, the predict method will return a tuple of the list of predictions and a list of the standard deviations for each prediction. Then, we can use the standard deviations to calculate the confidence intervals. - -In order to graph the confidence intervals, you will need to calculate the upper and lower bounds of the confidence interval for each prediction. Then, you can use the matplotlib `fill_between` function to fill in the area between the upper and lower bounds. Please graph the y_test values and the 68.27% confidence intervals of the y_pred values on the same graph. - -.Deliverables -==== -- Graph of the y_test values against the y_pred values -- Graph displaying the y_test values and the 68.27% confidence intervals of the y_pred values -==== - -=== Question 3 (2 points) - -Now that you know how to use the Bayesian Ridge Regression model to get uncertainty estimates in your predictions, let's see how changing other model parameters can affect both our model's performance and uncertainty. 
The `BayesianRidge` class has several parameters that can be tuned to improve the model's performance. A list of these parameters can be found in the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html. For the next 3 questions, we will be exploring the following parameters: -Question 3: `n_iter` - The number of iterations to run the optimization algorithm. The default value is 300. -Question 4: `alpha_1` and `alpha_2 - The shape and inverse scale parameters for the Gamma distribution prior over the alpha parameter. The default values are 1e-6. -Question 5: `lambda_1` and `lambda_2` - The shape and inverse scale parameters for the Gamma distribution prior over the lambda parameter. The default values are 1e-6. - -For this question, please generate 5 models with different values of `n_iter` ranging from 100 to 500 in increments of 100. For each model, train the model on the training data and calculate the RMSE on the test data. Please print the RMSE for each model. Then, plot the y_test values vs the 95.45% confidence intervals of the y_pred values for all models. Graph each confidence interval as a different color on the same graph. - -.Deliverables -==== -- RMSE for each model -- Graph of the y_test values and the 95.45% confidence intervals of the y_pred values for all models -- How does the n_iter parameter affect the model's rmse and uncertainty? -==== - -=== Question 4 (2 points) - -For this question, please select 5 different `alpha_1` values. Then, for each of these values, train the model on the training data and calculate the RMSE on the test data. Please print the RMSE for each model. Then, plot the y_test values vs the 95.45% confidence intervals of the y_pred values for all models. Graph each confidence interval as a different color on the same graph. Do the same for the `alpha_2` parameter. - -.Deliverables -==== -- RMSE for each model with a different alpha_1 value -- RMSE for each model with a different alpha_2 value -- Graph of the y_test values and the 95.45% confidence intervals of the y_pred values for each model with a different alpha_1 value -- Graph of the y_test values and the 95.45% confidence intervals of the y_pred values for each model with a different alpha_2 value -- How do the alpha_1 and alpha_2 parameters affect the model's rmse and uncertainty? -==== - -=== Question 5 (2 points) - -For this question, please select 5 different `lambda_1` values. Then, for each of these values, train the model on the training data and calculate the RMSE on the test data. Please print the RMSE for each model. Then, plot the y_test values vs the 95.45% confidence intervals of the y_pred values for all models. Graph each confidence interval as a different color on the same graph. Do the same for the `lambda_2` parameter. - -.Deliverables -==== -- RMSE for each model with a different lambda_1 value -- RMSE for each model with a different lambda_2 value -- Graph of the y_test values and the 95.45% confidence intervals of the y_pred values for each model with a different lambda_1 value -- Graph of the y_test values and the 95.45% confidence intervals of the y_pred values for each model with a different lambda_2 value -- How do the lambda_1 and lambda_2 parameters affect the model's rmse and uncertainty? -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project12.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. 
A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project13.adoc deleted file mode 100644 index da311d348..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project13.adoc +++ /dev/null @@ -1,337 +0,0 @@ -= 301 Project 13 - Hyperparameter Tuning -:page-mathjax: true - -== Project Objectives - -In this project, we will be exploring different methods for hyperparameter tuning, and applying them to models from previous projects. - -.Learning Objectives -**** -- Understand the concept of hyperparameters -- Learn different methods for hyperparameter tuning -- Apply hyperparameter tuning to a Random Forest Classifier and a Bayesian Ridge Regression model -**** - -== Supplemental Reading and Resources - -== Dataset - -- `/anvil/projects/tdm/data/iris/Iris.csv` -- `/anvil/projects/tdm/data/boston_housing/boston.csv` - -== Questions - -=== Question 1 (2 points) - -Hyperparameters are parameters that are not learned by the model, but are rather set by the user before training. They include parameters you should be familiar with, such as the learning rate, the number of layers in a neural network, the number of trees in a random forest, etc. There are many different methods for tuning hyperparameters, and in this project we will explore a few of the common methods. - -[NOTE] -==== -Typically, hyperparameter tuning would be performed with a small subset of the data, and the best or top n models would be selected for further evaluation on the full dataset. For the purposes of this project, we will be using the full dataset for hyperparameter tuning, as the dataset is small enough to do so. -==== - -Firstly, let's load both the iris dataset and boston housing dataset. 
- -[source,python] ----- -import pandas as pd -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import train_test_split - -df = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv') -X = df.drop(['Species','Id'], axis=1) -y = df['Species'] -scaler = StandardScaler() -X_scaled = scaler.fit_transform(X) -X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(X_scaled, y, test_size=0.2, random_state=20) -y_train_iris = y_train_iris.to_numpy() -y_test_iris = y_test_iris.to_numpy() - -df = pd.read_csv('/anvil/projects/tdm/data/boston_housing/boston.csv') -X = df.drop(columns=['MEDV']) -y = df[['MEDV']] -scaler = StandardScaler() -X = scaler.fit_transform(X) -X_train_boston, X_test_boston, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7) -y_train_boston = y_train.to_numpy() -y_test_boston = y_test.to_numpy() -y_train_boston = y_train_boston.ravel() -y_test_boston = y_test_boston.ravel() ----- - -Random search is a hyperparameter optimization algorithm that, as you might guess from the name, randomly selects hyperparameters to evaluate from a range of possible values. This algorithm is very simple to implement, but greatly lacks the efficiency of other search algorithms. It is often used as the baseline to compare more advanced algorithms against. - -The algorithm is as follows: - -1. Define a set of values for each hyperparameter to search over -2. For a set number of iterations, randomly select values from the set for each hyperparameter -3. Train the model with the selected hyperparameter values -4. Evaluate the model -5. Repeat steps 2-4 for the specified number of iterations -6. Pick the best model - -For this question, you will implement the following function: -[source,python] ----- -import numpy as np -def random_search(model, param_dict, X_train, y_train, X_test, y_test, n_iter=10): - np.random.seed(2) - # Initialize best score - best_score = -np.inf - best_model = None - best_params = None - - # Loop over number of iterations - for i in range(n_iter): - # Randomly select hyperparameters with np.random.choice. for each param: valid_choices_list pair in param_dict - # this should result in a dictionary of hyperparameter: value pairs - params = {} - # your code here to fill in params - - # Create model with hyperparameters - model.set_params(**params) - - # Train model with model.fit - # your code here - - # Evaluate model with model.score - # your code here - - # Update best model if necessary - # your code here - - return best_model, best_params, best_score ----- - -After creating the function, run the following test cases to ensure that your function is working correctly. 
-[source,python] ----- -# Test case 1 with iris dataset -from sklearn.linear_model import BayesianRidge -model = BayesianRidge() -param_dict = {'max_iter': [100, 200, 300, 400, 500], 'alpha_1': [1e-6, 1e-5, 1e-4, 1e-3, 1e-2], 'alpha_2': [1e-6, 1e-5, 1e-4, 1e-3, 1e-2], 'lambda_1': [1e-6, 1e-5, 1e-4, 1e-3, 1e-2], 'lambda_2': [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]} - -best_model, best_params, best_score = random_search(model, param_dict, X_train_boston, y_train_boston, X_test_boston, y_test_boston, n_iter=10) -# print the best parameters and score -print(best_params) -print(best_score) - -# Test case 2 with boston housing dataset -from sklearn.ensemble import RandomForestClassifier -model = RandomForestClassifier() -param_dict = {'n_estimators': [10, 50, 100, 200, 500], 'max_depth': [None, 5, 10, 20, 50], 'min_samples_split': [2, 5, 10, 20, 50], 'min_samples_leaf': [1, 2, 5, 10, 20]} -best_model, best_params, best_score = random_search(model, param_dict, X_train_iris, y_train_iris, X_test_iris, y_test_iris, n_iter=10) -# print the best parameters and score -print(best_params) -print(best_score) ----- - -.Deliverables -==== -- Outputs of running test cases for Random Search -==== - -=== Question 2 (2 points) - -Grid search is another hyperparameter optimization algorithm that is more systematic than random search. It evaluates all possible combinations of hyperparameters within a specified range. This algorithm is very simple to implement, but can be computationally expensive, especially with a large number of hyperparameters and values to search over. - - -The algorithm is as follows: -1. Compute every combination of hyperparameters -2. Train the model with a combination -3. Evaluate the model -4. Repeat steps 2-3 for every combination -5. Pick the best - -[source,python] ----- -from itertools import product - -def grid_search(model, param_dict, X_train, y_train, X_test, y_test, n_iter=10): - # Initialize best score - best_score = -np.inf - best_model = None - best_params = None - - # find every combination and store it as a list - # HINT: if you put * before a list, it will unpack the list into individual arguments - combinations = # your code here - - # now that we have every combination of values, repack it into a list of dictionaries (param: value pairs) using zip - param_combinations = # your code here - - # Loop over every combination - for params in param_combinations: - # Create model with hyperparameters - model.set_params(**params) - - # Train model with model.fit - # your code here - - # Evaluate model with model.score - # your code here - - # Update best model if necessary - # your code here - - return best_model, best_params, best_score ----- - -After creating the function, run the following test cases to ensure that your function is working correctly. 
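Before running them, if the `*`-unpacking hint above is unclear, here is a small standalone illustration (with a made-up two-parameter grid) of how `itertools.product` and `zip` can turn a parameter dictionary into a list of parameter-combination dictionaries:

[source,python]
----
from itertools import product

# Hypothetical parameter grid, just for illustration
param_dict = {'n_estimators': [10, 50], 'max_depth': [None, 5]}

# product(*values) unpacks the value lists into separate arguments for product()
combinations = list(product(*param_dict.values()))
print(combinations)           # [(10, None), (10, 5), (50, None), (50, 5)]

# zip the keys back onto each combination to rebuild one dictionary per combination
param_combinations = [dict(zip(param_dict.keys(), combo)) for combo in combinations]
print(param_combinations[0])  # {'n_estimators': 10, 'max_depth': None}
----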
-[source,python] ----- -# Test case 1 with iris dataset -from sklearn.linear_model import BayesianRidge -model = BayesianRidge() -param_dict = {'max_iter': [100, 200, 300], 'alpha_1': [1e-6, 1e-5, 1e-4], 'alpha_2': [1e-6, 1e-5, 1e-4], 'lambda_1': [1e-6, 1e-5, 1e-4], 'lambda_2': [1e-6, 1e-5, 1e-4]} - -best_model, best_params, best_score = grid_search(model, param_dict, X_train_boston, y_train_boston, X_test_boston, y_test_boston, n_iter=10) -# print the best parameters and score -print(best_params) -print(best_score) - -# Test case 2 with boston housing dataset -from sklearn.ensemble import RandomForestClassifier -model = RandomForestClassifier() -param_dict = {'n_estimators': [100, 200, 500], 'max_depth': [10, 20, 50], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 5]} -best_model, best_params, best_score = grid_search(model, param_dict, X_train_iris, y_train_iris, X_test_iris, y_test_iris, n_iter=10) -# print the best parameters and score -print(best_params) -print(best_score) ----- - -.Deliverables -==== -- Outputs of running test cases for Grid Search -==== - -=== Question 3 (2 points) - -Bayesian optimization is a more advanced hyperparameter optimization algorithm that uses a probabilistic model to predict the performance of a model with a given set of hyperparameters. It then uses this model to select the next set of hyperparameters to evaluate. This algorithm is more efficient than random search and grid search, but significantly more complex to implement. - -The algorithm is as follows: -1. Define a search space for each hyperparameter to search over -2. Define an object function that takes hyperparameters as an input and scores the model (set_params, fit, score) -3. Run the optimization algorithm to find the best hyperparameters - -For this question, we will be using scikit-optimize, a library designed for model-based optimization in python. Please run the following code cell to install the library. -[source,python] ----- -pip install scikit-optimize ----- - -[NOTE] -==== -You may need to restart the kernel after the installation is complete. -==== - -For this question, you will implement the following function: -[source,python] ----- -from skopt import gp_minimize -from skopt.space import Real, Integer -from skopt.utils import use_named_args - -def bayesian_search(model, param_dict, X_train, y_train, X_test, y_test, n_iter=10): - # For each hyperparameter in param_dict, we need to create a Real or Integer object and add it to the space list. - # both of these classes have the following parameters: low, high, name. Real is for continuous hyperparameters that have floating point values, and Integer is for discrete hyperparameters that have integer values. - # so, for example, if {'max_iter': (1,500), 'alpha_1': (1e-6,1e-2)} is passed in for param_dict: - # We should create an Integer(low=1, high=500, name='max_iter') object for the first param, as it uses integer values - # and a Real(low=1e-6, high=1e-2, name='alpha_1') object for the second param, as it uses floating point values - # - # All of these objects should be added to the space list - - space = [] - # your code here - - # Define the objective function - @use_named_args(space) - def objective(**params): - # Create model with hyperparameters - model.set_params(**params) - - # Train model with model.fit - # your code here - - # Evaluate model with model.score - # your code here - - # as this is a minimization algorithm, it thinks lower scores are better. 
Therefore, we need to return the negative score - return -score - - # Run the optimization - res = gp_minimize(objective, space, n_calls=n_iter, random_state=0) - - # Get the best parameters - best_params = dict(zip(param_dict.keys(), res.x)) - best_score = -res.fun - - return model, best_params, best_score ----- - -After creating the function, run the following test cases to ensure that your function is working correctly. - -[source,python] ----- -from sklearn.linear_model import BayesianRidge -model = BayesianRidge() -param_dict = {'max_iter': (1,50), 'alpha_1': (1e-6,1e-2), 'alpha_2': (1e-6,1e-2), 'lambda_1': (1e-6,1e-2), 'lambda_2': (1e-6,1e-2)} - -best_model, best_params, best_score = bayesian_search(model, param_dict, X_train_boston, y_train_boston, X_test_boston, y_test_boston, n_iter=10) -# print the best parameters and score -print(best_params) -print(best_score) - -# Test case 2 with boston housing dataset -from sklearn.ensemble import RandomForestClassifier -model = RandomForestClassifier() -param_dict = {'n_estimators': (100,500), 'max_depth': (5,50), 'min_samples_split': (1,20), 'min_samples_leaf': (1,10)} -best_model, best_params, best_score = bayesian_search(model, param_dict, X_train_iris, y_train_iris, X_test_iris, y_test_iris, n_iter=10) -# print the best parameters and score -print(best_params) -print(best_score) ----- - -.Deliverables -==== -- Outputs of running test cases for Bayesian Search -==== - -=== Question 4 (2 points) - -Now that we have implemented these three hyperparameter tuning algorithms, let's compare their performance to each other. For this question, please apply all three tuning algorithms to a Bayesian Ridge Regression model on the boston housing dataset. In addition to their scores, please also compare the time it takes to run each algorithm. Graph these results using 2 bar charts, one for score and one for time. - -[NOTE] -==== -The Bayseian Ridge Regression model will have a very similar accuracy for all three tuning algorithms. Please have the y-axis of the score plot be adjusted to be from 0.690515 to 0.690517 with axis.set_ylim(0.690515, 0.690517) -==== - -.Deliverables -==== -- Bar charts displaying the scores and times for each hyperparameter tuning algorithm -==== - -=== Question 5 (2 points) - -There are still many other hyperparameter methods that we have not explored. For example, you could have a more complex grid search, a more advanced Bayesian optimization algorithm, or even a genetic algorithm. For this question, please identify, explain, and implement another hyperparameter tuning algorithm. Repeat your code from question 4, but include the new algorithm. How does this algorithm compare to the other three? - -.Deliverables -==== -- Explanation of your new hyperparameter tuning algorithm -- Bar charts displaying the scores and times for each hyperparameter tuning algorithm, including the new algorithm -- Explanation of how the new algorithm compares to the other three -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project13.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. 
- -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project14.adoc b/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project14.adoc deleted file mode 100644 index 97e698e8a..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project14.adoc +++ /dev/null @@ -1,89 +0,0 @@ -= TDM 30100: Project 14 -- 2024 - -**Motivation:** We covered a _lot_ this semester, including machine learning, classifiers, regression, and neural networks. We hope that you have had the opportunity to learn a lot, and to improve your data science skills. For our final project of the semester, we want to provide you with the opportunity to give us your feedback on how we connected different concepts, built up skills, and incorporated real-world data throughout the semester, along with showcasing the skills you learned throughout the past 13 projects! - -**Context:** This last project will work as a consolidation of everything we've learned thus far, and may require you to back-reference your work from earlier in the semester. - -**Scope:** reflections on Data Science learning - -.Learning Objectives: -**** -- Reflect on the semester's content as a whole -- Offer your thoughts on how the class could be improved in the future -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 (2 pts) - -The Data Mine team is writing a Data Mine book to be (hopefully) published in 2025. We would love to have a couple of paragraphs about your Data Mine experience. What aspects of The Data Mine made the biggest impact on your academic, personal, and/or professional career? Would you recommend The Data Mine to a friend and/or would you recommend The Data Mine to colleagues in industry, and why? You are welcome to cover other topics too! Please also indicate (yes/no) whether it would be OK to publish your comments in our forthcoming Data Mine book in 2025. - -.Deliverables -==== -Feedback and reflections about The Data Mine that we can potentially publish in a book in 2025. -==== - -=== Question 2 (2 pts) - -Reflecting on your experience working with different projects, which one did you find most enjoyable, and why? Illustrate your explanation with an example from one question that you worked on. - -.Deliverables -==== -- A markdown cell detailing your favorite project, why, and a working example and question you did involving that project. -==== - -=== Question 3 (2 pts) - -While working on the projects, how did you validate the results that your code produced? Are there better ways that you would suggest for future students (and for our team too)? Please illustrate your approach using an example from one problem that you addressed this semester. - -.Deliverables -==== -- A few sentences in a markdown cell on how you conducted your work, and a relevant working example. 
-==== - -=== Question 4 (2 pts) - -Reflecting on the projects that you completed, which question(s) did you feel were most confusing, and how could they be made clearer? Please cite specific questions and explain both how they confused you and how you would recommend improving them. - -.Deliverables -==== -- A few sentences in a markdown cell on which questions from projects you found confusing, and how they could be written better/more clearly, along with specific examples. -==== - -=== Question 5 (2 pts) - -Please identify 3 skills or topics related to ML, classifiers, regression, neural networks, etc., or data science (in general) that you wish we had covered in our projects. For each, please provide an example that illustrates your interests, and the reason that you think they would be beneficial. - -.Deliverables -==== -- A markdown cell containing 3 skills/topics that you think we should've covered in the projects, and an example of why you believe these topics or skills could be relevant and beneficial to students going through the course. -==== -=== OPTIONAL but encouraged: - -Please connect with Dr Ward on LinkedIn: https://www.linkedin.com/in/mdw333/ - -and also please follow our Data Mine LinkedIn page: https://www.linkedin.com/company/purduedatamine/ - -and join our Data Mine alumni page: https://www.linkedin.com/groups/14550101/ - - - -== Submitting your Work - -If there are any final thoughts you have on the course as a whole, be it logistics, technical difficulties, or nuances of course structuring and content that we haven't yet given you the opportunity to voice, now is the time. We truly welcome your feedback! Feel free to add as much discussion as necessary to your project, letting us know how we succeeded, where we failed, and what we can do to make this experience better for all our students and partners in 2025 and beyond. - -We hope you enjoyed the class, and we look forward to seeing you next semester! - -.Items to submit -==== -- firstname_lastname_project14.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project2.adoc b/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project2.adoc deleted file mode 100644 index 135421d3a..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project2.adoc +++ /dev/null @@ -1,202 +0,0 @@ -= TDM 30100: Project 02 - Intro to ML - Basic Concepts - -== Project Objectives - -In this project, we will learn how to select an appropriate machine learning model. Understanding specifics of how the models work may help in this process, but other aspects can be investigated for this. 
- -.Learning Objectives -**** -- Learn the difference between classification and regression -- Learn the difference between supervised and unsupervised learning -- Learn how our dataset influences our model selection -**** - -== Supplemental Reading and Resources - -- https://the-examples-book.com/starter-guides/data-science/data-modeling/choosing-model/[DataMine Examples Book - Choosing a Model] -- https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170339559901081[Probabilistic Machine Learning: An Introduction by Kevin Murphy] - -== Datasets - -- `/anvil/projects/tdm/data/iris/Iris.csv` -- `/anvil/projects/tdm/data/boston_housing/boston.csv` - -[NOTE] -==== -The Iris dataset is a classic dataset that is often used to introduce machine learning concepts. You can https://www.kaggle.com/uciml/iris[read more about it here]. -If you would like more information on the boston dataset, https://www.kaggle.com/code/prasadperera/the-boston-housing-dataset[please read here]. -==== - -== Questions - -=== Question 1 (2 points) - -In this project, we will use the Iris dataset and the boston dataset as samples to learn about the various aspects that go into choosing a machine learning model. Let's review last project by loading the Iris and boston datasets, then printing the first 5 rows of each dataset. - -.Deliverables -==== -- Output of running code to print the first 5 rows of both datasets. -==== - -=== Question 2 (2 points) - -One of the most distinguishing features of machine learning is the difference between classification and regression. - -Classification is the process of predicting a discrete class label. For example, predicting whether an email is spam or not spam, whether a patient has a disease or not, or if an animal is a dog or a cat. - -Regression is the process of predicting a continuous quantity. For example, predicting the price of a house, the temperature tomorrow, or the weight of a person. - -[NOTE] -==== -Some columns may be misleading. Just because a column is a number does not mean it is a regression problem. One-hot encoding is a technique used to convert categorical variables into numerical variables (we will cover this deeper in future projects). Therefore, it is important to try and understand what a column represents, as just seeing a number does not necessarily mean it corresponds to a continuous quantity. -==== - -Let's look at the `Species` column of the Iris dataset, and the `MEDV` column of the boston dataset. Based on these columns, classify the type of machine learning problem that we would be solving with each dataset. - -Here's a trickier example: If we have an image of some handwritten text, and we want to predict what the text says, would we be solving a classification or regression problem? Why? - -.Deliverables -==== -- Would we likely be solving a classification or regression problem with the `Species` column of the Iris dataset? Why? -- Would we likely be solving a classification or regression problem with the `MEDV` column of the boston dataset? Why? -- Would we likely be solving a classification or regression problem with the handwritten text example? Why? -==== - -=== Question 3 (2 points) - -Another important distinction in machine learning is the difference between supervised and unsupervised learning. - -Supervised learning is the process of training a model on a labeled dataset. The model learns to map some input data to an output label based on examples in the training data. 
The Iris dataset is a great example of a supervised learning problem. Our dataset has columns such as `SepalLengthCm`, `SepalWidthCm`, `PetalLengthCm`, and `PetalWidthCm` that contain information about the flower. Additionally, it has a column labeled `Species` that contains the label we want to predict. From these columns, the model can associate the features of the flower with the labeled species. - -We can think of supervised learning as already knowing the answer to a problem, and working backwards to understand how we got there. For example, if we have a a banana, apple, and grape in front of us, we can look at each fruit and their properties (shape, size, color, etc.) to learn how to distinguish between them. We can then use this information to predict a fruit from just its properties in the future. - -For example, given this table of data: -[cols="3,3,3",options="header"] -|=== -| Color | Size | Label -| Yellow | Small | A -| Red | Medium | B -| Red | Large | B -| Yellow | Medium | A -| Yellow | Large | B -| Red | Small | B -|=== - -You should be able to describe a relationship between the color and size, and the resulting label. If you were told an object is yellow and extra large, what would you predict the label to be? - -[IMPORTANT] -==== -The projects in 30100 and 40100 will focus on supervised learning. From our dataset, there will be a single column we want to predict, and the rest will be used to train the model. The column we want to predict is called the label/target, while the remaining columns are called features. -==== - -Unsupervised learning is the process of training a model on an unlabeled dataset. As opposed to the model trying to predict an output variable, the model instead learns patterns in the data without any guidance. This is often used in clustering problems, eg. a store wants to group items based on how often they are purchased together. Examples of this can be seen commonly in recommendation systems (have you ever noticed how Amazon always seems to know what you want to buy?). - -If we had a dataset of fruits that users commonly purchase together, we could use unsupervised learning to create groups of fruits to recommend to users. We don't need to know the answer for what to recommend to the user beforehand; we are simply looking for patterns in the data. - -For example, given the following dataset of shopping carts: -[cols="3,3,3",options="header"] -|=== -| Item 1 | Item 2 | Item 3 -| Apple | Banana | Orange -| Apple | Banana | Orange -| Apple | Grape | Kiwi -| Banana | Orange | Apple -| Orange | Banana | Apple -| Cantelope | Watermelon | Honeydew -| Cantelope | Apple | Banana -|=== - -We could use unsupervised learning to recommend fruits to users right before they check out. If a user had an orange and banana in their cart, what fruit would we recommend to them? - - -.Deliverables -==== -- Predicted label for an object that is yellow and extra large in the table above. -- What fruit would we recommend to a user who has an orange and banana in their cart? -- Should we use supervised or unsupervised learning if we want to predict the `Species` of some data using the Iris dataset? Why? -==== - -=== Question 4 (2 points) - -Another important tradeoff in machine learning is the flexibility of the model versus the interpretability of the model. - -A model's flexibility is defined by its ability to capture complex relationships within the dataset. This can be anything from - -Imagine a simple function `f(x) = 2x`. 
This function is very easy to interpret, it simply doubles x. However, it is not very flexible, as doubling the input is all it can do. A piecewise function like `f(x) = { x < 5: 2x^2 + 3x + 4, x >= 5: 4x^2 - 7 }` is considered more flexible, because it can model more complex relationships. However it, becomes much more difficult to understand the relationship between the input and output. - -We can also see this complexity increase as we increase the number of variables. `f(x)` will typically be more interpretable than `f(x,y)`, which will typically be more interpretable than `f(x,y,z)`. When we get to a large number of variables, eg. `f(a,b,c,...,x,y,z)`, it can become difficult to understand the impact of each variables on the output. However, a function that captures all of these variables can be very flexible. - -Machine learning models can be imagined in the same way. Many factors, including the type of model and the number of features can impact the interpretability of the model. A function that can accurately capture the relationship between a large number of features and the target variable can be extremely flexible but not understandable to humans. A model that performs some simple function between the input and output may be very interpretable, but as the complexity of that function increases its interpretability decreases. - -An important concept in this regard is the curse of dimensionality. The general idea is that as our number of features (dimensions) increases, the amount of data needed to get a good model exponentially increases. Therefore, it is impractical to have an extreme number of features in our model. Imagine given a 2d function y=f(x). Given some points that we plot, we probably pretty quickly find an approximation of f(x). However, imagine we are given y=f(x1,x2,x3,x4,x5). We would need a lot more points to find an approximation of f(x1,x2,x3,x4,x5), and understand the relationship between y and each of the variables. -Just because we can have a lot of features in our model does not mean we should. - -[NOTE] -==== -`Black box` is a term often used to describe models that are too complex for humans to easily interpret. Large neural networks can be considered black boxes. Other models, such as linear regression, are easier to interpret. Decision Trees are designed to be interpretable, as they have a very simple structure and you can easily follow along with how they operate and make decisions. These easy to interpret models are often called `white box` models. -==== - -Please print the number of columns in the Iris dataset and the boston dataset. Based purely on the number of columns, would you expect a machine learning model trained on the Iris dataset to be more or less interpretable than a model trained on the boston dataset? Why? - -.Deliverables -==== -- How many columns are in the Iris dataset? -- How many columns are in the boston dataset? -- Based purely on the number of features, would you expect a machine learning model trained on the Iris dataset to be more or less interpretable than a model trained on the boston dataset? Why? -==== - -=== Question 5 (2 points) - -Parameterization is the idea of approximating a function or model using parameters. If we have some function `f`, and we have examples of `f(x)` for many different `x`, we can find an approximate function to represent `f`. To make this approximation, we will need to choose some function to represent `f`, along with the parameters of that function. 
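For instance, here is a small sketch (not part of the graded questions) of the simplest case: if we assume `f` is a straight line, the parameters are just a slope and an intercept, and they can be estimated from example `(x, f(x))` pairs. The sample points below are made up for illustration.

[source,python]
----
import numpy as np

# hypothetical example points, assumed to come from something close to f(x) = 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 2.0, 3.9, 6.1, 8.0])

# choose a parametrized form f(x) ~ a*x + b, then estimate the parameters a and b from the data
a, b = np.polyfit(x, y, deg=1)
print(a, b)  # a should come out close to 2, and b close to 0
----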
For complex functions, this can be difficult, as we may not understand the relationship between `x` and `f(x)`, or how many parameters are needed to represent this relationship. - -A non-parametrized model does not necessarily mean that the model does not have parameters. However, it means that we don't know how many of these parameters exist or how they are used before training. The model itself will work to figure out what parameters it needs while training on the dataset. This can be visualized with splines, which are a type of curve that can be used to approximate a function. There are also non-parametrized models such as K-Nearest Neighbors Regression, which do not have a fixed number of parameters, and instead learn the function from the data. - -If we have 5 points (x, y) and want to find a function to fit these points, through parameterization we would have a single function with multiple parameters that need to be adjusted to give us the best fit. However, with splines (a form of non-parametrization), we could create a piecewise function, where each piece is a linear function between two points. This function has no parameters, and is created by the model solely based on the data. You can https://the-examples-book.com/starter-guides/data-science/data-modeling/choosing-model/parameterization#splines-as-an-example-of-non-parameterization[read more about splines here]. - -A commonly used non-paramtrized model is k-nearest neighbors, which classifies points by comparing them to existing points in the dataset. In this way, the model does not have any parameters, but instead only learns from the data. - -Linear regression is a parametrized model, where a linear relationship between inputs and output(s) is assumed. The data is then used to identify the values of the parameters to best fit the data. - -[NOTE] -==== -If we already have a good understanding of the data, (eg. we know it to be some linear function or second order polynomial), it is likely best to choose a parametrized model. However, if we don't have an understanding of the data, a non-parametrized model that learns the function from the data may be a better fit. -==== - -To better understand the difference, please run the following code: -[source,python] ----- -import matplotlib.pyplot as plt - -a = [1, 3, 5, 7, 9, 11, 13] -b = [9, 6, 4, 7, 8, 15, 9] -x = [1, 2, 3, 4, 5, 6, 7] - -plt.scatter(x, a, label='Function A') -plt.scatter(x, b, label='Function B') -plt.legend() -plt.xlabel('Feature X') -plt.ylabel('Label y') -plt.show() ----- - -Based on the plots shown, decide if each function would be better approximated by a parametrized or non-parametrized model. - -.Deliverables -==== -- Can you easily describe the relationship between `Feature X` and `Label y` for Function A? If so, what is the relationship? Would you use a parametrized or non-parametrized model to approximate this function? -- Can you easily describe the relationship between `Feature X` and `Label y` for Function B? If so, what is the relationship? Would you use a parametrized or non-parametrized model to approximate this function? -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project2.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. 
**Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project3.adoc b/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project3.adoc deleted file mode 100644 index 4464708c2..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project3.adoc +++ /dev/null @@ -1,241 +0,0 @@ -= TDM 30100: Project 03 - Intro to ML - Data Preprocessing - -== Project Objectives - -Learn how to preprocess data for machine learning models. This includes one-hot encoding, scaling, and train-validation-test splitting. - -.Learning Objectives -**** -- Learn how to encode categorical variables -- Learn why scaling data is important -- Learn how to split data into training, validation, and test sets -**** - - -== Dataset - -- `/anvil/projects/tdm/data/fips/fips.csv` -- `/anvil/projects/tdm/data/boston_housing/boston.csv` - -== Questions - -=== Question 1 (2 points) - -The accuracy of a machine learning model depends heavily on the quality of the dataset used to train it. There are several issues you may encounter if you feed raw data into a model. We explore some of these issues in this project, as well as other necessary steps to format data for machine learning. - -The first step in preprocessing data for supervised learning is to split the dataset into input features and target variable(s). This is because, as you should have learned from project 2, supervised learning models require a dataset of input features and their corresponding target variables. By separating our dataset into these components, we are ensuring that our model is learning the relationship between the correct columns. - -Write code to load the fips dataset into a variable called 'fips_df' using pandas and separate it into 2 dataframes: one containing the input features and the other containing the target variable. Let's use the `CLASSFP` column as the target variable and the rest of the columns as input features. - -[NOTE] -==== -Typically, the dataset storing the target variable is denoted by `y` and the dataset storing the input features is denoted by `X`. This is not required, but it is a common convention that we recommend following. -==== - -To confirm your dataframes are correct, print the columns of each dataframe. - -.Deliverables -==== -- Load the fips dataset using pandas -- Separate the dataset into input features and target variable -- Print the column names of each dataframe -==== - -=== Question 2 (2 points) - -Label encoding is a technique used to change categorical variables into a number format that can be understood by machine learning models. This is necessary for models that require numerical input features (which often is the case). Another benefit of label encoding is that it can decrease the memory usage of the dataset (a single integer value as opposed to a string). 
- -The basic concept behind how it works is that if there are `n` unique category labels in a column, label encoding will assign a unique integer value to each category label from `0` to `n-1`. - -For example, if we have several colors we can encode them as follows: -|=== -| Color | Encoded Value - -| Red | 0 - -| Green | 1 - -| Blue | 2 - -| Yellow | 3 -|=== -where we have four (n=4) unique colors, so their encoded values range from 0 to 3 (n-1). - -[NOTE] -==== -Label encoding can lead the model to interpret the encoded values as having an order or ranking. In some cases, this is a benefit, such as encoding 'small', 'medium', and 'large' as 0, 1, and 2. However, this can sometimes lead to ordering that is not intended (such as our color example above). This is something to think about when deciding if label encoding is the right choice for a column or dataset. -==== - -Print the first 5 rows from the fips dataset. As you can see, the `CountyName` and `State` columns are categorical variables. If we were to use this dataset for a machine learning model, we would likely need to encode these columns into a numerical format. - -In this question, you will use the `LabelEncoder` class from the `scikit-learn` library to label encode the `CountyName` column from the dataset. - -Fill in and run the following code to label encode the input features that need to be encoded. (This code assumes your input features are stored in a variable called `X`.) -[source,python] ----- -from sklearn.preprocessing import LabelEncoder - -# create a LabelEncoder object -encoder = LabelEncoder() - -# create a copy of the input features to separate the encoded columns -X_label_encoded = X.copy() - -X_label_encoded['COUNTYNAME'] = encoder.fit_transform(X_label_encoded['COUNTYNAME']) ----- - -Now that you have encoded the `COUNTYNAME` column, print the first 5 rows of the X_label_encoded dataset to see the changes. What is the largest encoded value in the `COUNTYNAME` column (i.e., the number of unique counties)? - -[NOTE] -==== -You are not required to use the same variable names (X, X_label_encoded, etc.), but following this convention is strongly recommended. -==== - -.Deliverables -==== -- Print the first 5 rows of the X dataset before encoding -- Label encode the `COUNTYNAME` column in the fips dataset -- Print the first 5 rows of the X_label_encoded dataset after encoding -- Largest encoded value in the `COUNTYNAME` column -==== - -=== Question 3 (2 points) - -As we mentioned last question, label encoding can sometimes lead to undesired hierarchies or ordering with the model. A different encoding approach that alleviates this potential issue is one-hot encoding. Instead of simply assigning a unique integer value to each label, one-hot encoding will create a new binary column for each category label. The value in the binary column will be `1` if the category label is present in the original column, and `0` otherwise. By doing this, the model will not interpret these encoded values as being related, rather as completely separate features. - -To give an example, let's look at how we would use one-hot encoding for the color example in the previous question: -|=== -| Color | Red | Green | Blue | Yellow - -| Red | 1 | 0 | 0 | 0 - -| Green | 0 | 1 | 0 | 0 - -| Blue | 0 | 0 | 1 | 0 - -| Yellow | 0 | 0 | 0 | 1 -|=== -We have four unique colors, so one-hot encoding gives us four new columns to represent these colors. 
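As a quick illustration of the idea (a sketch on a hypothetical toy column, separate from the graded steps below), pandas' `get_dummies` function produces exactly this kind of table:

[source,python]
----
import pandas as pd

# hypothetical toy column of colors
colors = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Yellow"]})

# one new binary column per unique color (0/1 or True/False depending on your pandas version)
print(pd.get_dummies(colors["Color"]))
----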
- -The `scikit-learn` library also provides a `OneHotEncoder` class that can be used to one-hot encode categorical variables. In this question, you will use this class to one-hot encode the `STATE` column from the dataset. - -First, print the dimensions of the X dataset to see how many rows and columns are in the dataset before one-hot encoding. - -Run the following code to one-hot encode the input features that need to be encoded. (This code assumes your input features are stored in a variable called `X`.) -[source,python] ----- -from sklearn.preprocessing import OneHotEncoder - -# create a OneHotEncoder object -encoder = OneHotEncoder() - -# create a copy of the input features to separate the encoded columns -X_encoded = X.copy() - -# fit and transform the 'STATE' column -# additionally, convert the output to an array and then cast it to a DataFrame -encoded_columns = pd.DataFrame(encoder.fit_transform(X[['STATE']]).toarray()) - -# drop the original column from the dataset -X_encoded = X_encoded.drop(['STATE'], axis=1) - -# concatenate the encoded columns -X_encoded = pd.concat([X_encoded, encoded_columns], axis=1) ----- - -Now that you have one-hot encoded the `STATE` column, print the dimensions of the X_encoded dataset to see the changes. You should see the same number of rows as the original dataset, but with a large amount of additional columns for the one-hot encoded variables. Are there any concerns with how many columns were created (hint, think about memory size and the curse of dimensionality)? - -.Deliverables -==== -- How many rows and columns are in the X_encoded dataset after one-hot encoding? -- How many columns were created during one-hot encoding? -- What are some disadvantages of one-hot encoding? -- When would you use one-hot encoding over label encoding? -==== - -=== Question 4 (2 points) - -For this question, let's switch over to the Boston Housing dataset. Load the dataset into a variable called `boston_df`. Print the first 5 rows of the `CRIM`, `CHAS`, `AGE`, and `TAX` columns. Then, write code to find the mean and range of values for each of these columns. - -[NOTE] -==== -You can use `max` and `min` functions to find the maximum and minimum values in a column, respectively. For example, `boston_df['AGE'].max()` will return the maximum value in the `AGE` column. -==== - -Scaling is another important preprocessing step that is often necessary when working with machine learning models. There are many approaches to this, however the goal is to ensure that all features are on a similar scale. Two common techniques are normalization and standardization. Normalization adjusts feature so that all values fall between 0 and 1. Standardization adjusts features to a set mean (typically 0) and standard deviation (typically 1). This is important because many machine learning models are sensitive to the scale of the input features. If the input features are on different scales, the model may give more weight to features with larger values, which can lead to poor performance. - -As you may guess from the previous 2 questions, the `scikit-learn` library provides a `StandardScaler` class that can be used to scale input features. This class standardizes features to a mean of 0 and a standard deviation of 1. - -Run the following code to scale the columns in the Boston dataset. 
(This code assumes your dataframe is stored in a variable called `boston_df`)

[source,python]
----
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# scale the CRIM, CHAS, AGE, and TAX columns
X_scaled = scaler.fit_transform(boston_df[['CRIM', 'CHAS', 'AGE', 'TAX']])

# convert X_scaled back into a dataframe
X_scaled = pd.DataFrame(X_scaled, index=boston_df.index, columns=['CRIM', 'CHAS', 'AGE', 'TAX'])
----

Now that you have scaled the input features, print the mean and range of values for the 4 columns after scaling. You should see that the range of values for each column is now similar, and the mean is close to 0.

.Deliverables
====
- Mean and range of values for the `CRIM`, `CHAS`, `AGE`, and `TAX` columns before scaling.
- Mean and range of values for the `CRIM`, `CHAS`, `AGE`, and `TAX` columns after scaling.
- How did scaling the input features affect the mean and range of values?
====

=== Question 5 (2 points)

The final step in preprocessing data for machine learning is to split the dataset into training and testing sets. The training set is the data used to train the model, and the testing set is used to evaluate the model's performance after training.

[NOTE]
====
Often, a validation set is also created to help tune the parameters of the model. This is not required for this project, but you may encounter it in other machine learning projects.
====

Again, scikit-learn provides everything we need. The `train_test_split` function can be used to split the dataset into training and testing sets.

This function takes in the input features and target variable(s), along with the test size, and randomly splits the dataset into training and testing sets. The test size is the fraction of the dataset that will be used for testing. We can also set a random state to ensure reproducibility.

If we withhold too much data for testing, the model may not have enough data to learn from. However, if we withhold too little data, the model may become overfit to the training data, and the limited testing data may not be representative of the model's performance. Typically, a test size of 10-30% is used.

Using our `y` dataframe from Question 1, and the `X_encoded` dataframe from Question 3, split the dataset into training and testing sets. Run the following code to split the dataset.

[source,python]
----
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
----

[NOTE]
====
If we wanted to create a validation set, we could use the same function again to split the `X_train` and `y_train` datasets into training and validation sets.
====

Now that you have split the dataset, print the number of rows in the training and testing sets to confirm the split was successful.

.Deliverables
====
- Number of rows in the training and testing sets
====

== Submitting your Work

.Items to submit
====
- firstname_lastname_project3.ipynb
====

[WARNING]
====
You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission].
- -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project4.adoc b/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project4.adoc deleted file mode 100644 index dfdc7a804..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project4.adoc +++ /dev/null @@ -1,184 +0,0 @@ -= TDM 30100: Project 04 - Classifiers - Basics of Classification -:page-mathjax: true - -== Project Objectives - -In this project, we will learn about the basics of classification. We will explore some of the most common classifiers, and their strengths and weaknesses. - - -.Learning Objectives -**** -- Learn about the basics of classification -- Learn classification specific terminology -- Learn how to evaluate the performance of a classifier -**** - -== Supplemental Reading and Resources - -- https://deepai.org/machine-learning-glossary-and-terms/classifier[Machine Learning Glossary - Classifiers] - -== Dataset - -- `/anvil/projects/tdm/data/iris/Iris.csv` - -== Questions - -=== Question 1 (2 points) - -A classifier, as you may remember from Project 2, is a machine learning model that uses input features to classify the data. Classifiers can be used to determine if email is spam or not, or determine what kind of flower a plant is. We can split classifiers into 2 major categories: binary classification and multi-class classification. - -Binary classifiers are used when we want to classify binary outcomes, such as testing if a patient is sick or not. Multi-class classifiers are used when we want to classify more than 2 outcomes, such as a color, or a type of flower. - -[NOTE] -==== -Multi-label classifiers are a special case of multi-class classifiers, where multiple classes can be assigned to a single instance. For example, an image containing both a cat and dog would be classified as both a cat and a dog. These are commonly found in image recognition problems. -==== - -Pennsylvania State University has a great lesson on examples of classification problems. You can read about them https://online.stat.psu.edu/stat508/lessons/Lesson01#classification-problems-in-real-life[in section 1.5 here]. Please read through some of these examples, and then come up with your own real world examples of binary and multi-class classification problems. - -.Deliverables -==== -- What is a real world example of binary classification? -- What is a real world example of multi-class classification? -- Is email spam classification (spam or not spam) a binary or multi-class classification problem? -- Is digit recognition (determining numerical digits that are handwritten) a binary or multi-class classification problem? -==== - -=== Question 2 (2 points) - -There are many different classification models. In this course, we will go more in depth into the K-Nearest Neighbors (KNN) model, the Decision Tree model, and the Random Forest model. Each of these models has its own strengths and weaknesses, and is better suited for different types of data. There are many other classification models, and more methods are being developed all the time. 
We won't go into detail about these models in this project, but it is important to know that they exist and behave differently. - -GeeksforGeeks has a great article on different classification models and their strengths and weaknesses. You can read about them https://www.geeksforgeeks.org/advantages-and-disadvantages-of-different-classification-models/[here]. Please read through this article and then answer the following questions. - -.Deliverables -==== -- Can you name 3 other models that could be used for classification? -- Why is it important to understand the strengths and weaknesses of different classification models? -==== - -=== Question 3 (2 points) - -There are many metrics that can be used to evaluate the performance of a classifier. Some of the most common metrics are accuracy, precision, recall, and F1 score. - -In binary classification, there are 4 possible results from a classifier: - -True Positive: The classifier predicts the presence of a class, when the class is actually present -True Negative: The classifier correctly predicts the absence of a class, when the class is actually absent -False Positive: The classifier predicts the presence of a class, when the class is actually absent -False Negative: The classifier predicts the absence of a class, when the class is actually present - -Accuracy is simply the percentage of correct predictions made by the model. As we learned in Project 3, our data is split into training and testing sets. We can calculate the accuracy of our model by comparing the predicted values on the testing set to the actual values in the testing set. - -$ -\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} -$ - -Precision is a metric that tells us how many of the predictions of a certain class were actually correct. It is calculated by dividing the number of true positives by the number of true positives plus the number of false positives. - -$ -\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} -$ - -Recall is a metric that tells us how many of the actual instances of a certain class were predicted correctly. It is calculated by dividing the number of true positives by the number of true positives plus the number of false negatives. - -$ -\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} -$ - -Finally, the F1 score is the harmonic mean of precision and recall. It is calculated as 2 times the product of precision and recall divided by the sum of precision and recall. This metric is useful when we want to balance precision and recall. - -$ -\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} -$ - -Let's try an example. Given the following table of predictions and actual values, calculate the accuracy, precision, recall, and F1 score. - -[cols="3,3",options="header"] -|=== -|Actual Value |Predicted Value -|Positive |Positive -|Positive |Positive -|Negative |Positive -|Positive |Negative -|Positive |Positive -|Negative |Negative -|Negative |Positive -|Positive |Negative -|Positive |Positive -|Positive |Positive -|=== - -.Deliverables -==== -- Why is accuracy not always the best metric to evaluate the performance of a classifier? -- In your own words, what is the difference between precision and recall? -- Calculate the accuracy, precision, recall, and F1 score for the example above. 
-==== - -=== Question 4 (2 points) - -There are many applications of classification in the real world. One common application is in the medical field, where classifiers can be used to predict whether a patient has a certain disease based on their symptoms. Another application is in the financial industry, where classifiers can be used to predict whether a transaction is fraudulent or not. - -In more recent years, classifiers have been used in the field of image recognition. For example, classifiers can be used to determine whether an image contains a cat or a dog. More advanced classifiers, such as Haar cascades, can be used to detect faces in images by looking for patterns of light and dark pixels. - -In these uses, there often are privacy concerns associated with the data that is being used. If a company wants to develop a classifier to predict whether a transaction is fraudulent, they may need access to sensitive financial data of normal customers. In more recent times, generative image AIs have concerns about what images they were trained on, and if these artists should have their work used to train these models. - -Another issue to consider is bias within these datasets. If a model is trained on data biased towards a certain group, it may make incorrect predictions or reinforce existing biases. If a dataset contains a thousand images of cats and only 5 images of a frog, the classifier may be unable to accurately predict whether an image contains a frog, and may often times incorrectly classify images as cats. Another way bias can be found is in the training itself. A model may wind up relying on a single feature to make predictions, often times creating bias towards that feature (think race, age, income, nationality, etc). - -There are many ways to address bias in classifiers. Typically, the best way to start is to ensure that the training data is very diverse and representative of the real world. Collecting a large amount of data from a variety of sources helps to ensure that the data is not intrinsically biased. Regularization methods can be used to prevent the model from heavily relying on a single or a small number of features. Finally, fairness metrics and bias detection tools such as Google's "What-If" tool or IBM's "AI Fairness 360 (AIF360)" can be used post training to detect and mitigate biases in the model. - -.Deliverables -==== -- Can you think of any areas where there may be ethical concerns with using classifiers? -- Are there any image recognition applications that you interact with, on a daily basis? -==== - -=== Question 5 (2 points) - -Although classifiers are powerful tools, they are not without their limitations. One significant limitation is that classifiers rely heavily on the data they are trained with. If the training data is biased, incomplete, or not representative of the real world, the classifier may make incorrect predictions. - -Class imbalance is a common problem in classification, where one class has significantly more instances than another. This can lead to classifiers that are biased towards the majority class and perform poorly on the minority class. For example, if my dataset contains 99% cats and 1% dogs, a classifier may simply not have enough data to learn how to classify dogs correctly, and may often times incorrectly classify images as cats. - -An easy way to check our class balance is by creating a chart to visualize the distribution of classes in the dataset. To practice, please load the Iris dataset into a dataframe called `iris_df`. 
Then, run the below code to generate a pie chart displaying the class distribution. - -[source,python] ----- -import matplotlib.pyplot as plt - -# get the counts of the species column -column_counts = iris_df['Species'].value_counts() - -# graph the pie chart -column_counts.plot.pie(autopct='%1.1f%%') ----- - -*Are the classes in the Iris dataset balanced?* - -Feature engineering is another important aspect of machine learning. Feature engineering is the process of manually selecting or transforming input features in the dataset that are most relevant to the problem at hand. The more irrelevant features a classifier has to work with, the more likely it is to make incorrect predictions. - -A notable idea is the Pareto Principle (aka the 80/20 rule) is the idea that 80% of the effects can be attributed to 20% of the causes. This idea can be observed in a myriad of different situations and fields. In the context of our classification models, this theory says that 20% of our features are responsible for 80% of the predictive power of our model. By identifying what features are important, we can reduce our datasets dimensionality and make our models significantly more efficient and interpretable. - -One example of where features can be removed is in the case of multicollinearity. This is when a set of features are highly correlated with each other (i.e., the data for them is redundant). This can lead to overfitting, as the model cannot truly distinguish between the features. In this case, we can remove all but one of these correlated features to reduce our dataset's dimensionality while avoiding the problems of multicollinearity. - -We previously looked at encoding categorical variables in Project 3. There are many different ways to encode categorical variables, and the best method depends on the type of data and the model being used. This is an example of feature engineering, as we are transforming the data to a more suitable form for the model. - -.Deliverables -==== -- Are the classes in the Iris dataset balanced? -- What are some ways to address class imbalance in a dataset? -- Why is feature engineering important in classification? -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project4.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. 
====
diff --git a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project5.adoc b/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project5.adoc
deleted file mode 100644
index 996c5a3f0..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project5.adoc
+++ /dev/null
@@ -1,298 +0,0 @@
= TDM 30100: Project 05 - Classifiers - K-Nearest Neighbors (KNN) I
:page-mathjax: true

== Project Objectives

In this project, we will learn about the K-Nearest Neighbors (KNN) machine learning algorithm, develop it without the use of a library, and apply it to a small dataset.

.Learning Objectives
****
- Learn the mathematics behind a KNN
- Create a KNN
- Use KNN to classify data
****

== Dataset

- `/anvil/projects/tdm/data/iris/Iris.csv`

== Questions

=== Question 1 (2 points)

First, let's learn the basics of how a KNN works. A KNN operates by calculating the distance from the input features to all samples in its existing database, and performing a majority vote among the k closest samples to classify the input. If k=1, it simply chooses the class of the single closest sample. If k=3, it chooses the majority class among the 3 nearest samples. If there is ever a tie, the default behavior is to select a random class from the tied classes.

[NOTE]
====
This random selection during a tie is not ideal, but it is a simple way to handle the case. In the next project, we will explore a way to handle ties in a more sophisticated manner.
====

image::f24-301-p5-1.png[KNN Distance Calculation, width=792, height=500, loading=lazy, title="KNN Distance Calculation"]

Take the above example. Suppose we have some dataset containing 2 classes, represented by blue triangles and orange circles. If we have some unknown point (the green square), we can classify it by finding the k closest points to it and taking a majority vote. In this case, the 5 closest points are shown with dashed lines and labeled in order.

If k=1, what would the unknown point be classified as? If k=3, what would it be classified as? If k=5, what would it be classified as?

To think about this simply, let's look at an example with 2 input features. This dataset uses a hue and size to identify fruit.

[cols=4*]
|===
|#|Hue | Size| Output Variable
|1|22|1|Banana
|2|27|.9|Banana
|3|87|.05|Grape
|4|84|.03|Grape
|===

Given this dataset, we want to identify a fruit with Hue=24, Size=0.95.

To find the distance between 2D points, you can use the formula

$
\text{dist} = \sqrt{(X-X_0)^2 + (Y-Y_0)^2}
$

.Deliverables
====
- What class would the green square be classified as if k=1? If k=3? If k=5?
- Which point is our unknown fruit closest to? (put the #)
- What fruit should our unknown fruit be classified as, assuming k=1?
- What would happen if we set k=4?
====

=== Question 2 (2 points)

Now that we understand the basics of how a KNN works, let's create a KNN from scratch in Python.

We will still use pandas to load the dataset and scikit-learn to scale and split the data, but we will not use scikit-learn to create the KNN.

First, let's load the Iris dataset, separate the data into features and labels (hint: the Species column is our target variable), scale the input features, and split the data into training and testing sets (80% training, 20% testing).

[NOTE]
====
Please review your work from Project 1 and Project 3 if you need a refresher on how to import a dataset, and how to scale and split data.
If you did not complete project 3, please read the https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler.fit_transform[StandardScaler documentation] and the https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split[train_test_split documentation], or ask a TA for help during office hours. -==== - - -[source,python] ----- -# Import libraries -import pandas as pd -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import train_test_split - -# load the dataframe into `df` -'''YOUR CODE TO LOAD THE DATAFRAME''' - -# separate the data into input features 'X' and output variable 'y'. Be sure to remove the 'Id' column from the input features -'''YOUR CODE TO SEPARATE THE DATA''' - -# scale the input features into `X_scaled` -'''YOUR CODE TO SCALE THE INPUT FEATURES''' - -# split the data into training 'X_train' and 'y_train' and testing 'X_test' and 'y_test' sets. Use a test size of 0.2 and random state of 42 -'''YOUR CODE TO SPLIT THE DATA''' ----- -[NOTE] -==== -train_test_split returns 4 variables in the order X_train, X_test, y_train, y_test. Although we provided pandas dataframes, the train_X and test_X variables will be numpy arrays. However, the y_train and y_test variables will remain pandas series. This may cause confusion in future code, so it may be helpful to convert the pandas series to numpy arrays using their `.to_numpy()` function. For example, `y_train = y_train.to_numpy()`. -==== - -*Please print the first 5 rows of the testing input features to confirm whether your data is processed correctly.* - -.Deliverables -==== -- Output the first 5 rows of the testing input features -==== - -=== Question 3 (2 points) - -Now that we have our data loaded, scaled, and split, let's start working on creating a KNN from scratch. - -Over the next 3 questions, we will fill in functions in the KNN class below that are needed to classify new data points and test the model. - -[source,python] ----- -''' -class : `KNN` -init inputs : `X_train` (list[list[float]]), `y_train` (list[str]) - -description : This class stores the training data and classifies new data points using the KNN algorithm. -''' -class KNN: - def __init__(self, X_train, y_train): - self.features = X_train - self.labels = y_train - - def train(self, X_train, y_train): - self.features = X_train - self.labels = y_train - - def euc_dist(self, point1, point2): - '''YOUR CODE TO CALCULATE THE EUCLIDEAN DISTANCE''' - pass - - def classify(self, new_point, k=1): - '''YOUR CODE TO CLASSIFY A NEW POINT''' - pass - - def test(self, X_test, y_test, k=1): - '''YOUR CODE TO TEST THE MODEL''' - pass ----- - -First, let's fill in the `euc_dist` function that calculates the Euclidean distance between two n-dimensional points. The formula for the Euclidean distance between two points is - -$ -\text{dist} = \sqrt{(X_1-X_2)^2 + (Y_1-Y_2)^2 + ... + (Z_1-Z_2)^2} -$ - -where X, Y, Z, etc. are the n-dimensional coordinates of the two points. - -We can imagine each row in our dataset as a point in n-dimensional space, where n is the number of input features. The Euclidean distance between two points is the straight-line distance between them. It can be difficult to visualize in higher dimensions, but the formula remains the same. - -The inputs for this function are `point1` and `point2`, which are each rows from our dataset. 
The output should be the float value of the Euclidean distance between the two points. - -[NOTE] -==== -With pandas dataframes, you can perform operations between rows. For example, if you have `row1` and `row2`, you can calculate the difference between them by running `row1 - row2`. This will return a new row with the differences between the two rows. This will be useful for calculating the Euclidean distance between two points. -==== - -One thing that you should learn how to do is test functions that you write. Instead of creating the whole KNN and making sure the code works at the very end, it is important to test each piece of code as we right it. We can create test cases to see if our function is working as expected. Some test cases have been provided to you below. For this function, please create 2-3 test cases of your own to ensure that your function works as expected. - -[NOTE] -==== -In python, we can use the `assert` statement for test cases. If we assert an expression that results in true, the code will continue like nothing happened. However, if the expression results in false, we will receive an `AssertionError`, notifying us that our function is not working as expected. -==== - -[source,python] ----- -import numpy as np -# make a knn object -knn = KNN(X_train, y_train) -# test the euc_dist function -assert knn.euc_dist(np.array([1,2,3]), np.array([1,2,3])) == 0 -assert knn.euc_dist(np.array([1,2,3]), np.array([1,2,4])) == 1 -assert knn.euc_dist(np.array([0,0]), np.array([3,4])) == 5 -# your test cases here: - ----- - -*To test that your function works, calculate the Euclidean distance between the first two rows of the training input features by running the code below.* - -[source,python] ----- -# make a knn object -knn = KNN(X_train, y_train) -print(knn.euc_dist(X_train[0], X_train[1])) ----- - -.Deliverables -==== -- Your own test cases for the `euc_dist` function -- Output of calculating the euclidean distance between the first two rows of the training input features -==== - -=== Question 4 (2 points) - -Now that we have a function to calculate the Euclidean distance between two points, let's work on the `classify` function, which will classify a new point using the KNN algorithm. - -To classify a point, we need to calculate the Euclidean distance between the new point and all points in the training data. Then, we can find the `k` closest points and take a majority vote to classify the new point. - -Fill in the `classify` function to classify a new point using the KNN algorithm. If there is a tie, randomly select a class. - -[IMPORTANT] -==== -Since our features and labels are stored in separate variables, it is recommended that you use the `zip` function to iterate over both lists simultaneously. For example, given A=[1,2,3,4] and B=[5,6,7,8], you can use zip(A,B) to create a list [(1,5), (2,6), (3,7), (4,8)]. This will allow you to repackage the features and labels into a single list. -==== - -[NOTE] -==== -To find the `k` closest points, we recommend you to use the `sorted` function with a lambda function as the key. For example, to sort a list in ascending order, you can run `sorted(list, key=lambda x: 'some function involving element x')`. This lambda essentially says for each element x in the list, get a value by running some function and sort based on that value. Another hint is that the 'some function involving element x' should be a function you wrote in the last question... -==== - -Below is some pseudocode to help you get started on the `classify` function. 
-[source,python] ----- -def classify(self, new_point, k=1): - # combine features and labels into a single list - ### YOUR CODE HERE ### - - # sort the list by the euclidean distance between each point and the new point - ### YOUR CODE HERE ### - - # get the k closest points - ### YOUR CODE HERE ### - - # get the labels of the k closest points - ### YOUR CODE HERE ### - - # find the majority class - ### YOUR CODE HERE ### ----- - - -*To test that your function works, classify the first row of the testing input features using the KNN algorithm with k=3 by running the code below. You should get a classification of `Iris-versicolor`* - -[source,python] ----- -# make a knn object -knn = KNN(X_train, y_train) -print(knn.classify(X_test[0], k=3)) ----- - -.Deliverables -==== -- Classification of the first row of the testing input features using the KNN algorithm with k=3 -==== - -=== Question 5 (2 points) - -Now that we are able to classify a single point, let's work on the `test` function, which will test the model on a dataframe of input features and output variables. - -For this function, we simply need to iterate over all points in our input features, classify each point, and compare their classification to the actual output variable. We can then calculate the accuracy of our model by dividing the number of correct classifications by the total number of classifications. - -Below is some pseudocode to help you get started on the `test` function. -[source,python] ----- -def test(self, X_test, y_test, k=1): - # for each point in X_test - ### YOUR CODE HERE ### - # classify the point - ### YOUR CODE HERE ### - - # compare the classification to the actual output variable - # if the classification is correct, increment a counter - ### YOUR CODE HERE ### - - # calculate and return the accuracy of the model - ### YOUR CODE HERE ### ----- -*To test that your function works, test the model on the testing input features and output variables using the KNN algorithm with k=1 by running the code below. You should get an accuracy of 0.9666666666666667* - -[source,python] ----- -# make a knn object -knn = KNN(X_train, y_train) -print(knn.test(X_test, y_test, k=1)) ----- - -.Deliverables -==== -- Accuracy of the model on the testing input features and output variables using the KNN algorithm with k=1 -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project5.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project6.adoc b/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project6.adoc deleted file mode 100644 index 904d66c3e..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project6.adoc +++ /dev/null @@ -1,236 +0,0 @@ -= TDM 30100: Project 06 - Classifiers - K-Nearest Neighbors (KNN) II - -== Project Objectives - -In this project, we will learn about more advanced techniques for K-Nearest Neighbors (KNN), and continue building our KNN from scratch. - -.Learning Objectives -**** -- Learn about feature engineering -- Learn about better ways to handle ties in KNN -**** - -== Dataset - -- `/anvil/projects/tdm/data/iris/Iris.csv` - -== Questions - -=== Question 1 (2 points) - -In the previous project, we developed a KNN class that is able to classify new data points. If you completed the previous project, you should have a basic understanding of how a KNN works. - -In this question, we will briefly recap last project's code and concepts. Please run the following code to load the Iris dataset, scale the input features, and split the data into training and testing sets. - -[source,python] ----- -import pandas as pd -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import train_test_split - -df = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv') -X = df.drop(['Species','Id'], axis=1) -y = df['Species'] - -scaler = StandardScaler() -X_scaled = scaler.fit_transform(X) - -X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42) - -y_train = y_train.to_numpy() -y_test = y_test.to_numpy() ----- - -Then, please run the following code to import the KNN class. If you did the previous project, please use your own KNN class. If you did not complete the previous project, please use the following code to import the KNN class. -[source,python] ----- -class KNN: - def __init__(self, X_train, y_train): - self.features = X_train - self.labels = y_train - - def train(self, X_train, y_train): - self.features = X_train - self.labels = y_train - - def euc_dist(self, point1, point2): - # short 1 line approach - return sum((point1 - point2) ** 2) ** 0.5 - - def classify(self, new_point, k=1): - # sort the combined list by the distance from the point in the list to the new point - nearest_labels = [x[1] for x in sorted(zip(self.features, self.labels), key=lambda x: self.euc_dist(x[0], new_point))[:k]] - return max(set(nearest_labels), key=nearest_labels.count) - - def test(self, X_test, y_test, k=1): - # short 1 line approach, efficent list comprehension - return sum([1 for p in zip(X_test, y_test) if self.classify(p[0],k=k)==p[1]])/len(X_test) ----- - -To review the concepts of the KNN algorithm, please answer the following questions. - -.Deliverables -==== -- What is the purpose of the `train` function in the KNN class? -- How does a KNN pick which k neighbors to use when classifying a new point? -- How does a KNN handle ties when classifying a new point? -==== - -=== Question 2 (2 points) - -To review, the KNN works entirely by calculating the Euclidean distance between points in n-dimensional space. This means that scaling our input features is very important, as features with larger scales will have a larger impact on the distance calculation. - -However, uniformly scaling our features may not be the best approach. 
If we wanted to identify the difference between a red apple and a green apple, the most important feature would be the color of the apple. Therefore, we would want to scale the color feature more than the size feature. - -[NOTE] -==== -This concept of manually assigning the importance of features is an example of feature engineering. We can use existing knowledge (or often times intuition) to determine how important each feature should be in the model. This can greatly improve our model's performance if done right, and can also often lead to a more interpretable model. -==== - -Let's create a new function inside the KNN class that will calculate the euclidean distance between two points, but will take a list of weights to determine how important each feature is. This will allow us to scale the distance between two points based on the importance of each feature. - -[source,python] ----- -def scaled_distance(self, point1, point2, weights): - # firstly, scale weights so they sum to 1 (each weight should be a fraction of the total 1) - ''' YOUR CODE HERE ''' - - # then, scale the 2 points by the weights (multiply each feature in the point by the corresponding weight) - ''' YOUR CODE HERE ''' - - # finally, calculate and return the euclidean distance between the 2 points (use the existing euc_dist function) - ''' YOUR CODE HERE ''' - pass ----- - -*To ensure your implementation is correct, run the below code to calculate the scaled distance between the first row of the training set and the second row of the testing set. The printed values should be 1.4718716551800963 for euc_dist and 0.15147470763642634 for scaled_distance.* - -[source,python] ----- -knn = KNN(X_train, y_train) -print(knn.euc_dist(X_train[0], X_test[1])) -print(knn.scaled_distance(X_train[0], X_test[1], [1,1,1,10])) ----- - -.Deliverables -==== -- Output of running the sample code to confirm correct implementation of the scaled_distance function -- What does the distance decreasing when we raised the weight of the last feature mean? -==== - -=== Question 3 (2 points) - -Now that we have code to scale the distance between two points based on the importance of each feature, let's write two functions inside the KNN class to classify a point using weights, and to test the model using weights. - -[NOTE] -==== -These functions will be extremely similar to the existing classify and test functions, but use the scaled_distance function instead of the euc_dist function. -==== - -[source,python] ----- -def classify_weighted(self, new_point, k=1, weights=None): - ''' If weights == None, run the existing classify function ''' - - # now, write the classify function using the scaled_distance function - ''' YOUR CODE HERE ''' - -def test_weighted(self, X_test, y_test, k=1, weights=None): - ''' YOUR CODE TO TEST THE MODEL ''' - pass ----- - -*To test that your functions work, please run the below code to calculate the accuracy of the model with different weights. 
Your accuracies should be 0.9666666666666667, 0.9666666666666667, and 0.8333333333333334, respectively.*

[source,python]
----
knn = KNN(X_train, y_train)
print(knn.test_weighted(X_test, y_test, k=1, weights=[1,1,1,1]))
print(knn.test_weighted(X_test, y_test, k=1, weights=[1,1,1,10]))
print(knn.test_weighted(X_test, y_test, k=1, weights=[10,1,1,1]))
----

.Deliverables
====
- Accuracy of the model on the testing input features and output variables using the KNN algorithm with k=1 and weights=[1,1,1,1]
- Accuracy of the model on the testing input features and output variables using the KNN algorithm with k=1 and weights=[1,1,1,10]
- Accuracy of the model on the testing input features and output variables using the KNN algorithm with k=1 and weights=[10,1,1,1]
- Does the accuracy of the model change when we change the weights? Why or why not?
====

=== Question 4 (2 points)

One potential limitation of the KNN is that we are simply selecting the class based on the majority of the k nearest neighbors. Suppose we attempt to classify some point with k=3, and this results in finding 2 neighbors of class A and 1 neighbor of class B. In this case, the KNN would classify the point as class A. However, what if the 2 neighbors of class A are very far away from our new point, while the class B neighbor is extremely close? It would probably make more sense to classify the point as class B.

Additionally, suppose our dataset is unbalanced. We may have hundreds of examples of class A in our dataset, but only a few examples of class B. In this case, it is very likely that the KNN will classify points as class A, even if they are closer to class B neighbors.

To address this limitation, a common modification to the KNN is to weight the k-nearest neighbors based on their distance to the new point. This means that closer neighbors will have a larger impact on the classification than farther neighbors. Although this is more computationally expensive, it creates a much more robust model.

Implement a new function inside the KNN class that classifies a new point using distance-weighted neighbors. This function should work similarly to the classify function, but should return the class based on the average distance of each class, as opposed to a simple majority vote.

[source,python]
----
def classify_distance(self, new_point, k=1, weights=None):
    # follow the same approach as the classify function; however, for each nearest neighbor, we need to save both the label and the distance
    # nearest_labels = [(label, distance), ... k times]
    ''' YOUR CODE HERE '''

    # now, we need to select the class based on each distance, not just the label
    # we can find the average distance of each class and select the class with the smallest average distance
    ''' YOUR CODE HERE '''
----

[NOTE]
====
It is recommended to use `defaultdict` from the `collections` module to initialize a dictionary with a default value of a list. This will allow you to append to the list without checking if the key exists.
====

*To test that your function works properly, we will classify a test point at several k values. Run the code below to check your implementation.
The output should be 'Iris-versicolor', 'Iris-versicolor', and 'Iris-virginica' respectively.* - -[source,python] ----- -knn = KNN(X_train, y_train) -print(knn.classify_distance(X_test[8], k=5, weights=None)) -print(knn.classify_distance(X_test[8], k=7, weights=None)) -print(knn.classify_distance(X_test[8], k=9, weights=None)) ----- - -[NOTE] -==== -If you print some debugging information inside the function, you should see that even though at k=9 there are more 'Iris-versicolor' neighbors, the average distance of the 'Iris-virginica' neighbors is smaller and therefore is selected. -==== - -.Deliverables -==== -- Classification test at k=5, 7, and 9. -- Explanation of why the classification changes when we change the k value -- What do you think happens if we set k to the number of training points? -==== - -=== Question 5 (2 points) - -In this project you have learned about feature engineering, feature importance scaling, and different ways to handle ties in KNN. - -Based on what you have learned about KNNs, please answer the following questions. - -.Deliverables -==== -- What is the purpose of feature engineering in machine learning? -- Why is it important to scale input features in KNN? -- What are the advantages and disadvantages of the two approaches to handling ties in KNN? -- What are limitations of the KNN algorithm? -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project6.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project7.adoc b/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project7.adoc deleted file mode 100644 index b8b3d8005..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project7.adoc +++ /dev/null @@ -1,219 +0,0 @@ -= 301 Project 07 - Classifiers - Decision Trees - -== Project Objectives - -In this project, we will learn about Decision Trees and how they classify data. We will use the Iris dataset to classify the species of Iris flowers using Decision Trees. - -.Learning Objectives -**** -- Learn how a Decision Tree works -- Implement a Decision Tree classifier using scikit-learn -**** - -== Supplemental Reading and Resources - -- https://scikit-learn.org/stable/modules/tree.html[Scikit-learn Decision Trees Article] - -== Dataset - -- `/anvil/projects/tdm/data/iris/Iris.csv` - -== Questions - - -=== Question 1 (2 points) - -Decision Trees are a supervised learning algorithm that can be used for regression and/or classification problems. They work by splitting the data into subsets depending on the depth of the tree and the features of the data. 
Its goal in doing this is to create simple decision making rules to help classify the data. Because of this, Decision Trees are very easy to interpret, and often used in problems where interpretability is important. - -These trees can be easily visualized in a format similar to a flowchart. For example, if we want to classify some data point as a dog, horse, or pig, a Decision Tree may look like this: - -image::f24-301-p7-1.PNG[Example Decision Tree, width=792, height=500, loading=lazy, title="Example Decision Tree"] - -In the above example, then we start at the root node. We then follow each condition until we reach a leaf node, which gives us our classification. - -[NOTE] -==== -In the above example, there is only one condition per node. However, in practice, there can be an unlimited number of conditions per node. These are parameters that can be adjusted when creating the Decision Tree. More conditions in one node can make the tree more complex and potentially more accurate, but it may lead to overfitting and will be harder to interpret. -==== - -Suppose we have some dataset: - -[cols="3*"] -|=== -|Temp | Size | Target -|300 | 1 | A -|350 | 1.1 | A -|427 | 90 | A -|1200 | 1.3 | B -|530 | 1.2 | B -|500 | 20 | C -|730 | 2.1 | B -|640 | 14 | C -|830 | 15.4 | C -|=== - -Please fill in the blanks for the Decision Tree below: - -image::f24-301-p7-1-2.PNG[Example Decision Tree, width=792, height=500, loading=lazy, title="Example Decision Tree"] - - -.Deliverables -==== -- Answers for the blanks in the Decision Tree. (Please provide the number corresponding to each blank, shown in the top left corner of each box.) -==== - -=== Question 2 (2 points) - -For this question we will use the Iris dataset. As we normally do for the classification section, please load the dataset, scale it, and split it into training and testing sets using the below code. - -[source,python] ----- -import pandas as pd -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import train_test_split - -df = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv') -X = df.drop(['Species','Id'], axis=1) -y = df['Species'] - -scaler = StandardScaler() -X_scaled = scaler.fit_transform(X) - -X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=20) - -y_train = y_train.to_numpy() -y_test = y_test.to_numpy() ----- - -We can create a Decision Tree classifier using scikit-learn's `DecisionTreeClassifier` class. When constructing the class, there are several parameters that we can set to control their behavior. Some examples include: - -- `criterion`: The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. -- `max_depth`: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than `min_samples_split` samples. -- `min_samples_split`: The minimum number of samples required to split an internal node. -- `min_samples_leaf`: The minimum number of samples required to be at a leaf node. - -In this project, we will explore how these parameters affect our Decision Tree classifier. To start, let's create a Decision Tree classifier with the default parameters and see how it performs on the Iris dataset. 
- -[source,python] ----- -from sklearn.tree import DecisionTreeClassifier -from sklearn.metrics import accuracy_score - -parameters = { - "max_depth": None, - "min_samples_split": 2, - "min_samples_leaf": 1 -} - -decision_tree = DecisionTreeClassifier(random_state=20, **parameters) -decision_tree.fit(X_train, y_train) - -y_pred = decision_tree.predict(X_test) -accuracy = accuracy_score(y_test, y_pred) - -print(f'Model is {accuracy*100:.2f}% accurate with parameters {parameters}') ----- - -.Deliverables -==== -- Output of running the above code to get the model's accuracy -==== - -=== Question 3 (2 points) - -Now that we have created our Decision tree, let's look at how we can visualize it. Scikit-learn provides a function called `plot_tree` that can be used to visualize the Decision Tree. This relies on the `matplotlib` library to plot the tree. The following code can be used to plot a given Decision Tree: - -[NOTE] -==== -The `plot_tree` function has several parameters that can be set to control the appearance of the tree. A full list of parameters can be found (here)[https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html#sklearn.tree.plot_tree]. -==== - -[source,python] ----- -from sklearn.tree import plot_tree -import matplotlib.pyplot as plt - -plt.figure(figsize=(20,10)) -plot_tree(decision_tree, feature_names=X.columns, class_names=decision_tree.classes_, filled=True, rounded=True) ----- - -After running this code, a graph should be generated showing how the decision tree makes decisions. Leaf nodes (nodes with no children, ie. the final decision) should have 4 values in them, whereas internal nodes (nodes with children, often called decision nodes) contain 4 values and a condition. These 4 values are as follows: - -- criterion score (in this case, gini): The score of the criterion used to split the node. This is a measure of how well the node separates the data. For gini, a score of 0 means the node contains only one class, and higher scores mean that the potential classes are more mixed. -- samples: The number of samples that fall into that node after following the decision path. -- value: An array representing the number of samples of each class that fall into that node. -- class: The class that the node would predict if it were a leaf node. - -Additionally, you can see that every box has been colored. This is done to help represent the class that the node would predict if it were a leaf node, determined by the 'value' array. As you can see, leaf nodes are a single pure color, while decision nodes may be a mix of colors (see the first decision node and the furthest down decision node). - -.Deliverables -==== -- Output of running the above code -- Based on how the tree is structured, what can we say about how similar each class is to each other? Is there a class that differs significantly from the others? -==== - -=== Question 4 (2 points) - -The first parameter we will investigate is the 'max_depth' parameter. This parameter controls how nodes are expanded throughout the tree. A larger max_depth will let the tree make more complex decisions but may lead to overfitting. - -Write a for loop that will iterate through a range of max_depth values from 1 to 10 (inclusive) and store the accuracy of the model for a given max_depth in a list called 'accuracies'. Then, run the code below to plot the accuracies. 
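One possible shape for this loop, reusing the classifier setup from Question 2, is sketched below; treat it as a starting point rather than the required solution. The plotting code then follows.

[source,python]
----
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# sweep max_depth from 1 to 10 (inclusive), keeping other parameters at their defaults
accuracies = []
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=20, max_depth=depth)
    model.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, model.predict(X_test)))
----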
- -[source,python] ----- -import matplotlib.pyplot as plt - -plt.plot(range(1, 11), accuracies) -plt.xlabel('Max Depth') -plt.ylabel('Accuracy') -plt.title('Accuracy vs Max Depth') ----- - -*As we increase the max_depth, what happens to the accuracy of the model? What is the smallest max_depth that gives the maximum accuracy?* - -For now, let's assume that this smallest max_depth for maximum accuracy is the best parameter to use for our model. Please display the decision trees for a max_depth of 1, this optimal max_depth, and a max_depth of 10. - -.Deliverables -==== -- Code that creates the 'accuracies' list -- Output of running the above code -- As we increase the max_depth, what happens to the accuracy of the model? What is the smallest max_depth that gives the maximum accuracy? -- Decision Trees for max_depth of 1, the optimal max_depth, and a max_depth of 10 -- What can we say about the complexity of the tree as max_depth increases? Does a high max_depth lead to uninterpretable trees, or are they still easy to follow? -==== - -=== Question 5 (2 points) - -In addition to the importance of the 'max_depth' parameter, the 'min_samples_split' and 'min_samples_leaf' parameters also have a profound effect on the Decision Tree. These parameters control, respectively, the minimum number of samples at a node to be allowed to split, and the minimum number of samples that a leaf node must have. When these values are left at their default values (2 and 1, respectively), the Decision Tree is allowed to continue splitting nodes until there is only a single sample in each leaf node. This easily leads to overfitting, as the model has created a path for the exact training data, rather than a general rule for the dataset. - -In this question, we will do something similar to what we did in the previous question, however we will do it for both the 'min_samples_split' and 'min_samples_leaf' parameters. For each parameter, we will iterate through a range of values from 2 to the size of our training data (inclusive) and store the accuracy of the model for a given value in a list called 'split_accuracies' and 'leaf_accuracies' respectively. Leave the value for the other parameter at its default. Then, run the code below to plot the accuracies. - -[source,python] ----- -plt.plot(range(2, len(X_train)), split_accuracies) -plt.plot(range(2, len(X_train)), leaf_accuracies) -plt.xlabel('Parameter Value') -plt.ylabel('Accuracy') -plt.legend(['Min Samples Split', 'Min Samples Leaf']) -plt.title('Accuracy vs Split and Leaf Parameter Values') ----- - -.Deliverables -==== -- Code that creates the 'split_accuracies' and 'leaf_accuracies' lists -- Output of running the above code -- What can we say about the effect of the 'min_samples_split' and 'min_samples_leaf' parameters on the accuracy of the model? What values of these parameters would you recommend using for this model? -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project7.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. 
- -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project8.adoc b/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project8.adoc deleted file mode 100644 index 7d69ee41c..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project8.adoc +++ /dev/null @@ -1,239 +0,0 @@ -= 301 Project 08 - Classifiers - Decision Tree Ensembles - -== Project Objectives - -In this project, we will be learning about Extra Trees and Random Forests, two popular ensemble models utilizing Decision Trees. - -.Learning Objectives -**** -- Learn how Extra Trees and Random Forests work -- Implement Extra Trees and Random Forests in scikit-learn -**** - -== Supplemental Reading and Resources - -- https://scikit-learn.org/stable/modules/ensemble.html[Scikit-learn Ensemble Learning Article] - -== Dataset - -- `/anvil/projects/tdm/data/iris/Iris.csv` - -== Questions - - -=== Question 1 (2 points) - -In the last project, we learned about Decision Trees. As a brief recap, Decision Trees are a type of model that classify data based on a series of conditions. These conditions are found during training, where the model will attempt to split the data into groups that are as pure as possible (100% pure being a group of datapoints that only contains a single class). As you may remember, one fatal downside of Decision Trees is how prone they are to overfitting. - -Extra Trees and Random Forests help address this downside by creating an ensemble of multiple decision trees. - -To review how a decision tree works, please classify the following three data points using the below decision tree. - -[cols="3,3,3",options="header"] -|=== -| Hue | Weight | Texture -| 10 | 150 | Smooth -| 25 | 200 | Rough -| 10 | 150 | Fuzzy -|=== - -image::f24-301-p8-1.png[Example Decision Tree, width=792, height=500, loading=lazy, title="Example Decision Tree"] - -Ensemble methods work by creating multiple models and combining their results, but they all do it slightly differently. - -Random Forests work by creating multiple Decision Trees, each trained on a "bootstrapped dataset". This concept of bootstrapping allows the model to turn the original dataset into many slightly different datasets, resulting in many different models. A common and safe method for these bootstrapped datasets is to create a dataset the same size as the original dataset, but allow for resampling the same point multiple times. - -Extra Trees work in a somewhat similar manner. Instead of using the entire dataset to train each tree, however, Extra Trees will only select a random subset of features and data to train each tree. This leads to a more diverse set of trees, which helps reduce overfitting. Additionally, since features may be excluded from some trees, it can help reduce the impact of noisy features and lead to more robust classification splits. 
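To make these sampling ideas concrete, here is a small illustrative sketch (not part of the project code) of drawing a bootstrapped sample with replacement and selecting a random subset of feature columns from a made-up toy array:

[source,python]
----
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 10, 4
data = rng.normal(size=(n_rows, n_features))  # a toy dataset

# bootstrap: sample row indices with replacement, same size as the original data
bootstrap_rows = rng.integers(0, n_rows, size=n_rows)
bootstrap_sample = data[bootstrap_rows]

# feature subsetting: pick a random subset of the columns (here 2 of the 4)
feature_subset = rng.choice(n_features, size=2, replace=False)
reduced_sample = bootstrap_sample[:, feature_subset]

print(bootstrap_rows)    # some indices repeat, others never appear
print(feature_subset)
----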
- -When making a prediction, each tree in the ensemble will make a prediction, and the final prediction will be the majority vote of all the trees (similar to our KNN) - -If we had the following Random Forest, what classification would the forest make for the same three data points? - -image::f24-301-p8-2.png[Example Random Forest, width=792, height=500, loading=lazy, title="Example Random Forest"] - -.Deliverables -==== -- Predictions of the 3 data points using the Decision Tree -- Predictions of the 3 data points using the Random Forest -==== - -=== Question 2 (2 points) - -Creating a Random Forest in scikit-learn is very similar to creating a Decision Tree. The main difference is that you will be using the `RandomForestClassifier` class instead of the `DecisionTreeClassifier` class. - -Please load the Iris dataset, scale it, and split it into training and testing sets using the below code. -[source,python] ----- -import pandas as pd -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import train_test_split - -df = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv') -X = df.drop(['Species','Id'], axis=1) -y = df['Species'] - -scaler = StandardScaler() -X_scaled = scaler.fit_transform(X) - -X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=20) - -y_train = y_train.to_numpy() -y_test = y_test.to_numpy() ----- - -Creating a random forest in scikit is just as simple as creating a decision tree. You can create a random forest using the below code. - -[source,python] ----- -from sklearn.ensemble import RandomForestClassifier - -forest = RandomForestClassifier(n_estimators=100, random_state=20) -forest.fit(X_train, y_train) ----- - -Random forests have 1 additional parameter compared to decision trees, `n_estimators`. This parameter simply controls the number of trees in the forest. The more trees you have, typically the more robust your model will be. However, having more trees leads to longer training and prediction times, so you will need to find a balance. - - -Let's see how it performs with 100 n_estimators by running the below code. - -[source,python] ----- -from sklearn.metrics import accuracy_score -y_pred = forest.predict(X_test) -accuracy = accuracy_score(y_test, y_pred) - -print(f'Model is {accuracy*100:.2f}% accurate') ----- - -If you remember from the previous project, one of the benefits of Decision Trees is their interpretability, and the ability to display them to understand how they are working. In a large random forest, this is not quite as easy considering how many trees are in the forest. However, you can still display individual trees in the forest by accessing them in the `forest.estimators_` list. Please run the below code to display the first tree in the forest. - -[source,python] ----- -from sklearn.tree import plot_tree -import matplotlib.pyplot as plt - -plt.figure(figsize=(10,7)) -plot_tree(forest.estimators_[0], filled=True) ----- - -Since we are able to access individual trees in the forest, we can also simply use a single tree in the forest to make predictions. This can be useful if you want to understand how a single tree is making predictions, or if you want to see how a single tree is performing. 
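As a brief aside (not required for the deliverables), each entry of `forest.estimators_` is itself a fitted `DecisionTreeClassifier`, so a single tree can make predictions on its own. A minimal sketch, assuming the `forest`, `X_test`, and `y_test` objects defined above, might look like this:

[source,python]
----
from sklearn.metrics import accuracy_score

single_tree = forest.estimators_[0]

# the individual trees are trained on encoded class indices, so map their raw
# predictions back to the original labels through forest.classes_
single_pred = forest.classes_[single_tree.predict(X_test).astype(int)]
print(f'First tree alone is {accuracy_score(y_test, single_pred)*100:.2f}% accurate')
----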
- -.Deliverables -==== -- Accuracy of the Random Forest model with 100 n_estimators -- Display the first tree in the forest -==== - -=== Question 3 (2 points) - -Similar to investigating the Decision Tree's parameters in project 7, let's investigate how the number of trees in the forest affects the accuracy of the model. Additionally, we will also measure the time it takes to train and test the model. - -Please create random forests with 10 through 1000 trees, in increments of 10, and record the accuracy of each model and time it takes to train/test into lists called `accuracies` and `times`, respectively. Plot the number of trees against the accuracy of the model. Be sure to use a `random_state` of 13 for reproducibility. - -Code to display the accuracy of the model is below. -[source,python] ----- -import matplotlib.pyplot as plt - -plt.plot(range(10, 1001, 10), accuracies) -plt.xlabel('N_Estimators') -plt.ylabel('Accuracy') -plt.title('Accuracy vs N_Estimators') ----- - -Code to display the time it takes to train and test the model is below. -[source,python] ----- -import matplotlib.pyplot as plt - -plt.plot(range(10, 1001, 10), times) -plt.xlabel('N_Estimators') -plt.ylabel('Time') -plt.title('Time vs N_Estimators') ----- - -.Deliverables -==== -- Code to generate the data for the plots -- Graph showing the number of trees in the forest against the accuracy of the model -- Graph showing the numebr of trees in the forest against the time it takes to train and test the model -- What is happening in the first graph? Why do you think this is happening? -- What is the relationship between the number of trees and the time it takes to train and test the model (linear, exponential, etc)? -==== - -=== Question 4 (2 points) - -Now, let's look at our Extra Trees model. Creating an Extra Trees model is the same as creating a Random Forest model, but using the `ExtraTreesClassifier` class instead of the `RandomForestClassifier` class. - -[source,python] ----- -from sklearn.ensemble import ExtraTreesClassifier - -extra_trees = ExtraTreesClassifier(n_estimators=100, random_state=20) -extra_trees.fit(X_train, y_train) ----- - - -Let's see how it performs with 100 n_estimators by running the below code. - -[source,python] ----- -from sklearn.metrics import accuracy_score - -y_pred = extra_trees.predict(X_test) -accuracy = accuracy_score(y_test, y_pred) - -print(f'Model is {accuracy*100:.2f}% accurate') ----- - -.Deliverables -==== -- Accuracy of the Extra Trees model with 100 n_estimators -==== - -=== Question 5 (2 points) - -It would be repetitive to investigate how n_estimators affects the accuracy and time of the Extra Trees model, as it would be the same as the Random Forest model. - -Instead, let's look into the differences between the two models. The primary difference between these two models is how they select the data to train each tree. Random Forests use bootstrapping to create multiple datasets, while Extra Trees use a random subset of features and data to train each tree. - -We can see how important each feature is to the model by looking at the `feature_importances_` attribute of the model. This attribute will show how important each feature is to the model, with higher values being more important. Please run the below code to create new Random Forest and Extra Trees models, and diplay the feature importance for each. Then, write your own code to calculate the average number of features being used in each tree for both models. 
- -[source,python] ----- -import matplotlib.pyplot as plt - -forest = RandomForestClassifier(n_estimators=100, random_state=20, bootstrap=True, max_depth=4) -forest.fit(X_train, y_train) - -extra_trees = ExtraTreesClassifier(n_estimators=100, random_state=20, bootstrap=True, max_depth=4) -extra_trees.fit(X_train, y_train) - -plt.bar(X.columns, forest.feature_importances_) -plt.title('Random Forest Feature Importance') -plt.show() - -plt.bar(X.columns, extra_trees.feature_importances_) -plt.title('Extra Trees Feature Importance') -plt.show() ----- - -.Deliverables -==== -- Code to display the feature importance of the Random Forest and Extra Trees models -- Code to calculate the average number of features being used in each tree for both models -- What are the differences between the feature importances of the Random Forest and Extra Trees models? Why do you think this is? -==== - - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project8.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project9.adoc b/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project9.adoc deleted file mode 100644 index 71ef3a611..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-project9.adoc +++ /dev/null @@ -1,323 +0,0 @@ -= 301 Project 09 - Regression: Basics -:page-mathjax: true - -== Project Objectives - -In this project, we will learn about the basics of regression. We will explore common regression techniques and how to interpret their results. We will also investigate the strengths and weaknesses of different regression techniques and how to choose the right one for a given problem. - -.Learning Objectives -**** -- Basics of Regression -- Regression specific terminology and metrics -- Popular regression techniques -**** - -== Supplemental Reading and Resources - -== Dataset - -- `/anvil/projects/tdm/data/boston_housing/boston.csv` - -== Questions - -=== Question 1 (2 points) - -The most common regression technique is linear regression. Have you ever generated a trendline in Excel? If so, that is a form of linear regression! There are multiple forms of linear regression, but the most common is called `simple linear regression`. Other forms include `multiple linear regression`, and `polynomial regression`. - -[NOTE] -==== -It may seem counter intuitive that `polynomial regression` is considered a form of linear regression. When the regression model is trained for some polynomial degree, say y= ax^2 + bx + c, the model does not know that x^2 is the square of x. It instead treats x^2 as a separate variable (z = x^2), ie. y = az + bx + c, thus a linear equation. 
Colinearity between z and x are an issue, which is why regularization techniques, such as lasso and ridge regression, should be used to help prevent overfitting. -==== - -Each of these forms is slightly different, but at their core, they all attempt to model the relationship between one or more independent variables and one or more dependent variable. - -[cols="4,4,4,4",options="header"] -|=== -| Model | Independent | Dependent | Description -| Simple Linear Regression | 1 variable | 1 variable | Models the relationship between one independent variable and one dependent variable. Think of y = ax + b, where y is the dependent variable, x is the independent variable, and a and b are coefficients. -| Multiple Linear Regression | 2+ variables | 1 variable | Models the relationship between two or more independent variables and one dependent variable. Think z = ax + by + c, where z is the dependent variable, x and y are independent variables, and a, b, and c are coefficients. -| Polynomial Regression | 1 variable | 1 variable | Models the relationship between one or more independent variables and one dependent variable using a polynomial function. Think y = ax^2 + bx + c, where y is the dependent variable, x is the independent variable, and a, b, and c are coefficients. -| Multiple Polynomial Regression | 2+ variables | 1 variable | Models the relationship between two or more independent variables and one dependent variable using a polynomial function. Think z = ax^2 + by^2 + cx + dy + e, where z is the dependent variable, x and y are independent variables, and a, b, c, d, and e are coefficients. -| Multivariate Linear Regression | 2+ variables | 2+ variables | Models the relationship between two or more independent variables and two or more dependent variables. Can be linear or polynomial. Think Y = AX + B. Where Y and X are matrices, and A and B are matrices of coefficients. This allows predictions of multiple dependent variables at once. -| Multivariate Polynomial Regression | 2+ variables | 2+ variables | Same as Multivariate Linear Regression, but for polynomials. Also generalized to Y = AX + B, however X must have all independent variables and their polynomial terms, and A and B must be much larger matrices to store these coefficients. -|=== - -For this question, please run the following code the load our very simple dataset. - -[source,python] ----- -import numpy as np -import matplotlib.pyplot as plt - -# Data -x = np.array([1, 2, 3, 4, 5]) -y = np.array([5.5, 7, 9.5, 12.5, 13]) ----- - -Using the data above, find the best fit line for the data using simple linear regression. Store the slope and y-intercept in the variables `a` and `b` respectively. -[NOTE] -==== -We can find lines of best fit using the np.polyfit function. Although this function is built for polynomial regression, it can be used for simple linear regression by setting the degree parameter to 1. This function returns an array of coefficients, ordered from highest degree to lowest. For simple linear regression (y = mx + b), the first coefficient is the slope (m) and the second is the y-intercept (b). 
-==== - -[source,python] ----- -# Find the best fit line - -# YOUR CODE HERE -a, b = np.polyfit(x, y, 1) - -# Plot the data and the best fit line - -print(f'Coefficients Found: {a}, {b}') -y_pred = a * x + b - -plt.scatter(x, y) -plt.plot(x, y_pred, color='red') -plt.xlabel('x') -plt.ylabel('y') -plt.show() ----- - -.Deliverables -==== -- Coefficients found by np.polyfit with degree 1 -- Plot of the data and the best fit line -==== - -=== Question 2 (2 points) - -After finding the best fit line, we should have two variables stored: `y`, and `y_pred`. Now that we have these, we can briefly discuss evaluation metrics for regression models. There are many, many metrics that can be used to evaluate regression models. We will discuss a few of the most common ones here, but we implore you to do further research on your own to learn about more metrics. - -[cols="4,4,4,4",options="header"] -|=== -| Metric | Description | Formula | Range -| Mean Squared Error (MSE) | Average of the squared differences between the predicted and actual values. | $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. | $[0, \infty)$ -| Root Mean Squared Error (RMSE) | Square root of the MSE. | $RMSE = \sqrt{MSE}$ | $[0, \infty)$ -| Mean Absolute Error (MAE) | Average of the absolute differences between the predicted and actual values. | $MAE = \frac{1}{n} \sum_{i=1}^{n} \mid y_i - \hat{y}_i \mid $, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. | $[0, \infty)$ -| R-Squared | Explains the variance of dependent variables that can be explained by the independent variables. | $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, where $SS_{res}$ is the sum of squared residuals (actual - prediction) and $SS_{tot}$ is the total sum of squares (actual - mean of actual). | $[0, 1]$ -|=== - - -Using the variables `y` and `y_pred` from the previous question, calculate the following metrics: MSE, RMSE, MAE, and R-Squared. Write functions for `get_mse`, `get_rmse`, `get_mae`, and `get_r_squared`, that each take in actual and predicted values. Call these functions on y and y_pred and store the results in the variables `mse`, `rmse`, `mae`, and `r_squared` respectively. - -[source,python] ----- -# Calculate the evaluation metrics -# YOUR CODE HERE -def get_mse(y, y_pred): - pass - -def get_rmse(y, y_pred): - pass - -def get_mae(y, y_pred): - pass - -def get_r_squared(y, y_pred): - pass - -mse = get_mse(y, y_pred) -rmse = get_rmse(y, y_pred) -mae = get_mae(y, y_pred) -r_squared = get_r_squared(y, y_pred) - - -print(f'MSE: {mse}') -print(f'RMSE: {rmse}') -print(f'MAE: {mae}') -print(f'R-Squared: {r_squared}') ----- - -.Deliverables -==== -- Output of the evaluation metrics -==== - -=== Question 3 (2 points) - -Now that we understand some evaluation metrics, let's see how polynomial regression compares to simple linear regression on our same dataset. We will explore a range of polynomial degrees and see how the evaluation metrics change. Firstly, let's write a function that will take in an x value and an array of coefficients and return the predicted y value using a polynomial function. 
- -[source,python] ----- -def poly_predict(x, coeffs): - # y_intercept is the last element in the array - y_intercept = None # your code here - - # predicted value can start as the y-intercept - predicted_value = y_intercept - - # The rest of the elements are the coefficients, so we can determine the degree of the polynomial - coeffs = coeffs[:-1] - current_degree = None # your code here - - # Now, we can iterate through the coefficients and make a sum of coefficient * (x^current_degree) - # remember that the first element in the array is the coefficient for the highest degree, and the last element is the coefficient for the lowest degree - for i, coeff in enumerate(coeffs): - # your code here to increment the predicted value - - pass - - return predicted_value ----- - -Once you have created this function, please run the following code to ensure that it works properly. - -[source,python] ----- -assert poly_predict(2, [1, 2, 3]) == 11 -assert poly_predict(4, [1, 2, 3]) == 27 -assert poly_predict(3, [1, 2, 3, 4, 5]) == 179 -assert poly_predict(4, [2.5, 2, 3]) == 51 -print("poly_predict function is working!") ----- - -Now, we will perform the np.polyfit function for degrees ranging from 2 to 5. For each degree, we will get the coefficients, calculate the predicted values, and then calculate the evaluation metrics. Store the results in a dictionary where the key is the degree and the value is a dictionary containing the evaluation metrics. - -[NOTE] -==== -If you correctly implement this, numpy will issue a warning that says "RankWarning: Polyfit may be poorly conditioned". We expect you to run into this and think about what it means. You can hide this message by running the code -[source,python] ----- -import warnings -warnings.simplefilter("ignore", np.RankWarning) ----- -==== - -[source,python] ----- -results = dict() -for degree in range(2, 6): - # get the coefficients - coeffs = None # your code here - - # Calculate the predicted values - y_pred = None # your code here - - # Calculate the evaluation metrics - mse = get_mse(y, y_pred) - rmse = get_rmse(y, y_pred) - mae = get_mae(y, y_pred) - r_squared = get_r_squared(y, y_pred) - - # Store the results in a new dictionary that is stored in the results dictionary - # eg, results[2] = {'mse': 0.5, 'rmse': 0.7, 'mae': 0.3, 'r_squared': 0.9} - results[degree] = None # your code here - -results ----- - -.Deliverables -==== -- Function poly_predict -- Output of the evaluation metrics for each degree of polynomial regression -- Which degree of polynomial regression performed the best? Do you think this is the best model for this data? Why or why not? -==== - -=== Question 4 (2 points) - -In question 1, we briefly mentioned that regularization techniques are used to help prevent overfitting. Regularization techniques add term to the loss function that penalizes the model for having large coefficients. In practice, this helps make sure that the model is fitting to patterns in the data, rather than noise or outliers. The two most common regularization techniques for machine learning are LASSO (L1 Regulariziation) and Ridge (L2 Regularization). - -LASSO is an acronym for Least Absolute Shrinkage and Selection Operator. Essentially, this regularization technique computes the sum of the absolute values of the coefficients and uses it as the penalty term in the loss function. This helps ensure that the magnitude of coefficients is kept small, and can often lead to some coefficients being set to zero. 
This essentially helps the model perform feature selection to improve generalization. - -Ridge regression works in a similar manner, however it uses the sum of each coefficient squared instead of the absolute value. This also helps force the model to use smaller coefficients, but typically does not set any coefficients to zero. This typically helps reduce collinearity between features. - - -Now, our 5th degree polynomial from the previous question had perfect accuracy. However, looking at the data yourself, do you really believe that the data is best represented by a 5th degree polynomial? The linear regression model from Question 1 is likely the best model for this data. Using the coefficients from the 5th degree polynomial, print the predicted values for the following x values: - -[source,python] ----- -x_values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) ----- - -Are the predicted y values reasonable for 1 through 5? What about 6 through 10? - -Let's see if we can improve this 5th degree polynomial by using Ridge regression. Ridge regression is implemented in the scikit-learn linear models module under the `Ridge` class. Additionally, to ensure we are using a 5th degree polynomial, we will first need the `PolynomialFeatures` class from the preprocessing module. Finally, we can use scikit-learn pipelines to chain these two models together through the `make_pipeline` function. The code below demonstrates how to use these three classes together. - -[source,python] ----- -from sklearn.preprocessing import PolynomialFeatures -from sklearn.linear_model import Ridge -from sklearn.pipeline import make_pipeline - -n_degree = 5 -polyfeatures = PolynomialFeatures(degree=n_degree) - -alpha = 0.1 # regularization term. Higher values = more regularization, 0 = simple linear regression -ridge = Ridge(alpha=alpha) - -model = make_pipeline(polyfeatures, ridge) - -# we need to reshape the data to be a 2D array -x = np.array([1, 2, 3, 4, 5]) -x = x.reshape(-1, 1) -y = np.array([5.5, 7, 9.5, 12.5, 13]) - -# fit the model -model.fit(x, y) - -# predict the values -y_pred = model.predict(x) ----- - -Now that you have a fitted Ridge model, what are the coefficients (you can get them with `model.named_steps['ridge'].coef_` and `model.named_steps['ridge'].intercept_`), and how do they compare to the previous 5th degree polynomial's coefficients? Are the Ridge model's predicted values more reasonable for 1 through 5? What about 6 through 10? - -.Deliverables -==== -- Predicted values for x_values using the 5th degree polynomial -- Are the predicted values reasonable for 1 through 5? What about 6 through 10? -- Code to use Ridge regression on the 5th degree polynomial -- How do the coefficients of the Ridge model compare to the 5th degree polynomial? -- Are the L2 regularization predicted values more reasonable for 1 through 5? What about 6 through 10? -==== - -=== Question 5 (2 points) - -As you saw in the previous question, Ridge can help penalize large coefficients to help stop overfitting. However, it can never fully recover when our baseline model is overfit. LASSO, on the other hand, can help us recover from overfitting by setting some coefficients to zero. Let's see if LASSO can help us recover from our overfit 5th degree polynomial. - -LASSO regression is implemented in the scikit-learn linear models module under the `Lasso` class. We can use the same pipeline as before, but with the Lasso class instead of the Ridge class. 
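For reference, a minimal sketch of swapping `Ridge` for `Lasso` in the same pipeline might look like the following; the `alpha` value here is an arbitrary placeholder, and `max_iter` is raised as discussed in the note below:

[source,python]
----
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline

polyfeatures = PolynomialFeatures(degree=5)
lasso = Lasso(alpha=0.1, max_iter=10000)  # alpha is a placeholder value

model = make_pipeline(polyfeatures, lasso)
model.fit(x, y)  # x and y as defined earlier (x already reshaped to a 2D array)

# the fitted coefficients are available through the named pipeline step
print(model.named_steps['lasso'].coef_)
print(model.named_steps['lasso'].intercept_)
----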
- -[NOTE] -==== -The Lasso class has an additional parameter, max_iter, which is the maximum number of iterations to run the optimization. For this question, set max_iter=10000. -==== - -After you have done this, let's see how changing the value of `alpha` affects our coefficients. To give an overall value to the coefficients, we will use the L1 method, which is the sum of the absolute values of the coefficients. For example, the below code will give the L1 value of the LASSO coefficients. - -[source,python] ----- -value = np.sum(np.abs(model.named_steps['lasso'].coef_)) ----- - -For each alpha value from .1 to 1 in increments of .01, fit the LASSO model and Ridge model to the data. Calculate the L1 value of the model's coefficients for each alpha value, and store them in the lists `lasso_values` and `ridge_values` respectively. Then, run the below code to plot the alpha values against the L1 values for both the LASSO and Ridge models. - -[source,python] ----- -plt.plot(np.arange(.1, 1.01, .01), lasso_values, label='LASSO') -plt.plot(np.arange(.1, 1.01, .01), ridge_values, label='Ridge') -plt.xlabel('Alpha') -plt.ylabel('L1 Value') -plt.legend() -plt.show() ----- - -.Deliverables -==== -- How do the LASSO model's coefficients compare to the 5th degree polynomial? -- How do the LASSO model's coefficients compare to the Ridge model's coefficients? -- What is the relationship between the alpha value and the L1 value for both the LASSO and Ridge models? -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project9.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-projects.adoc deleted file mode 100644 index 7a3a0ef17..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/30100/30100-2024-projects.adoc +++ /dev/null @@ -1,51 +0,0 @@ -= TDM 30100 - -== Important Links - -xref:fall2024/logistics/office_hours.adoc[[.custom_button]#Office Hours#] -xref:fall2024/logistics/syllabus.adoc[[.custom_button]#Syllabus#] -https://piazza.com/purdue/fall2024/tdm1010010200202425[[.custom_button]#Piazza#] - -== Assignment Schedule - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. 
-==== - -|=== -| Assignment | Release Date | Due Date -| Syllabus Quiz | Aug 19, 2024 | Aug 30, 2024 -| Academic Integrity Quiz | Aug 19, 2024 | Aug 30, 2024 -| Project 1 - Intro to ML - Using Anvil | Aug 19, 2024 | Aug 30, 2024 -| Project 2 - Intro to ML - Basic Concepts | Aug 22, 2024 | Aug 30, 2024 -| Project 3 - Intro to ML - Data Preprocessing | Aug 29, 2024 | Sep 06, 2024 -| Project 4 - Classifiers - Basics of Classification | Sep 05, 2024 | Sep 13, 2024 -| Outside Event 1 | Aug 19, 2024 | Sep 13, 2024 -| Project 5 - Classifiers - K-Nearest Neighbors (KNN) I | Sep 12, 2024 | Sep 20, 2024 -| Project 6 - Classifiers - K-Nearest Neighbors (KNN) II | Sep 19, 2024 | Sep 27, 2024 -| Project 7 - Classifiers - Decision Trees | Sep 26, 2024 | Oct 04, 2024 -| Outside Event 2 | Aug 19, 2024 | Oct 04, 2024 -| Project 8 - Classifiers - Decision Tree Ensembles | Oct 03, 2024 | Oct 18, 2024 -| Project 9 - Regression: Basics | Oct 17, 2024 | Oct 25, 2024 -| Project 10 - Regression: Perceptrons | Oct 24, 2024 | Nov 01, 2024 -| Project 11 - Regression: Artificial Neural Networks (ANN) - Multilayer Perceptron (MLP) | Oct 31, 2024 | Nov 08, 2024 -| Outside Event 3 | Aug 19, 2024 | Nov 08, 2024 -| Project 12 - Regression: Bayesian Ridge Regression | Nov 7, 2024 | Nov 15, 2024 -| Project 13 - Hyperparameter Tuning | Nov 14, 2024 | Nov 29, 2024 -| Project 14 - Class Survey | Nov 21, 2024 | Dec 06, 2024 -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -// **Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project1.adoc b/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project1.adoc deleted file mode 100644 index e8c7e1c2b..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project1.adoc +++ /dev/null @@ -1,235 +0,0 @@ -= TDM 40100: Project 01 - Intro to ML - Using Anvil - -== Project Objectives - -We remind ourselves how to use the Anvil platform and how to run Python code in Jupyter Lab. We also remind ourselves about using the Pandas library. This project is intended to be a light start to the fall semester. - -.Learning Objectives -**** -- Create and use Anvil sessions -- Create Jupyter notebooks -- Load dataset with pandas -- Basic data manipulation with pandas -**** - -== Dataset - -This project will use the following dataset: -- `/anvil/projects/tdm/data/iris/Iris.csv` - -== Questions - -=== Question 1 (2 points) - -Let's start out by starting a new Anvil session. If you do not remember how to do this, please read through https://the-examples-book.com/projects/fall2024/10100/10100-2024-project1[Project 1 at the introduction TDM 10100 level]. 
- -Once you have started a new Anvil session, download https://the-examples-book.com/projects/_attachments/project_template.ipynb[the project template] and upload it. Then, open this template in Jupyter notebook. Save it as a new file with the following naming convention: `lastname_firstname_project#.ipynb`. For example, `doe_jane_project1.ipynb`. - -[NOTE] -==== -You may be prompted to select a kernel when opening the notebook. We will use the `seminar` kernel (not the `seminar-r` kernel) for TDM 40100 projects. You are able to change the kernel by clicking on the kernel dropdown menu and selecting the appropriate kernel if needed. -==== - -To make sure everything is working, run the following code cell: -[source,python] ----- -print("Hello, world!") ----- - -Your output should be `Hello, world!`. If you see this, you are ready to move on to the next question. - -Although question 1 is trivially easy, we still want you to (please) get into the habit of commenting on the work in each question. So (please) it would be helpful to write (in a separate cell) something like, "We are reminding ourselves how to use Anvil and how to print a line of output." - -.Deliverables -==== -- Output of running the code cell -- Be sure to document your work from Question 1, using some comments and insights about your work. -==== - -=== Question 2 (2 points) - -Now that we have our Jupyter Lab notebook set up, let's begin working with the pandas library. - -Pandas is a Python library that allows us to work with datasets in tabular form. There are functions for loading datasets, manipulating data, etc. - -To start out with, let's load the Iris dataset that is located at `/anvil/projects/tdm/data/iris/Iris.csv`. - -To do this, you will need to import the pandas library and use the `read_csv` function to load the dataset. - -Run the following code cell to load the dataset: -[source,python] ----- -import pandas as pd - -myDF = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv') ----- - -[NOTE] -==== -In the provided code, pandas is imported as `pd` for brevity. This is a common convention in the Python community. Similarly, `myDF` (short for "my dataframe") is often used as a variable for pandas dataframes. It is not required for you to follow either of these conventions, but it is good practice to do so. -==== - -Now that our dataset is loaded, let's take a look at the first 5 rows of the dataset. To do this, run the following code cell: -[source,python] ----- -myDF.head() ----- - -[NOTE] -==== -The head function is used to display the first n rows of the dataset. By default, n is set to 5. You can change this by passing an integer to the function. For example, `myDF.head(10)` will display the first 10 rows of the dataset. This function is useful for quickly inspecting the dataframe to see what the data looks like. -==== - -.Deliverables -==== -- Output of running the code cell -- Be sure to document your work from Question 2, using some comments and insights about your work. -==== - -=== Question 3 (2 points) - -An important aspect of our dataframe for machine learning is the shape (rows, columns). As you will learn later, the shape will help us determine what kind of machine learning model will be the best fit, as well as how complex it may be. - -To get the shape of the dataframe, run the following code cell: -[source,python] ----- -myDF.shape ----- - -[NOTE] -==== -There are multiple ways to get the number of rows and columns in a DataFrame. 
`len(myDF.index)` gives the number of rows, and `len(myDF.columns)` gives the number of columns in a DataFrame. The `shape` attribute is commonly preferred because it’s more concise and returns both the number of rows and columns in a single call. -==== - -This returns a tuple in the form (rows, columns). - -.Deliverables -==== -- How many rows are in the dataframe? -- How many columns are in the dataframe? -- Be sure to document your work from Question 3, using some comments and insights about your work. -==== - -=== Question 4 (2 points) - -Now that we have loaded the dataset, let's investigate how we can manipulate the data. - -One common operation is to select a subset of the data. This is done using the `iloc` function, which allows us to index the dataframe by row and column numbers. -[NOTE] -==== -The `iloc` function is extremely powerful. It can be used in way too many ways to list here. For a more comprehensive list of how to use `iloc`, please refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html[the official pandas iloc documentation]. -==== - -To select the first n rows of the dataframe, we can use the `iloc` function with a slice: `myDF.iloc[:n]`. - -Write code to select the first 10 rows of the dataframe from Question 3 into a new dataframe called `myDF_subset`. Print the shape of `myDF_subset` to verify that you have selected the correct number of rows. - -We can also use the `iloc` function to select specific columns. To select specific columns, we can also use a slice, however we must specify the rows we want first. To select all rows, we simply pass a colon `:`. For example, to select the first 10 rows and the first 3 columns, we could use the following code: `myDF.iloc[:10, :3]`. - -Write code to select the 40th through 50th rows (inclusive) and the 2nd and 4th columns of the dataframe from Question 3 into a new dataframe called `myDF_subset2`. Print the shape of `myDF_subset2` to verify that you have selected the correct number of rows and columns. - -The iloc function can also be used to filter rows based on a condition. For example, if we wanted all rows where the PetalWidthCm is greater than 1.5, we could use the following code: `myDF.loc[myDF['PetalWidthCm'] > 1.5, :]`. - -Write code to select all rows where SepalLengthCm is less than 5.0 into a new dataframe called `myDF_subset3`. How many rows are in this dataframe? - -.Deliverables -==== -- Output of printing the shape of `myDF_subset` -- Output of printing the shape of `myDF_subset2` -- How many rows are in the `myDF_subset3` dataframe? -- Be sure to document your work from Question 4, using some comments and insights about your work. -==== - -=== Question 5 (2 points) - -Another common operation is to remove column(s) from the dataframe. This is done using the `drop` function. - -[NOTE] -==== -Similarly to the `iloc` function, the `drop` function is extremely powerful. For a more comprehensive list of how to use `drop`, please refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html[the official pandas drop documentation]. -==== - -The most readable way to drop a column is by dropping it by name. To drop column(s) by name, you can use the following syntax: `myDF.drop(['column1_name', 'column2_name', ...], axis=1)`. The `axis=1` argument tells pandas to drop columns, not rows. - -Write code to drop the `Id` column from the myDF_subset into a new dataframe called `myDF_without_id`. Print the shape of the dataframe to verify that the column has been removed. 
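As a quick reference, here is a minimal sketch of the drop-by-name pattern on a small made-up dataframe (the column names below are placeholders, not from the Iris data):

[source,python]
----
import pandas as pd

toy = pd.DataFrame({'Id': [1, 2, 3], 'A': [0.1, 0.2, 0.3], 'B': ['x', 'y', 'z']})

# axis=1 tells pandas to drop a column rather than a row
toy_without_id = toy.drop(['Id'], axis=1)
print(toy_without_id.shape)  # (3, 2) -- one fewer column than before
----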
- -Additionally, we can extract columns from a dataframe into a new dataframe. Extracting a column is very simple: `myDF['column_name']` will return a pandas series containing the values of the column. To extract multiple columns, you can pass a list of column names: `myDF[['column1_name', 'column2_name', ...]]`. -To then store a series in a new dataframe, we can simply cast it to a dataframe: `pd.DataFrame(myDF['column_name'])`. - -Write code to extract the `Species` and `SepalWidthCm` columns from the `myDF_without_id` dataframe into a new dataframe called `myDF_species`. Print the shape of the dataframe to verify that the columns have been extracted. Print the first 5 rows of the dataframe to verify that the columns have been extracted correctly. - -.Deliverables -==== -- Output of printing the shape of the dataframe after dropping the `Id` column -- Output of printing the first 5 rows of the dataframe after extracting the `Species` and `SepalWidthCm` columns -- Be sure to document your work from Question 5, using some comments and insights about your work. -==== - -=== Question 6 (2 points) - -We briefly touched on filtering rows based on a condition in Question 4. In this case, we simply filtered by one condition. It is fine to simply filter by one condition repeatedly until you have performed all the filtering you need. However, it is also possible to filter by multiple conditions in a single operation. - -Pandas allows us to use logical operators to combine multiple conditions into a boolean expression. The logical operators are `&` for "and", `|` for "or", and `~` for "not". We can use these operators along with conditionals to filter rows based on multiple conditions. - -We can store each condition in a variable like so: - -[source,python] ----- -condition1 = myDF[column] > value1 -condition2 = myDF[column] == value2 ----- - -Once we have these conditions, we can combine them in the `loc` function like so: - -[source,python] ----- -# both conditions must be true -myDF.loc[condition1 & condition2] -# condition1 must be false or condition2 must be true -myDF.loc[~condition1 | condition2] ----- - -Write code to filter the `myDF` dataframe to only include rows where SepalLengthCm is greater than 5.0, PetalWidthCm is less than 1.5, and Species is not `Iris-setosa`. -Store the filtered dataframe in a new variable called `myDF_filtered`. How many rows meet these conditions? - -.Deliverables -==== -- How many rows meet the conditions? -- Be sure to document your work from Question 6, using some comments and insights about your work. -==== - -=== Question 7 (2 points) - -One of the most common operations in data analysis is to calculate summary statistics. This includes things like the mean, median, standard deviation, etc. - -Pandas has many operations to calculate these statistics. To calculate the mean of a column, we can simply write `myDF[column].mean()`. Similarly, to calculate the median, we can write `myDF[column].median()`. For standard deviation, we can write `myDF[column].std()`. - -Additionally, we can find all unique values in a column by using the `unique` function. `myDF[column].unique()` will return an array containing all unique values in the column. - -[NOTE] -==== -It may be beneficial to cast the result of `unique` to a list to make it easier to work with. -==== - -Write code to determine the mean, median, and standard deviation of the `SepalLengthCm` column in the `myDF` dataframe, for each unique species in the `Species` column. 
-Please note that there are 3 unique species in the `Species` column, so you should have 3 sets of statistics. - -.Deliverables -==== -- Output of the mean, median, and standard deviation of the `SepalLengthCm` column for each unique species -- Be sure to document your work from Question 7, using some comments and insights about your work. -==== - -== Submitting your Work - -Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope. - -.Items to submit -==== -- firstname_lastname_project1.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project10.adoc b/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project10.adoc deleted file mode 100644 index 81365c750..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project10.adoc +++ /dev/null @@ -1,428 +0,0 @@ -= 401 Project 10 - Regression: Perceptrons - -== Project Objectives - -In this project, we will be learning about perceptrons and how they can be used for regression. We will be using the Boston Housing dataset as it has many different potential features and target variables. - -.Learning Objectives -**** -- Understand the basic concepts behind a perceptron -- Implement activation functions and their derivatives -- Implement a perceptron class for regression -**** - -== Supplemental Reading and Resources - -== Dataset - -- `/anvil/projects/tdm/data/boston_housing/boston.csv` - -== Questions - -=== Question 1 (2 points) - -A perceptron is a simple model that can be used for regression. These perceptrons can be combined together to create neural networks. In this project, we will be creating a perceptron from scratch. - -To start, let's load in the Boston Housing dataset with the below code: -[source,python] ----- -import pandas as pd -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import train_test_split -df = pd.read_csv('/anvil/projects/tdm/data/boston_housing/boston.csv') - -X = df.drop(columns=['MEDV']) -y = df[['MEDV']] - -scaler = StandardScaler() -X = scaler.fit_transform(X) - -X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7) - -y_train = y_train.to_numpy() -y_test = y_test.to_numpy() ----- - -Now, we can begin discussing what a perceptron is. A perceptron is a simple model that takes in a set of inputs and produces an output. The perceptron is defined by a set of weights and a bias term (similar to our linear regression model having coefficients and an y intercept term). The perceptron then takes the dot product of the input features and the weights and adds the bias term. - -Then, the perceptron will apply some activation function before outputting the final value. 
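
As a rough sketch (illustrative only, not the class you will build in this project), the forward computation just described looks like the following, where `activation` stands for whichever activation function is chosen:

[source,python]
----
import numpy as np

def perceptron_forward(x, weights, bias, activation):
    # dot product of the input features and the weights, plus the bias term
    z = np.dot(x, weights) + bias
    # apply the activation function before outputting the final value
    return activation(z)
----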
This activation function is some non-linear function that allows the perceptron to learn complex data, instead of behaving as a linear model. - -There are many different activation functions, some of the most common are listed below: - -[cols="2,2,2,2",options="header"] -|=== -|Activation Function | Formula | Derivative | Usage -|Linear | x | 1 | Final layer of regression to output continuous values -|ReLU | max(0, x) | 1 if x > 0, 0 otherwise | Hidden layers of neural networks -|Sigmoid | 1 / (1 + exp(-x)) | sigmoid(x) * (1 - sigmoid(x)) | Final layer of binary classification, or hidden layers of neural networks -|Tanh | (exp(x) - exp(-x)) / (exp(x) + exp(-x)) | 1 - tanh(x)^2 | Hidden layers of neural networks -|=== - -For this project, we will be creating a perceptron class that can be used for regression. There are many different parameters that can be set when creating a perceptron, such as the learning rate, number of epochs, and activation function. - -For this question, please implement functions for the Linear, ReLU, Sigmoid, and Tanh activation functions. Additionally, implement the derivative of each of these functions. These functions should be able to take in a numpy array and return the transformed array. - -[source,python] ----- -import numpy as np -def linear(x): - pass -def linear_d(x): - pass -def relu(x): - pass -def relu_d(x): - pass -def sigmoid(x): - pass -def sigmoid_d(x): - pass -def tanh(x): - pass -def tanh_d(x): - pass ----- - -To test your functions, you can use the below code: -[source,python] ----- -x = np.array([-1, 0, 1]) -print(linear(x)) # should return [-1, 0, 1] -print(linear_d(x)) # should return [1, 1, 1] -print(relu(x)) # should return [0, 0, 1] -print(relu_d(x)) # should return [0, 0, 1] -print(sigmoid(x)) # should return [0.26894142, 0.5, 0.73105858] -print(sigmoid_d(x)) # should return [0.19661193, 0.25, 0.19661193] -print(tanh(x)) # should return [-0.76159416, 0, 0.76159416] -print(tanh_d(x)) # should return [0.41997434, 1, 0.41997434] ----- - -.Deliverables -==== -- Completed activation and derivative functions -- Test the functions with the provided code -==== - -=== Question 2 (2 points) - -Now that we have our activation functions, let's start working on our Perceptron class. This class will create a perceptron that can be used for regression problems. Below is a skeleton of our Perceptron class: - -[source,python] ----- -class Perceptron: - def __init__(self, learning_rate=0.01, n_epochs=1000, activation='relu'): - # this will initialize the perceptron with the given parameters - pass - - def activate(self, x): - # this will apply the activation function to the input - pass - - def activate_derivative(self, x): - # this will apply the derivative of the activation function to the input - pass - - def compute_linear(self, X): - # this will calculate the linear combination of the input and weights - pass - - def error(self, y_pred, y_true): - # this will calculate the error between the predicted and true values - pass - - def backward_gradient(self, error, linear): - # this will update the weights and bias of the perceptron - pass - - def predict(self, X): - # this will predict the output of the perceptron given the input - pass - - def train(self, X, y, reset_weights = True): - # this will train the perceptron on the given input and target values - pass - - def test(self, X, y): - # this will test the perceptron on the given input and target values - pass ----- - -Now, it may seem daunting to implement all of these functions. 
However, most of these functions are as simple as one mathematical operation.

*For this question, please implement the `__init__`, `activate`, and `activate_derivative` functions.*
The `__init__` function should initialize the perceptron with the given parameters, as well as setting the weights and bias terms to None.

The `activate` function should apply the activation function to the input, and the `activate_derivative` function should apply the derivative of the activation function to the input. It is important that these functions use the proper function based on the value of `self.activation`. Additionally, if the activation function is not set to one of the four functions we implemented earlier, the default should be the ReLU function.

To test your functions, you can use the below code:
[source,python]
----
test_x = np.array([-2, 0, 1.5])
p = Perceptron(learning_rate=0.01, n_epochs=1000, activation='linear')
print(p.activate(test_x)) # should return [-2, 0, 1.5]
print(p.activate_derivative(test_x)) # should return [1, 1, 1]
p.activation = 'relu'
print(p.activate(test_x)) # should return [0, 0, 1.5]
print(p.activate_derivative(test_x)) # should return [0, 0, 1]
p.activation = 'sigmoid'
print(p.activate(test_x)) # should return [0.11920292, 0.5, 0.81757448]
print(p.activate_derivative(test_x)) # should return [0.10499359, 0.25, 0.14914645]
p.activation = 'tanh'
print(p.activate(test_x)) # should return [-0.96402758, 0, 0.90514825]
print(p.activate_derivative(test_x)) # should return [0.07065082, 1, 0.18070664]
p.activation = 'invalid'
print(p.activate(test_x)) # should return [0, 0, 1.5]
print(p.activate_derivative(test_x)) # should return [0, 0, 1]
----

.Deliverables
====
- Implement the `__init__`, `activate`, and `activate_derivative` functions
- Test the functions with the provided code
====

=== Question 3 (2 points)

Now, let's move on to the harder topics. The basic concept behind how this perceptron works is that it will take in an input, calculate the predicted value, find the error between the predicted and true value, and then update the weights and bias based on this error and its learning rate. This process is then repeated for the set number of epochs.

In this sense, there are two main portions of the perceptron that need to be implemented: the forward and backward passes. The forward pass is the process of calculating the predicted value, and the backward pass is the process of updating the weights and bias based on the calculated error.

*For this question, we will implement the `compute_linear`, `predict`, `error`, and `backward_gradient` functions.*

The `compute_linear` function should calculate the linear combination of the input, weights, and bias, by computing the dot product of the input and weights and adding the bias term.

The `predict` function should compute the linear combination of the input and then apply the activation function to the result.

The `error` function should calculate the error between the predicted (y_pred) and true (y_true) values, i.e., true - predicted.

The `backward_gradient` function should calculate the gradient of the error, which is simply the negative of the error multiplied by the activation derivative of the linear combination.
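
For example, if the linear combination is 18 and the error is 2 (the values used in the test code below), then the gradient is `-2 * sigmoid_d(18)`. Since `sigmoid(18)` is already extremely close to 1, its derivative there is tiny, and the result is roughly `-3.046e-08`, which is exactly the value the test expects.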
- -To test your functions, you can use the below code: -[source,python] ----- -p = Perceptron(learning_rate=0.01, n_epochs=1000, activation='sigmoid') -p.weights = np.array([1, 2, 3]) -p.bias = 4 - -test_X = np.array([1,2,3]) -test_y = np.array([20]) - -l = p.compute_linear(test_X) -print(l) # should return 18 -error = p.error(l, test_y) -print(error) # should return 2 -gradient = p.backward_gradient(error, l) -print(gradient) # should return -3.04599585e-08 -pred = p.predict(test_X) # should return 0.9999999847700205 -print(pred) ----- - -.Deliverables -==== -- Implement the `compute_linear`, `predict`, `error`, and `backward_gradient` functions -- Test the functions with the provided code -==== - -=== Question 4 (2 points) - -Now that we have implemented all of our helper functions, we can implement our `train` function. - -Firstly, if the argument 'reset_weights' is true, or if `reset_weights` is false but the weights and bias are not set, we will initialize our weights to a np array of zeros with the same length as the number of features in our input data. We will also initialize our bias to 0. In any other case, we will not modify the weights and bias. - -Then, this function will train the perceptron on the given training data. For each datapoint in the training data, we will get the linear combination of the input and the predicted value through our activation function. Then, we will compute the error and get the backward gradient. Then, we will calculate the gradient for our weights (simply the input times the backward gradient) and the gradient for our bias (simply the backward gradient). Finally, we will update the weights and bias by multiplying the gradients by the learning rate, and subtracting them from the current weights and bias. This process will be repeated for the set number of epochs. - -[NOTE] -==== -In this case, we are updating the weights and bias after every datapoint. This is commonly known as Stochastic Gradient Descent (SGD). Another common method is to calculate our error for every datapoint in the epoch, and then update the weights and bias based on the average error at the end of each epoch. This method is known as Batch Gradient Descent (BGD). A more sophisticated called Mini-Batch Gradient Descent (MBGD) is a combination of the two philosophies, where we group our data into small batches and update our weights and bias after each batch. This results in more weight/bias updates than BGD, but less than SGD. -==== - -In order to test your function, we will create a perceptron and train it on the Boston Housing dataset. We will then print the weights and bias of the perceptron. - -[source,python] ----- -np.random.seed(3) -p = Perceptron(learning_rate=0.01, n_epochs=1000, activation='linear') -p.train(X_train, y_train) -print(p.weights) -print(p.bias) ----- - -If you implemented the functions correctly, you should see the following output: - -[text] ----- -[-1.08035188 0.47131981 0.09222406 0.46998928 -1.90914324 3.14497775 - -0.01770744 -3.04430895 2.62947786 -1.84244828 -2.03654589 0.79672007 - -2.79553875] -[22.44124231] ----- - -.Deliverables -==== -- Implement the `train` function -- Test the function with the provided code -==== - -=== Question 5 (2 points) - -Finally, let's implement the `test` function. This function will test the perceptron on the given test data. This function should return our summary statistics from the previous project (mean squared error, root mean squared error, mean absolute error, and r squared) in a dictionary. 
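
If you need a refresher on these metrics, here is a minimal numpy sketch of the standard formulas (how you organize your `test` method is up to you):

[source,python]
----
import numpy as np

def summary_stats(y_true, y_pred):
    # standard regression metrics
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(y_true - y_pred))
    r_squared = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)
    return {'mse': mse, 'rmse': rmse, 'mae': mae, 'r_squared': r_squared}
----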
- -To test your function, you can use the below code: -[source,python] ----- -p.test(X_test, y_test) ----- - -If you implemented the function correctly, you should see the following output: - -[text] ----- -{'mse': 19.28836923644003, - 'rmse': 4.391852597303333, - 'mae': 3.2841026903192283, - 'r_squared': 0.6875969898568428} ----- - -.Deliverables -==== -- Implement the `test` function -- Test the function with the provided code -==== - -=== Question 6 (2 points) - -As mentioned in question 4, there are multiple different methods for updating the weights and bias of our class. In this question, please add the following outline to your function: - -- Rename the `train` function to `train_sgd` - -- Add the following function signatures: -[source,python] ----- -def train_bgd(self, X, y): - pass - -def train_mbgd(self, X, y, n_batches=16): - pass - -def train(self, X, y, method='sgd', n_batches=16): - pass ----- - -After you have added these signatures to your class, please implement the `train_bgd` function, which will train the perceptron using Batch Gradient Descent as described in question 4. This function should calculate the weight/bias gradients for every point in the dataset, and then update the weights and bias based on the average gradients at the end of each epoch. - -Additionally, please implement the `train` function to function as a selector for the different training methods. If `method` is set to 'sgd', the function should call the `train_sgd` function. If `method` is set to 'bgd', the function should call the `train_bgd` function. If `method` is set to 'mbgd', the function should call the `train_mbgd` function. If `method` is set to anything else, the function should raise a ValueError. - -To test your functions, you can use the below code: - -[source,python] ----- -np.random.seed(3) -p = Perceptron(learning_rate=0.1, n_epochs=1000, activation='linear') -p.train(X_train, y_train, method='bgd') -print(p.weights) -print(p.bias) -p.test(X_test, y_test) ----- - -If you implemented the function correctly, you should see the following output: - -[text] ----- -[-1.01203489 0.86314322 0.12818681 0.80290412 -2.02780693 3.08686583 - 0.04321048 -3.00595432 2.64831884 -1.92232099 -2.03927489 0.8549853 - -3.67072291] -[22.68586412] -{'mse': 19.14677015365128, - 'rmse': 4.375702246914349, - 'mae': 3.336506171166659, - 'r_squared': 0.6898903916034843} ----- - -=== Question 7 (2 points) - -Finally, please implement the `train_mbgd` function. This function will train the perceptron using Mini-Batch Gradient Descent as described in question 4. This function should split our data into `n_batches` number of batches, and then update the weights and bias based on the average gradients at the end of each batch. - -[NOTE] -==== -You should use the `np.array_split` function to split the data into batches. This function will return a list of numpy arrays, where each array is a batch of data. You can then loop through this list to update the weights and bias for each batch. 
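For example, `np.array_split(np.arange(10), 3)` returns three arrays: `[0, 1, 2, 3]`, `[4, 5, 6]`, and `[7, 8, 9]`.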
-==== - -To test your functions, you can use the below code: - -[source,python] ----- -np.random.seed(3) -p = Perceptron(learning_rate=0.1, n_epochs=1000, activation='linear') -p.train(X_train, y_train, method='mbgd') -print(p.weights) -print(p.bias) -p.test(X_test, y_test) ----- - -If you implemented the function correctly, you should see the following output: - -[text] ----- -[-0.97274486 0.67793429 0.08464404 0.72503617 -1.91926787 3.18789867 - 0.01581749 -2.97858639 2.61498091 -1.97518827 -2.00677852 0.89807989 - -3.26179108] -[22.52676272] -{'mse': 19.022979683470613, - 'rmse': 4.361534097478846, - 'mae': 3.2954565543935757, - 'r_squared': 0.6918953571367247} ----- - -Now that we have implemented SGD, BGD, and MBGD, let's compare the mean squared error of each method at each epoch. To do this, we create a new function called `train_mse`, that will test the perceptron on the test data at the end of each epoch and store the mean squared error in a list. We will then plot this list to compare the performance of each method. - -[NOTE] -==== -A common mistake is to create the perceptron object with `n_epochs=n_epochs`. If you do this, the perceptron will train for n_epochs, n_epochs times. Instead, you should create the perceptron object with `n_epochs=1`, and then call the `train` function with `reset_weights=False`, n_epochs times. -==== - -Here is the outline of the function: -[source,python] ----- -def train_mse(X_train, y_train, X_test, y_test, learning_rate=0.01, n_epochs=1000, method='sgd'): - pass ----- - -To test your functions, you can use the below code: - -[source,python] ----- -np.random.seed(3) -sgd_data = train_mse(X_train, y_train, X_test, y_test, learning_rate=0.06, n_epochs=50, method='sgd') -bgd_data = train_mse(X_train, y_train, X_test, y_test, learning_rate=0.06, n_epochs=50, method='bgd') -mbgd_data = train_mse(X_train, y_train, X_test, y_test, learning_rate=0.06, n_epochs=50, method='mbgd') - -import matplotlib.pyplot as plt -plt.plot(sgd_data, label='SGD') -plt.plot(bgd_data, label='BGD') -plt.plot(mbgd_data, label='MBGD') - -plt.xlabel('Epoch') -plt.ylabel('Mean Squared Error') -plt.legend() -plt.show() ----- - -==== -- Implement the `train_mbgd` function -- Implement the `train_mse` function -- Test the functions with the provided code -- How do the graphs of the mean squared error compare between the three methods? Which method do you think is the best? -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project10.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project11.adoc b/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project11.adoc deleted file mode 100644 index 0e6faeda7..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project11.adoc +++ /dev/null @@ -1,385 +0,0 @@ -= 401 Project 11 - Regression: Artificial Neural Networks (ANN) - Multilayer Perceptron (MLP) -:page-mathjax: true - -== Project Objectives - -In this project, we will be taking some of what we learned from our Perceptron model and expand upon it to create a functional Artificial Neural Network (ANN) model, specifically a Multi Layer Perceptron (MLP). We will use the same dataset (Boston Housing) to compare the performance of our original perceptron with the new ANN. - -.Learning Objectives -**** -- Understand the basics of artificial neural networks -- Implement a simple artificial neural network -- Train and evaluate an artificial neural network -**** - -== Supplemental Reading and Resources - -== Dataset - -- `/anvil/projects/tdm/data/boston_housing/boston.csv` - -== Questions - -=== Question 1 (2 points) - -Across this project and the next one, we will be learning about and implementing neural networks. In this project, we will expand upon the perceptron model we implemented in the previous project to create a more complex model known as a Multilayer Perceptron (MLP). This MLP is a form of ANN that consists of multiple layers, where each layer consists of multiple perceptrons. In the next project, we will be implementing a convolutional neural network (CNN), which is a type of ANN that is particularly suited for data that has spatial relationships, such as images or time series. - -In this project, we will use the same dataset as the previous project to compare the performance of our original perceptron model with the new MLP model. We will use the same features and target variable as before. 
Please run the following code to load the dataset and split it into training and testing sets:
[source,python]
----
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
df = pd.read_csv('/anvil/projects/tdm/data/boston_housing/boston.csv')

X = df.drop(columns=['MEDV'])
y = df[['MEDV']]

scaler = StandardScaler()
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7)

y_train = y_train.to_numpy()
y_test = y_test.to_numpy()
----

Additionally, please copy these solutions for activation functions, their derivatives, and evaluation metrics into your notebook:
[source,python]
----
import numpy as np

def linear(x):
    return x

def linear_d(x):
    return np.ones_like(np.atleast_1d(x))

def relu(x):
    return np.maximum(x, 0)

def relu_d(x):
    return np.where(x > 0, 1, 0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_d(x):
    return sigmoid(x) * (1 - sigmoid(x))

def tanh(x):
    return np.tanh(x)

def tanh_d(x):
    return 1 - (tanh(x)**2)

def get_mse(y, y_pred):
    return np.mean((y - y_pred) ** 2)

def get_rmse(y, y_pred):
    return np.sqrt(get_mse(y, y_pred))

def get_mae(y, y_pred):
    return np.mean(np.abs(y - y_pred))

def get_r_squared(y, y_pred):
    return 1 - np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2)

derivative_functions = {
    'relu': relu_d,
    'sigmoid': sigmoid_d,
    'linear': linear_d,
    'tanh': tanh_d
}

activation_functions = {
    'relu': relu,
    'sigmoid': sigmoid,
    'linear': linear,
    'tanh': tanh
}
----

Firstly, let's discuss what the structure of a Multilayer Perceptron looks like. An MLP typically consists of an input layer, some number of hidden layers, and an output layer. Each layer consists of multiple nodes or perceptrons, each with its own weights and biases. Each node passes its output to every node in the next layer, creating a fully connected network. The diagram below shows a simple MLP with an input layer consisting of 3 input nodes, 2 hidden layers with 6 and 4 nodes respectively, and an output layer with 1 node.

[NOTE]
====
In our MLP, the input layer will simply be the features of our dataset, so the features will be passed directly to the first hidden layer.
====
image::f24-301-p11-1.PNG[Example MLP, width=792, height=500, loading=lazy, title="MLP Diagram"]

Throughout this project, we will be implementing 3 main classes: 'Node', 'Layer', and 'MLP'. The 'Node' class represents a single neuron in our network, and will store its weights, biases, and a forward function to calculate its output. The 'Layer' class represents one of the layers in our network, and stores a list of nodes, an activation method and its derivative, and a forward function to calculate the output of all nodes in the layer. The 'MLP' class represents the entire network, and stores a list of layers, a forward function to calculate the output of the entire network, a train function to train the model, and a test function to evaluate the model using our evaluation metrics.

In this question, we will implement the 'Node' class.
Please complete the following code to implement the 'Node' class: - -[source,python] ----- -class Node: - def __init__(self, input_size): - # given input size (number of features for the node): - # initialize self.weights to random values with np.random.randn - # initialize self.bias to 0 - pass - - def forward(self, inputs): - # calculate the dot product of the inputs and weights, add the bias, and return the result. Same as the perceptron model. - pass ----- - -You can test your implementation by running the following code: -[source,python] ----- -np.random.seed(11) -node = Node(3) -inputs = np.array([1, 2, 3]) -output = node.forward(inputs) -print(output) # should print -0.276386648990842 ----- - -.Deliverables -==== -- Completed Node class -- Output of the testing code -==== - -=== Question 2 (2 points) - -Next, we will implement our 'Layer' class. The 'Layer' class is slightly more complex, as it will store a list of nodes, an activation function and its derivative, and a forward function to calculate the output of all nodes in the layer and apply the activation_function. Please complete the following code to implement the 'Layer' class: - -[source,python] ----- -class Layer: - def __init__(self, num_nodes, input_size, activation='relu'): - # set self.nodes to be a list of Node objects, with length num_nodes - - # check if the activation function is supported (a key in one of the provided dictionaries). if not, raise a ValueError - - # set self.activation_func and self.activation_derivative to the correct functions from the dictionaries - pass - def forward(self, inputs): - # Create an list of the forward pass output of each node in the layer - - # Apply the activation function to the list of outputs and return the result - pass ----- - -You can test your implementation by running the following code: -[source,python] ----- -np.random.seed(11) -layer = Layer(3, 3, activation='linear') -inputs = np.array([1, 2, 3]) -output = layer.forward(inputs) -print(output) # should print [-0.27638665 -3.62878191 1.35732812] ----- - -.Deliverables -==== -- Completed Layer class -- Output of the testing code -==== - -=== Question 3 (2 points) - -Now that our Node and Layer class are correct, we can move on to implementing the 'MLP' class. This class will store our list of layers, a forward function to calculate output of the model, a train function to train the model, and a test function to evaluate the model using our evaluation metrics. In this question, we will implement just the initialization, forward, and test functions. Please begin completing the following 'MLP' class outline: - - -[source,python] ----- - -class MLP: - def __init__(self, layer_sizes, activations): - # we are given 'layer_sizes', a list of numbers, where each number is the number of nodes in the layer. - # The first layer should be the number of features in the input data - # We only need to create the hidden and output layers, as the input layer is simply our input data - # For example, if layer_sizes = [4, 5, 2], we should set self.layers = [Layer(5, 4), Layer(2, 5)] - # Additionally, we are given 'activations', a list of strings, where each string is the name of the activation function for the corresponding layer - # len(activations) will always be len(layer_sizes) - 1, as the input layer does not have an activation function - - # Please set self.layers to be a list of Layer objects, with the correct number of nodes, input size, and activation function. 
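        # (Illustrative example: layer_sizes = [4, 5, 2] with activations = ['relu', 'linear']
        # would mean self.layers = [Layer(5, 4, 'relu'), Layer(2, 5, 'linear')].)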
- pass - - def forward(self, inputs): - # for each layer in the MLP, call the forward method with the output of the previous layer - # then, return the final output - pass - - def train(self, X, y, epochs=100, learning_rate=0.0001): - for epoch in range(epochs): - for i in range(len(X)): - # Store the output of each layer in a list, starting with the input data - # You should have a list that looks like [X[i], layer1_output, layer2_output, ..., outputlayer_output] - - # find the error, target value - output value - - - # Now, we can perform our backpropagation to update the weights and biases of our model - # We need to start at the last layer and work our way back to the first layer - for j in reversed(range(len(self.layers))): - # get the layer object at index j - - # get the layer_input and layer_output corresponding to the layer. Remember, self.layers does not contain the input, but outputs list above does - - # calculate the gradient for each node in the layer - # same as the perceptron model, -error * activation_derivative(layer_output). - # However, this time it is a vector, as we are calculating the activation_derivative for everything in the layer at once - - - # Now, we must update the error for the next layer. - # This is so that we can calculate the gradient for the next layer - # This is done by taking the dot product of our gradients by the weights of each node in the current layer - - # Finally, we can update the weights and biases of each node in the current layer - # Remember, our gradient is a list, so each node in the layer will have its own corresponding gradient - # Otherwise, the process is the same as the perceptron model. - for k, node in enumerate(layer.nodes): - # update the weights and bias of the node - pass - - def test(self, X, y, methods=['mse', 'rmse', 'mae', 'r_squared']): - # Calculate metrics for each method - # First, get the predictions for each input in X - - # Then, for each method the user wants, call the corresponding function with input y and predictions - - # Finally return a dictionary with the metric as key and the result as value - - pass ----- - -To test your implementation of the initialization, forward, and test functions, you can run the following code: -[source,python] ----- -np.random.seed(11) -mlp = MLP([3, 4, 2], ['relu', 'linear']) -inputs = np.array([1, 2, 3]) -output = mlp.forward(inputs) -print(output) # should print [-1.77205771 -0.04217909] - -X = np.array([[1, 2, 3], [4, 5, 6]]) -y = np.array([[0, 1], [1, 0]]) - -metrics = mlp.test(X, y) -print(metrics) # should print {'mse': 2.698031745867772, 'rmse': 1.6425686426654358, 'mae': 1.6083905323341714, 'r_squared': -9.792126983471087} ----- -.Deliverables -==== -- Implementation of the MLP class '__init__', 'forward', and 'test' methods -- Output of the testing code -==== - -=== Question 4 (2 points) - -Now that we have all of our helper functions, we can work on training our model. This process will be very similar to the perceptron model we implemented in the previous project, but with a few key differences. Please read the helping comments in the 'train' method of the 'MLP' class and complete the code to train the model. - -To test your implementation, we will do 2 things: - -Firstly, we will test our MLP model as just a single perceptron, with the same parameters and starting weights as Questions 4 and 5 in the previous project. If everything is implemented correctly, the output of the perceptron last project and the single perceptron MLP here should be the same. 
-[source,python] ----- -np.random.seed(3) -mlp = MLP([X_train.shape[1], 1], ['linear']) -mlp.layers[0].nodes[0].weights = np.zeros(X_train.shape[1]) -mlp.train(X_train, y_train, epochs=100, learning_rate=0.01) -print(mlp.layers[0].nodes[0].weights) # should print the same weights as the perceptron model -print(mlp.layers[0].nodes[0].bias) # should print the same bias as the perceptron model -mlp.test(X_test, y_test) # should print the same metrics as the perceptron model ----- - - -Next, we can test our MLP model with multiple nodes and layers. - -[NOTE] -==== -Now that we have multiple nodes and layers, these code cells may take a while to run. Please be patient and give yourself enough time to run these tests. -==== - -[source,python] ----- -np.random.seed(3) -mlp = MLP([X_train.shape[1], 2, 3, 1], ['linear','linear','linear']) -mlp.train(X_train, y_train, epochs=1000, learning_rate=0.0001) -mlp.test(X_test, y_test) # should output {'mse': 17.78775654565155, 'rmse': 4.217553383853197, 'mae': 3.2032070058415836, 'r_squared': 0.7119015806656752} ----- - -.Deliverables -==== -- Implementation of the 'train' method in the 'MLP' class -- Output of perceptron model testing code -- Output of MLP model testing code -==== - -=== Question 5 (2 points) - -If you remember from the previous project, with only a single perceptron there is a limit to the how we can try to improve the model. We can train it for more epochs, or adjust its learning rate. Additionally, we can investigate how SGD, BGD, and MBGD affect its training, but there isn't much beyond that. However, now that we have an MLP model, we can experiment with different numbers of layers, nodes in each layer, the activation functions of those layers, as well as the learning rate, number of epochs, and SGD vs BGD vs MBGD. - -Please experiment with different numbers of layers, number of nodes in each layer, activation functions, learning rates, number of epochs, and/or SGD vs BGD vs MBGD. For this question, please provide a brief summary of what you tried, and what you noticed. You are not required to try and improve the metrics of the model, but you are welcome to try if you would like. - -[IMPORTANT] -==== -This model is VERY sensitive to the learning rate, number of epochs, and the number of nodes in each layer. If you are not seeing any improvement in your metrics, try adjusting these parameters. Additionally, the model may take a long time to train, so please give yourself enough time to experiment with different parameters. It is recommended to have a maximum of 3 hidden layers (not including the input and output layers) and a maximum of 10 nodes in each layer to ensure your model trains in a reasonable amount of time. -A common problem you may face is the vanishing gradient and exploding gradient problem. This is when the gradients of the weights become very small or large, respectively, and the model is unable to learn. You will know you have exploding gradients if your outputs become nan, inf, or some extremely large number. You may have vanishing gradients if your model seems to not be learning at all. Learning rate and number of epochs are the most common ways to combat these problems, but you may also need to experiment with different activation functions and the number of nodes and layers. 
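
One quick, illustrative way to spot exploding gradients (assuming your trained model object is named `mlp`) is to check the predictions for `nan` or `inf` values:

[source,python]
----
preds = np.array([mlp.forward(x) for x in X_test])
print(np.isnan(preds).any() or np.isinf(preds).any())  # True suggests exploding gradients
----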
-==== - -.Deliverables -==== -- Student has some code that shows them adjusting parameters and experimenting with different configurations -- Student has a brief summary of what they tried and what they noticed -==== - -=== Question 6 (2 points) - -Currently, we are simply filling the weights of our nodes with random values. However, depending on the activation function of the layer, we may want to initialize our weights differently to help promote model convergence and avoid potential gradient problems. There are many different weight initialization methods depending on the activation function, however there are 2 extremely popular choices: Xavier Initialization and He Initialization. These methods are described below: - -[cols="4,4,4", options="header"] -|=== -| Initialization Method | Description | Formula -| Xavier | Commonly used for tanh and sigmoid activation functions to help ensure that the variance is maintained throughout the model | $W =np.random.normal(0, np.sqrt(2/(input\_size+output\_size)), input\_size)$ -| He | Used for ReLU based activation functions to ensure that they do not vanish | $W = np.random.normal(0, np.sqrt(2/inputs), inputs)$ -|=== - -[NOTE] -The form of Xavier depicted above is for a normal distribution. However, there also exists a uniform distribution version of Xavier Initialization, with the formula $W = np.random.uniform(-\sqrt{6/(input\_size+output\_size)}, \sqrt{6/(input\_size+output\_size)}, input\_size)$. You are not required to implement this version, but you are welcome to if you would like. - -Please modify the 3 main classes to be able to change the initialization function of the weights. The MLP class will now take 3 lists as input: 'layer_sizes', 'activations', and 'initializations'. 'initializations' will be a list of strings, where each string is the name of the initialization function for the corresponding layer. The valid values for this list should be 'random', 'xavier', and 'he'. You will need to modify the 'Node' class to accept an initialization method, and modify the 'Layer' class to pass this method to the 'Node' class. You will also need to modify the 'MLP' class to pass the initialization method to the 'Layer' class. - -After you have implemented this, run the below code to visualize the distributions of the weights to confirm that they are being initialized correctly. -[source,python] ----- -np.random.seed(1) -initialized_mlp = MLP([80,80,80,80], ['relu','relu','relu'], ['random','xavier','he']) - -original_random = initialized_mlp.layers[0].nodes[0].weights -xavier = initialized_mlp.layers[1].nodes[0].weights -he = initialized_mlp.layers[2].nodes[0].weights - -import matplotlib.pyplot as plt - -plt.hist(original_random, bins=50, alpha=0.5, label='Random') -plt.hist(xavier, bins=50, alpha=0.5, label='Xavier') -plt.hist(he, bins=50, alpha=0.5, label='He') -plt.legend(loc='upper right') -plt.show() ----- - -.Deliverables -==== -- Implementation of the 'initializations' parameter in the 'MLP' class -- Modification of the 'Node' and 'Layer' classes to accept and pass the initialization method -- Output of the testing code -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project11.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. 
**Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project12.adoc b/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project12.adoc deleted file mode 100644 index 1134415d9..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project12.adoc +++ /dev/null @@ -1,231 +0,0 @@ -= 401 Project 12 - Regression: Bayesian Ridge Regression -:page-mathjax: true - -== Project Objectives - -In this project, we will be exploring Bayesian Ridge Regression using the scikit-learn library. We will use the beer review dataset to implement Bayesian Ridge Regression and evaluate the performance of the model using various metrics. - -.Learning Objectives -**** -- Understand the concept of Bayesian Ridge Regression -- Implement Bayesian Ridge Regression using scikit-learn -- Evaluate the performance of a Bayesian Ridge Regression model on the beer review dataset -**** - -== Supplemental Reading and Resources - -- https://medium.com/intuition/gentle-introduction-of-bayesian-linear-regression-c83da6b0d1f7[Medium Article on Bayesian Linear Regression] - -== Dataset - -- `/anvil/projects/tdm/data/beer/reviews_sample.csv` - -== Questions - -=== Question 1 (2 points) - -Bayes Theorem is a fundamental theorem in probability theory. Bayes Theorem allows us to invert conditional probabilities, e.g., if we know the probability of event A given event B occured, we can calculate the probability of event B given event A occured. This theorem can be used in machine learning to estimate the probability of a model parameter given the data. Traditionally, our model parameters are estimated by minimizing our loss function. However, with Bayesian Ridge Regression, the model parameters are treated as random variables, and the posterior distribution of the model parameters is estimated. This allows the model to not only make predictions, but also provide a measure of uncertainty in its predictions. Due to the heavy mathematical nature of Bayesian Ridge Regression, we will not be writing it from scratch in this project. Instead, we will be using the scikit-learn library to implement it. If you would like to learn about the mathematical details of this model, please read the supplemental reading. - -Firstly, let's load the beer reviews sample data. - -[source,python] ----- -import pandas as pd -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import train_test_split -df = pd.read_csv('/anvil/projects/tdm/data/beer/reviews_sample.csv') - -df.dropna(subset=['look','smell','taste','feel','overall', 'score'], inplace=True) -X = df[['look','smell','taste','feel', 'overall']] -y = df[['score']] - - -scaler = StandardScaler() -X = scaler.fit_transform(X) - -X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7) - -y_train = y_train.to_numpy().ravel() -y_test = y_test.to_numpy().ravel() ----- - -Additionally, load our metric functions by running the code below. 
-[source,python] ----- -import numpy as np -def get_mse(y, y_pred): - return np.mean((y - y_pred) ** 2) - -def get_rmse(y, y_pred): - return np.sqrt(get_mse(y,y_pred)) - -def get_mae(y, y_pred): - return np.mean(np.abs(y - y_pred)) - -def get_r_squared(y, y_pred): - return 1 - np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2) ----- - -Next, we will create an instance of scikit-learn's `BayesianRidge` class and fit the model to the training data. We will then use the model to make predictions on the test data. -[source,python] ----- -from sklearn.linear_model import BayesianRidge - -model = BayesianRidge() -model.fit(X_train, y_train) - -y_pred = model.predict(X_test) ----- - -Please calculate and print the mean squared error of the Bayesian Ridge Regression model on the test data, and also the output of the RMSE of the model. - -.Deliverables -==== -- Mean Squared Error of the Bayesian Ridge Regression model on the test data -- Output of the RMSE of the model -==== - -=== Question 2 (2 points) - -A powerful ability of the Bayesian Ridge Regression model, as mentioned earlier, is its ability to provide uncertainty estimates in its predictions. Through scikit-learn, we can access the standard deviation of the posterior distribution of the model parameters. From this, we know the uncertainty in our prediction, allowing us to graph confidence intervals around our predictions. - -Firstly, train the Bayesian Ridge Regression model with only the 'smell', 'taste', and 'feel' columns as our features, using the below code: -[source,python] ----- -X_train_3 = X_train[:,[1,2,3]] -X_test_3 = X_test[:,[1,2,3]] - -model = BayesianRidge() -model.fit(X_train_3, y_train) ----- - -Now, write code to get the y_predictions on the test set, and graph the y_test values and y_pred values on a single graph using matplotlib (you should be familiar with this syntax from previous projects, look back to those if you need a refresher). -[IMPORTANT] -==== -If we leave these unsorted and try to graph it, it will be a complete mess due to the points be randomly selected by the train_test_split function. By sorting one of the arrays, we can graph the points in a more orderly fashion. You can run the below code to sort both the y_test and y_pred arrays based on the y_test values from smallest to largest. -==== -[source,python] ----- -# sort the y_test array from smallest to largest, and use that order to sort the y_pred array -y_test_sorted = y_test[np.argsort(y_test)] -y_pred_sorted = y_pred[np.argsort(y_test)] ----- - -You may notice that the graph is a bit messy as the predictions are not perfect. To get a better visualization, we can overlay our confidence intervals on the graph. A confidence interval is a range of values that is some percentage likely to contain the true value. For example, a 95% confidence interval around a predicted value means that we are 95% confident that the true value lies within that range. The number of standard deviations away from the mean (or predicted value) determines the confidence level. Below is a table of the number of standard deviations and their corresponding confidence levels. Additionally, you can use the following formula to calculate the number of standard deviations away from the mean for a given confidence level: -[cols="2,2", options="header"] -|==== -|Number of Standard Deviations | Confidence Level -|1 | 68.27% -|2 | 95.45% -|3 | 99.73% -|4 | 99.994% -|==== - -How do we get these confidence levels from the model? 
scikit_learn makes it very easy, by providing an optional argument to the `predict` method. By setting the `return_std` argument to True, the predict method will return a tuple of the list of predictions and a list of the standard deviations for each prediction. Then, we can use the standard deviations to calculate the confidence intervals. - -In order to graph the confidence intervals, you will need to calculate the upper and lower bounds of the confidence interval for each prediction. Then, you can use the matplotlib `fill_between` function to fill in the area between the upper and lower bounds. Please graph the y_test values and the 68.27% confidence intervals of the y_pred values on the same graph. - -.Deliverables -==== -- Graph of the y_test values against the y_pred values -- Graph displaying the y_test values and the 68.27% confidence intervals of the y_pred values -==== - -=== Question 3 (2 points) - -Now that you know how to use the Bayesian Ridge Regression model to get uncertainty estimates in your predictions, let's see how changing other model parameters can affect both our model's performance and uncertainty. The `BayesianRidge` class has several parameters that can be tuned to improve the model's performance. A list of these parameters can be found in the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html. For the next 3 questions, we will be exploring the following parameters: -Question 3: `n_iter` - The number of iterations to run the optimization algorithm. The default value is 300. -Question 4: `alpha_1` and `alpha_2 - The shape and inverse scale parameters for the Gamma distribution prior over the alpha parameter. The default values are 1e-6. -Question 5: `lambda_1` and `lambda_2` - The shape and inverse scale parameters for the Gamma distribution prior over the lambda parameter. The default values are 1e-6. - -For this question, please generate 5 models with different values of `n_iter` ranging from 100 to 500 in increments of 100. For each model, train the model on the training data and calculate the RMSE on the test data. Please print the RMSE for each model. Then, plot the y_test values vs the 95.45% confidence intervals of the y_pred values for all models. Graph each confidence interval as a different color on the same graph. - -.Deliverables -==== -- RMSE for each model -- Graph of the y_test values and the 95.45% confidence intervals of the y_pred values for all models -- How does the n_iter parameter affect the model's rmse and uncertainty? -==== - -=== Question 4 (2 points) - -For this question, please select 5 different `alpha_1` values. Then, for each of these values, train the model on the training data and calculate the RMSE on the test data. Please print the RMSE for each model. Then, plot the y_test values vs the 95.45% confidence intervals of the y_pred values for all models. Graph each confidence interval as a different color on the same graph. Do the same for the `alpha_2` parameter. - -.Deliverables -==== -- RMSE for each model with a different alpha_1 value -- RMSE for each model with a different alpha_2 value -- Graph of the y_test values and the 95.45% confidence intervals of the y_pred values for each model with a different alpha_1 value -- Graph of the y_test values and the 95.45% confidence intervals of the y_pred values for each model with a different alpha_2 value -- How do the alpha_1 and alpha_2 parameters affect the model's rmse and uncertainty? 
-==== - -=== Question 5 (2 points) - -For this question, please select 5 different `lambda_1` values. Then, for each of these values, train the model on the training data and calculate the RMSE on the test data. Please print the RMSE for each model. Then, plot the y_test values vs the 95.45% confidence intervals of the y_pred values for all models. Graph each confidence interval as a different color on the same graph. Do the same for the `lambda_2` parameter. - -.Deliverables -==== -- RMSE for each model with a different lambda_1 value -- RMSE for each model with a different lambda_2 value -- Graph of the y_test values and the 95.45% confidence intervals of the y_pred values for each model with a different lambda_1 value -- Graph of the y_test values and the 95.45% confidence intervals of the y_pred values for each model with a different lambda_2 value -- How do the lambda_1 and lambda_2 parameters affect the model's rmse and uncertainty? -==== - -=== Question 6 (2 points) - -Now that you've seen how changing the model parameters can affect the model's performance and uncertainty, write a function that finds the best model parameters for the Bayesian Ridge Regression model. The function should take in a list of valid values for each parameter and returns a tuple of the best parameters and the RMSE of the model with those parameters. You can use the `itertools.product` function to generate all possible combinations of the parameters. The function should also be able to choose the "best" based on an input metric from the common metrics used before (mse, rmse, mae, r_squared), or uncertainty. You can use model.sigma_ to get the standard deviation of the posterior distribution of the model parameters as a measure of uncertainty for the model as a whole. - -The function should look something like this: -[source,python] ----- -def get_best_params(n_iter_values, alpha_1_values, alpha_2_values, lambda_1_values, lambda_2_values, metric='rmse'): - # your code here - combinations = # your code here - best_params = None - best_value = None - for params in combinations: - # train the model with the current parameters - - # calculate the metric value - - - # check if the current model is better than the best model so far - # if it is, update the best model and best value - - return best_params ----- - -To test if your function works, please run the function with the following parameters: -[source,python] ----- -n_iter_values = [100, 200, 300, 400, 500] -alpha_1_values = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2] -alpha_2_values = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2] -lambda_1_values = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2] -lambda_2_values = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2] - -print(get_best_params(n_iter_values, alpha_1_values, alpha_2_values, lambda_1_values, lambda_2_values, metric='rmse')) -print(get_best_params(n_iter_values, alpha_1_values, alpha_2_values, lambda_1_values, lambda_2_values, metric='uncertainty')) -print(get_best_params(n_iter_values, alpha_1_values, alpha_2_values, lambda_1_values, lambda_2_values, metric='r_squared')) ----- - -.Deliverables -==== -- Output of the best parameters for the Bayesian Ridge Regression model based on RMSE, uncertainty, and r_squared -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project12.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. 
A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project13.adoc b/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project13.adoc deleted file mode 100644 index 1dec8124a..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project13.adoc +++ /dev/null @@ -1,341 +0,0 @@ -= 401 Project 13 - Hyperparameter Tuning -:page-mathjax: true - -== Project Objectives - -In this project, we will be exploring different methods for hyperparameter tuning, and applying them to models from previous projects. - -.Learning Objectives -**** -- Understand the concept of hyperparameters -- Learn different methods for hyperparameter tuning -- Apply hyperparameter tuning to a Random Forest Classifier and a Bayesian Ridge Regression model -**** - -== Supplemental Reading and Resources - -== Dataset - -- `/anvil/projects/tdm/data/iris/Iris.csv` -- `/anvil/projects/tdm/data/boston_housing/boston.csv` - -== Questions - -=== Question 1 (2 points) - -Hyperparameters are parameters that are not learned by the model, but are rather set by the user before training. They include parameters you should be familiar with, such as the learning rate, the number of layers in a neural network, the number of trees in a random forest, etc. There are many different methods for tuning hyperparameters, and in this project we will explore a few of the common methods. - -[NOTE] -==== -Typically, hyperparameter tuning would be performed with a small subset of the data, and the best or top n models would be selected for further evaluation on the full dataset. For the purposes of this project, we will be using the full dataset for hyperparameter tuning, as the dataset is small enough to do so. -==== - -Firstly, let's load both the iris dataset and boston housing dataset. 
- -[source,python] ----- -import pandas as pd -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import train_test_split - -df = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv') -X = df.drop(['Species','Id'], axis=1) -y = df['Species'] -scaler = StandardScaler() -X_scaled = scaler.fit_transform(X) -X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(X_scaled, y, test_size=0.2, random_state=20) -y_train_iris = y_train_iris.to_numpy() -y_test_iris = y_test_iris.to_numpy() - -df = pd.read_csv('/anvil/projects/tdm/data/boston_housing/boston.csv') -X = df.drop(columns=['MEDV']) -y = df[['MEDV']] -scaler = StandardScaler() -X = scaler.fit_transform(X) -X_train_boston, X_test_boston, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7) -y_train_boston = y_train.to_numpy() -y_test_boston = y_test.to_numpy() -y_train_boston = y_train_boston.ravel() -y_test_boston = y_test_boston.ravel() ----- - -Random search is a hyperparameter optimization algorithm that, as you might guess from the name, randomly selects hyperparameters to evaluate from a range of possible values. This algorithm is very simple to implement, but greatly lacks the efficiency of other search algorithms. It is often used as the baseline to compare more advanced algorithms against. - -The algorithm is as follows: - -1. Define a set of values for each hyperparameter to search over -2. For a set number of iterations, randomly select values from the set for each hyperparameter -3. Train the model with the selected hyperparameter values -4. Evaluate the model -5. Repeat steps 2-4 for the specified number of iterations -6. Pick the best model - -For this question, you will implement the following function: -[source,python] ----- -import numpy as np -def random_search(model, param_dict, X_train, y_train, X_test, y_test, n_iter=10): - np.random.seed(2) - # Initialize best score - best_score = -np.inf - best_model = None - best_params = None - - # Loop over number of iterations - for i in range(n_iter): - # Randomly select hyperparameters with np.random.choice. for each param: valid_choices_list pair in param_dict - # this should result in a dictionary of hyperparameter: value pairs - params = {} - # your code here to fill in params - - # Create model with hyperparameters - model.set_params(**params) - - # Train model with model.fit - # your code here - - # Evaluate model with model.score - # your code here - - # Update best model if necessary - # your code here - - return best_model, best_params, best_score ----- - -After creating the function, run the following test cases to ensure that your function is working correctly. 
-[source,python] ----- -# Test case 1 with iris dataset -from sklearn.linear_model import BayesianRidge -model = BayesianRidge() -param_dict = {'max_iter': [100, 200, 300, 400, 500], 'alpha_1': [1e-6, 1e-5, 1e-4, 1e-3, 1e-2], 'alpha_2': [1e-6, 1e-5, 1e-4, 1e-3, 1e-2], 'lambda_1': [1e-6, 1e-5, 1e-4, 1e-3, 1e-2], 'lambda_2': [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]} - -best_model, best_params, best_score = random_search(model, param_dict, X_train_boston, y_train_boston, X_test_boston, y_test_boston, n_iter=10) -# print the best parameters and score -print(best_params) -print(best_score) - -# Test case 2 with boston housing dataset -from sklearn.ensemble import RandomForestClassifier -model = RandomForestClassifier() -param_dict = {'n_estimators': [10, 50, 100, 200, 500], 'max_depth': [None, 5, 10, 20, 50], 'min_samples_split': [2, 5, 10, 20, 50], 'min_samples_leaf': [1, 2, 5, 10, 20]} -best_model, best_params, best_score = random_search(model, param_dict, X_train_iris, y_train_iris, X_test_iris, y_test_iris, n_iter=10) -# print the best parameters and score -print(best_params) -print(best_score) ----- - -.Deliverables -==== -- Outputs of running test cases for Random Search -==== - -=== Question 2 (2 points) - -Grid search is another hyperparameter optimization algorithm that is more systematic than random search. It evaluates all possible combinations of hyperparameters within a specified range. This algorithm is very simple to implement, but can be computationally expensive, especially with a large number of hyperparameters and values to search over. - -[NOTE] -==== -This is what you implemented in question 6 of project 12, so use that as a reference for this question if needed. -==== - -The algorithm is as follows: -1. Compute every combination of hyperparameters -2. Train the model with a combination -3. Evaluate the model -4. Repeat steps 2-3 for every combination -5. Pick the best - -[source,python] ----- -from itertools import product - -def grid_search(model, param_dict, X_train, y_train, X_test, y_test, n_iter=10): - # Initialize best score - best_score = -np.inf - best_model = None - best_params = None - - # find every combination and store it as a list - # HINT: if you put * before a list, it will unpack the list into individual arguments - combinations = # your code here - - # now that we have every combination of values, repack it into a list of dictionaries (param: value pairs) using zip - param_combinations = # your code here - - # Loop over every combination - for params in param_combinations: - # Create model with hyperparameters - model.set_params(**params) - - # Train model with model.fit - # your code here - - # Evaluate model with model.score - # your code here - - # Update best model if necessary - # your code here - - return best_model, best_params, best_score ----- - -After creating the function, run the following test cases to ensure that your function is working correctly. 
-[source,python] ----- -# Test case 1 with iris dataset -from sklearn.linear_model import BayesianRidge -model = BayesianRidge() -param_dict = {'max_iter': [100, 200, 300], 'alpha_1': [1e-6, 1e-5, 1e-4], 'alpha_2': [1e-6, 1e-5, 1e-4], 'lambda_1': [1e-6, 1e-5, 1e-4], 'lambda_2': [1e-6, 1e-5, 1e-4]} - -best_model, best_params, best_score = grid_search(model, param_dict, X_train_boston, y_train_boston, X_test_boston, y_test_boston, n_iter=10) -# print the best parameters and score -print(best_params) -print(best_score) - -# Test case 2 with boston housing dataset -from sklearn.ensemble import RandomForestClassifier -model = RandomForestClassifier() -param_dict = {'n_estimators': [100, 200, 500], 'max_depth': [10, 20, 50], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 5]} -best_model, best_params, best_score = grid_search(model, param_dict, X_train_iris, y_train_iris, X_test_iris, y_test_iris, n_iter=10) -# print the best parameters and score -print(best_params) -print(best_score) ----- - -.Deliverables -==== -- Outputs of running test cases for Grid Search -==== - -=== Question 3 (2 points) - -Bayesian optimization is a more advanced hyperparameter optimization algorithm that uses a probabilistic model to predict the performance of a model with a given set of hyperparameters. It then uses this model to select the next set of hyperparameters to evaluate. This algorithm is more efficient than random search and grid search, but significantly more complex to implement. - -The algorithm is as follows: -1. Define a search space for each hyperparameter to search over -2. Define an object function that takes hyperparameters as an input and scores the model (set_params, fit, score) -3. Run the optimization algorithm to find the best hyperparameters - -For this question, we will be using scikit-optimize, a library designed for model-based optimization in python. Please run the following code cell to install the library. -[source,python] ----- -pip install scikit-optimize ----- - -[NOTE] -==== -You may need to restart the kernel after the installation is complete. -==== - -For this question, you will implement the following function: -[source,python] ----- -from skopt import gp_minimize -from skopt.space import Real, Integer -from skopt.utils import use_named_args - -def bayesian_search(model, param_dict, X_train, y_train, X_test, y_test, n_iter=10): - # For each hyperparameter in param_dict, we need to create a Real or Integer object and add it to the space list. - # both of these classes have the following parameters: low, high, name. Real is for continuous hyperparameters that have floating point values, and Integer is for discrete hyperparameters that have integer values. - # so, for example, if {'max_iter': (1,500), 'alpha_1': (1e-6,1e-2)} is passed in for param_dict: - # We should create an Integer(low=1, high=500, name='max_iter') object for the first param, as it uses integer values - # and a Real(low=1e-6, high=1e-2, name='alpha_1') object for the second param, as it uses floating point values - # - # All of these objects should be added to the space list - - space = [] - # your code here - - # Define the objective function - @use_named_args(space) - def objective(**params): - # Create model with hyperparameters - model.set_params(**params) - - # Train model with model.fit - # your code here - - # Evaluate model with model.score - # your code here - - # as this is a minimization algorithm, it thinks lower scores are better. 
Therefore, we need to return the negative score
        return -score

    # Run the optimization
    res = gp_minimize(objective, space, n_calls=n_iter, random_state=0)

    # Get the best parameters
    best_params = dict(zip(param_dict.keys(), res.x))
    best_score = -res.fun

    return model, best_params, best_score
----

After creating the function, run the following test cases to ensure that your function is working correctly.

[source,python]
----
# Test case 1 with the boston housing dataset
from sklearn.linear_model import BayesianRidge
model = BayesianRidge()
param_dict = {'max_iter': (1,50), 'alpha_1': (1e-6,1e-2), 'alpha_2': (1e-6,1e-2), 'lambda_1': (1e-6,1e-2), 'lambda_2': (1e-6,1e-2)}

best_model, best_params, best_score = bayesian_search(model, param_dict, X_train_boston, y_train_boston, X_test_boston, y_test_boston, n_iter=10)
# print the best parameters and score
print(best_params)
print(best_score)

# Test case 2 with the iris dataset
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
# note: min_samples_split must be at least 2, so the search range starts at 2
param_dict = {'n_estimators': (100,500), 'max_depth': (5,50), 'min_samples_split': (2,20), 'min_samples_leaf': (1,10)}
best_model, best_params, best_score = bayesian_search(model, param_dict, X_train_iris, y_train_iris, X_test_iris, y_test_iris, n_iter=10)
# print the best parameters and score
print(best_params)
print(best_score)
----

.Deliverables
====
- Outputs of running test cases for Bayesian Search
====

=== Question 4 (2 points)

Now that we have implemented these three hyperparameter tuning algorithms, let's compare their performance. For this question, please apply all three tuning algorithms to a Bayesian Ridge Regression model on the boston housing dataset. In addition to their scores, please also compare the time it takes to run each algorithm. Graph these results using 2 bar charts, one for score and one for time.

[NOTE]
====
The Bayesian Ridge Regression model will have a very similar accuracy for all three tuning algorithms. Please set the y-axis of the score plot to range from 0.690515 to 0.690517 with `axis.set_ylim(0.690515, 0.690517)`.
====

.Deliverables
====
- Bar charts displaying the scores and times for each hyperparameter tuning algorithm
====

=== Question 5 (2 points)

There are still many other hyperparameter tuning methods that we have not explored. For example, you could use a more complex grid search, a more advanced Bayesian optimization algorithm, or even a genetic algorithm. For this question, please identify, explain, and implement another hyperparameter tuning algorithm. Repeat your code from question 4, but include the new algorithm. How does this algorithm compare to the other three?

.Deliverables
====
- Explanation of your new hyperparameter tuning algorithm
- Bar charts displaying the scores and times for each hyperparameter tuning algorithm, including the new algorithm
- Explanation of how the new algorithm compares to the other three
====

== Submitting your Work

.Items to submit
====
- firstname_lastname_project13.ipynb
====

[WARNING]
====
You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission].
- -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project14.adoc b/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project14.adoc deleted file mode 100644 index 2f8b1f6ac..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project14.adoc +++ /dev/null @@ -1,89 +0,0 @@ -= TDM 40100: Project 14 -- 2024 - -**Motivation:** We covered a _lot_ this semester, including machine learning, classifiers, regression, and neural networks. We hope that you have had the opportunity to learn a lot, and to improve your data science skills. For our final project of the semester, we want to provide you with the opportunity to give us your feedback on how we connected different concepts, built up skills, and incorporated real-world data throughout the semester, along with showcasing the skills you learned throughout the past 13 projects! - -**Context:** This last project will work as a consolidation of everything we've learned thus far, and may require you to back-reference your work from earlier in the semester. - -**Scope:** reflections on Data Science learning - -.Learning Objectives: -**** -- Reflect on the semester's content as a whole -- Offer your thoughts on how the class could be improved in the future -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 (2 pts) - -The Data Mine team is writing a Data Mine book to be (hopefully) published in 2025. We would love to have a couple of paragraphs about your Data Mine experience. What aspects of The Data Mine made the biggest impact on your academic, personal, and/or professional career? Would you recommend The Data Mine to a friend and/or would you recommend The Data Mine to colleagues in industry, and why? You are welcome to cover other topics too! Please also indicate (yes/no) whether it would be OK to publish your comments in our forthcoming Data Mine book in 2025. - -.Deliverables -==== -Feedback and reflections about The Data Mine that we can potentially publish in a book in 2025. -==== - -=== Question 2 (2 pts) - -Reflecting on your experience working with different projects, which one did you find most enjoyable, and why? Illustrate your explanation with an example from one question that you worked on. - -.Deliverables -==== -- A markdown cell detailing your favorite project, why, and a working example and question you did involving that project. -==== - -=== Question 3 (2 pts) - -While working on the projects, how did you validate the results that your code produced? Are there better ways that you would suggest for future students (and for our team too)? Please illustrate your approach using an example from one problem that you addressed this semester. - -.Deliverables -==== -- A few sentences in a markdown cell on how you conducted your work, and a relevant working example. 
-==== - -=== Question 4 (2 pts) - -Reflecting on the projects that you completed, which question(s) did you feel were most confusing, and how could they be made clearer? Please cite specific questions and explain both how they confused you and how you would recommend improving them. - -.Deliverables -==== -- A few sentences in a markdown cell on which questions from projects you found confusing, and how they could be written better/more clearly, along with specific examples. -==== - -=== Question 5 (2 pts) - -Please identify 3 skills or topics related to ML, classifiers, regression, neural networks, etc., or data science (in general) that you wish we had covered in our projects. For each, please provide an example that illustrates your interests, and the reason that you think they would be beneficial. - -.Deliverables -==== -- A markdown cell containing 3 skills/topics that you think we should've covered in the projects, and an example of why you believe these topics or skills could be relevant and beneficial to students going through the course. -==== -=== OPTIONAL but encouraged: - -Please connect with Dr Ward on LinkedIn: https://www.linkedin.com/in/mdw333/ - -and also please follow our Data Mine LinkedIn page: https://www.linkedin.com/company/purduedatamine/ - -and join our Data Mine alumni page: https://www.linkedin.com/groups/14550101/ - - - -== Submitting your Work - -If there are any final thoughts you have on the course as a whole, be it logistics, technical difficulties, or nuances of course structuring and content that we haven't yet given you the opportunity to voice, now is the time. We truly welcome your feedback! Feel free to add as much discussion as necessary to your project, letting us know how we succeeded, where we failed, and what we can do to make this experience better for all our students and partners in 2025 and beyond. - -We hope you enjoyed the class, and we look forward to seeing you next semester! - -.Items to submit -==== -- firstname_lastname_project14.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project2.adoc b/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project2.adoc deleted file mode 100644 index 6a7828d04..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project2.adoc +++ /dev/null @@ -1,219 +0,0 @@ -= TDM 40100: Project 02 - Intro to ML - Basic Concepts - -== Project Objectives - -In this project, we will learn how to select an appropriate machine learning model. Understanding specifics of how the models work may help in this process, but other aspects can be investigated for this. 
- -.Learning Objectives -**** -- Learn the difference between classification and regression -- Learn the difference between supervised and unsupervised learning -- Learn how our dataset influences our model selection -**** - -== Supplemental Reading and Resources - -- https://the-examples-book.com/starter-guides/data-science/data-modeling/choosing-model/[DataMine Examples Book - Choosing a Model] -- https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170339559901081[Probabilistic Machine Learning: An Introduction by Kevin Murphy] - -== Datasets - -- `/anvil/projects/tdm/data/iris/Iris.csv` -- `/anvil/projects/tdm/data/boston_housing/boston.csv` -- `/anvil/projects/tdm/data/forest/REF_SPECIES.csv` - -[NOTE] -==== -The Iris dataset is a classic dataset that is often used to introduce machine learning concepts. You can https://www.kaggle.com/uciml/iris[read more about it here]. -If you would like more information on the boston dataset, https://www.kaggle.com/code/prasadperera/the-boston-housing-dataset[please read here]. -==== - -== Questions - -=== Question 1 (2 points) - -In this project, we will use the Iris dataset and the boston dataset as samples to learn about the various aspects that go into choosing a machine learning model. Let's review last project by loading the Iris and boston datasets, then printing the first 5 rows of each dataset. - -.Deliverables -==== -- Output of running code to print the first 5 rows of both datasets. -==== - -=== Question 2 (2 points) - -One of the most distinguishing features of machine learning is the difference between classification and regression. - -Classification is the process of predicting a discrete class label. For example, predicting whether an email is spam or not spam, whether a patient has a disease or not, or if an animal is a dog or a cat. - -Regression is the process of predicting a continuous quantity. For example, predicting the price of a house, the temperature tomorrow, or the weight of a person. - -[NOTE] -==== -Some columns may be misleading. Just because a column is a number does not mean it is a regression problem. One-hot encoding is a technique used to convert categorical variables into numerical variables (we will cover this deeper in future projects). Therefore, it is important to try and understand what a column represents, as just seeing a number does not necessarily mean it corresponds to a continuous quantity. -==== - -Let's look at the `Species` column of the Iris dataset, and the `MEDV` column of the boston dataset. Based on these columns, classify the type of machine learning problem that we would be solving with each dataset. - -Here's a trickier example: If we have an image of some handwritten text, and we want to predict what the text says, would we be solving a classification or regression problem? Why? - -.Deliverables -==== -- Would we likely be solving a classification or regression problem with the `Species` column of the Iris dataset? Why? -- Would we likely be solving a classification or regression problem with the `MEDV` column of the boston dataset? Why? -- Would we likely be solving a classification or regression problem with the handwritten text example? Why? -==== - -=== Question 3 (2 points) - -Another important distinction in machine learning is the difference between supervised and unsupervised learning. - -Supervised learning is the process of training a model on a labeled dataset. 
The model learns to map some input data to an output label based on examples in the training data. The Iris dataset is a great example of a supervised learning problem. Our dataset has columns such as `SepalLengthCm`, `SepalWidthCm`, `PetalLengthCm`, and `PetalWidthCm` that contain information about the flower. Additionally, it has a column labeled `Species` that contains the label we want to predict. From these columns, the model can associate the features of the flower with the labeled species. - -We can think of supervised learning as already knowing the answer to a problem, and working backwards to understand how we got there. For example, if we have a a banana, apple, and grape in front of us, we can look at each fruit and their properties (shape, size, color, etc.) to learn how to distinguish between them. We can then use this information to predict a fruit from just its properties in the future. - -For example, given this table of data: -[cols="3,3,3",options="header"] -|=== -| Color | Size | Label -| Yellow | Small | A -| Red | Medium | B -| Red | Large | B -| Yellow | Medium | A -| Yellow | Large | B -| Red | Small | B -|=== - -You should be able to describe a relationship between the color and size, and the resulting label. If you were told an object is yellow and extra large, what would you predict the label to be? - -[IMPORTANT] -==== -The projects in 30100 and 40100 will focus on supervised learning. From our dataset, there will be a single column we want to predict, and the rest will be used to train the model. The column we want to predict is called the label/target, while the remaining columns are called features. -==== - -Unsupervised learning is the process of training a model on an unlabeled dataset. As opposed to the model trying to predict an output variable, the model instead learns patterns in the data without any guidance. This is often used in clustering problems, eg. a store wants to group items based on how often they are purchased together. Examples of this can be seen commonly in recommendation systems (have you ever noticed how Amazon always seems to know what you want to buy?). - -If we had a dataset of fruits that users commonly purchase together, we could use unsupervised learning to create groups of fruits to recommend to users. We don't need to know the answer for what to recommend to the user beforehand; we are simply looking for patterns in the data. - -For example, given the following dataset of shopping carts: -[cols="3,3,3",options="header"] -|=== -| Item 1 | Item 2 | Item 3 -| Apple | Banana | Orange -| Apple | Banana | Orange -| Apple | Grape | Kiwi -| Banana | Orange | Apple -| Orange | Banana | Apple -| Cantelope | Watermelon | Honeydew -| Cantelope | Apple | Banana -|=== - -We could use unsupervised learning to recommend fruits to users right before they check out. If a user had an orange and banana in their cart, what fruit would we recommend to them? - - -.Deliverables -==== -- Predicted label for an object that is yellow and extra large in the table above. -- What fruit would we recommend to a user who has an orange and banana in their cart? -- Should we use supervised or unsupervised learning if we want to predict the `Species` of some data using the Iris dataset? Why? -==== - -=== Question 4 (2 points) - -Another important tradeoff in machine learning is the flexibility of the model versus the interpretability of the model. - -A model's flexibility is defined by its ability to capture complex relationships within the dataset. 
This can be anything from a simple linear trend to a highly non-linear interaction among many features.

Imagine a simple function `f(x) = 2x`. This function is very easy to interpret: it simply doubles x. However, it is not very flexible, as doubling the input is all it can do. A piecewise function like `f(x) = { x < 5: 2x^2 + 3x + 4, x >= 5: 4x^2 - 7 }` is considered more flexible, because it can model more complex relationships. However, it becomes much more difficult to understand the relationship between the input and output.

We can also see this complexity increase as we increase the number of variables. `f(x)` will typically be more interpretable than `f(x,y)`, which will typically be more interpretable than `f(x,y,z)`. When we get to a large number of variables, e.g. `f(a,b,c,...,x,y,z)`, it can become difficult to understand the impact of each variable on the output. However, a function that captures all of these variables can be very flexible.

Machine learning models can be imagined in the same way. Many factors, including the type of model and the number of features, can impact the interpretability of the model. A function that can accurately capture the relationship between a large number of features and the target variable can be extremely flexible but not understandable to humans. A model that performs some simple function between the input and output may be very interpretable, but as the complexity of that function increases, its interpretability decreases.

An important concept in this regard is the curse of dimensionality. The general idea is that as our number of features (dimensions) increases, the amount of data needed to get a good model increases exponentially. Therefore, it is impractical to have an extreme number of features in our model. Imagine we are given a 2D function `y=f(x)`. Given some points that we plot, we can probably find an approximation of `f(x)` fairly quickly. However, imagine we are given `y=f(x1,x2,x3,x4,x5)`. We would need a lot more points to find an approximation of `f(x1,x2,x3,x4,x5)`, and to understand the relationship between `y` and each of the variables.
Just because we can have a lot of features in our model does not mean we should.

[NOTE]
====
`Black box` is a term often used to describe models that are too complex for humans to easily interpret. Large neural networks can be considered black boxes. Other models, such as linear regression, are easier to interpret. Decision Trees are designed to be interpretable, as they have a very simple structure and you can easily follow along with how they operate and make decisions. These easy-to-interpret models are often called `white box` models.
====

Please print the number of columns in the Iris dataset and the boston dataset. Based purely on the number of columns, would you expect a machine learning model trained on the Iris dataset to be more or less interpretable than a model trained on the boston dataset? Why?

.Deliverables
====
- How many columns are in the Iris dataset?
- How many columns are in the boston dataset?
- Based purely on the number of features, would you expect a machine learning model trained on the Iris dataset to be more or less interpretable than a model trained on the boston dataset? Why?
====

=== Question 5 (2 points)

Parameterization is the idea of approximating a function or model using parameters. If we have some function `f`, and we have examples of `f(x)` for many different `x`, we can find an approximate function to represent `f`.
To make this approximation, we will need to choose some function to represent `f`, along with the parameters of that function. For complex functions, this can be difficult, as we may not understand the relationship between `x` and `f(x)`, or how many parameters are needed to represent this relationship. - -A non-parametrized model does not necessarily mean that the model does not have parameters. However, it means that we don't know how many of these parameters exist or how they are used before training. The model itself will work to figure out what parameters it needs while training on the dataset. This can be visualized with splines, which are a type of curve that can be used to approximate a function. There are also non-parametrized models such as K-Nearest Neighbors Regression, which do not have a fixed number of parameters, and instead learn the function from the data. - -If we have 5 points (x, y) and want to find a function to fit these points, through parameterization we would have a single function with multiple parameters that need to be adjusted to give us the best fit. However, with splines (a form of non-parametrization), we could create a piecewise function, where each piece is a linear function between two points. This function has no parameters, and is created by the model solely based on the data. You can https://the-examples-book.com/starter-guides/data-science/data-modeling/choosing-model/parameterization#splines-as-an-example-of-non-parameterization[read more about splines here]. - -A commonly used non-paramtrized model is k-nearest neighbors, which classifies points by comparing them to existing points in the dataset. In this way, the model does not have any parameters, but instead only learns from the data. - -Linear regression is a parametrized model, where a linear relationship between inputs and output(s) is assumed. The data is then used to identify the values of the parameters to best fit the data. - -[NOTE] -==== -If we already have a good understanding of the data, (eg. we know it to be some linear function or second order polynomial), it is likely best to choose a parametrized model. However, if we don't have an understanding of the data, a non-parametrized model that learns the function from the data may be a better fit. -==== - -To better understand the difference, please run the following code: -[source,python] ----- -import matplotlib.pyplot as plt - -a = [1, 3, 5, 7, 9, 11, 13] -b = [9, 6, 4, 7, 8, 15, 9] -x = [1, 2, 3, 4, 5, 6, 7] - -plt.scatter(x, a, label='Function A') -plt.scatter(x, b, label='Function B') -plt.legend() -plt.xlabel('Feature X') -plt.ylabel('Label y') -plt.show() ----- - -Based on the plots shown, decide if each function would be better approximated by a parametrized or non-parametrized model. - -.Deliverables -==== -- Can you easily describe the relationship between `Feature X` and `Label y` for Function A? If so, what is the relationship? Would you use a parametrized or non-parametrized model to approximate this function? -- Can you easily describe the relationship between `Feature X` and `Label y` for Function B? If so, what is the relationship? Would you use a parametrized or non-parametrized model to approximate this function? -==== - -=== Question 6 (2 points) - -As a practical example, we will take a look at the forest species dataset to determine the different aspects that go into choosing our machine learning model. - -Load the forest species dataset and print the shape of the dataset its first 10 rows. 
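If you would like a starting point, a minimal sketch is shown below. It assumes the `REF_SPECIES.csv` path listed in the Datasets section above; the variable name `forest_df` is only a suggestion.

[source,python]
----
import pandas as pd

# load the forest species reference table
forest_df = pd.read_csv('/anvil/projects/tdm/data/forest/REF_SPECIES.csv')

# print the shape (rows, columns) and the first 10 rows
print(forest_df.shape)
print(forest_df.head(10))
----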
- -Based on what you see from those outputs, please answer the following questions: - -.Deliverables -==== -- Could you solve a regression problem with this dataset? What about a classification problem? What column(s) would you use as the target variable in each case? -- Could you use unsupervised learning on this dataset? Supervised learning? Please explain your answer for each. -- Do you think a model trained on all columns of this dataset would be very interpretable? -- Do you think a parametrized model would work well given the number of features? -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project2.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project3.adoc b/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project3.adoc deleted file mode 100644 index 33c5e3a6b..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project3.adoc +++ /dev/null @@ -1,294 +0,0 @@ -= TDM 40100: Project 03 - Intro to ML - Data Preprocessing - -== Project Objectives - -Learn how to preprocess data for machine learning models. This includes one-hot encoding, scaling, and train-validation-test splitting. - -.Learning Objectives -**** -- Learn how to encode categorical variables -- Learn why scaling data is important -- Learn how to split data into training, validation, and test sets -**** - - -== Dataset - -- `/anvil/projects/tdm/data/fips/fips.csv` -- `/anvil/projects/tdm/data/boston_housing/boston.csv` - -== Questions - -=== Question 1 (2 points) - -The accuracy of a machine learning model depends heavily on the quality of the dataset used to train it. There are several issues you may encounter if you feed raw data into a model. We explore some of these issues in this project, as well as other necessary steps to format data for machine learning. - -The first step in preprocessing data for supervised learning is to split the dataset into input features and target variable(s). This is because, as you should have learned from project 2, supervised learning models require a dataset of input features and their corresponding target variables. By separating our dataset into these components, we are ensuring that our model is learning the relationship between the correct columns. - -Write code to load the fips dataset into a variable called 'fips_df' using pandas and separate it into 2 dataframes: one containing the input features and the other containing the target variable. Let's use the `CLASSFP` column as the target variable and the rest of the columns as input features. 
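One possible way to set this up is sketched below, using the `X`/`y` naming convention described in the note that follows (the column name `CLASSFP` is taken from the instructions above):

[source,python]
----
import pandas as pd

# load the fips dataset
fips_df = pd.read_csv('/anvil/projects/tdm/data/fips/fips.csv')

# target variable: the column we want to predict
y = fips_df['CLASSFP']

# input features: every remaining column
X = fips_df.drop(columns=['CLASSFP'])
----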
- -[NOTE] -==== -Typically, the dataset storing the target variable is denoted by `y` and the dataset storing the input features is denoted by `X`. This is not required, but it is a common convention that we recommend following. -==== - -To confirm your dataframes are correct, print the columns of each dataframe. - -.Deliverables -==== -- Load the fips dataset using pandas -- Separate the dataset into input features and target variable -- Print the column names of each dataframe -==== - -=== Question 2 (2 points) - -Label encoding is a technique used to change categorical variables into a number format that can be understood by machine learning models. This is necessary for models that require numerical input features (which often is the case). Another benefit of label encoding is that it can decrease the memory usage of the dataset (a single integer value as opposed to a string). - -The basic concept behind how it works is that if there are `n` unique category labels in a column, label encoding will assign a unique integer value to each category label from `0` to `n-1`. - -For example, if we have several colors we can encode them as follows: -|=== -| Color | Encoded Value - -| Red | 0 - -| Green | 1 - -| Blue | 2 - -| Yellow | 3 -|=== -where we have four (n=4) unique colors, so their encoded values range from 0 to 3 (n-1). - -[NOTE] -==== -Label encoding can lead the model to interpret the encoded values as having an order or ranking. In some cases, this is a benefit, such as encoding 'small', 'medium', and 'large' as 0, 1, and 2. However, this can sometimes lead to ordering that is not intended (such as our color example above). This is something to think about when deciding if label encoding is the right choice for a column or dataset. -==== - -Print the first 5 rows from the fips dataset. As you can see, the `CountyName` and `State` columns are categorical variables. If we were to use this dataset for a machine learning model, we would likely need to encode these columns into a numerical format. - -In this question, you will use the `LabelEncoder` class from the `scikit-learn` library to label encode the `CountyName` column from the dataset. - -Fill in and run the following code to label encode the input features that need to be encoded. (This code assumes your input features are stored in a variable called `X`.) -[source,python] ----- -from sklearn.preprocessing import LabelEncoder - -# create a LabelEncoder object -encoder = LabelEncoder() - -# create a copy of the input features to separate the encoded columns -X_label_encoded = X.copy() - -X_label_encoded['COUNTYNAME'] = encoder.fit_transform(X_label_encoded['COUNTYNAME']) ----- - -Now that you have encoded the `COUNTYNAME` column, print the first 5 rows of the X_label_encoded dataset to see the changes. What is the largest encoded value in the `COUNTYNAME` column (i.e., the number of unique counties)? - -[NOTE] -==== -You are not required to use the same variable names (X, X_label_encoded, etc.), but following this convention is strongly recommended. -==== - -.Deliverables -==== -- Print the first 5 rows of the X dataset before encoding -- Label encode the `COUNTYNAME` column in the fips dataset -- Print the first 5 rows of the X_label_encoded dataset after encoding -- Largest encoded value in the `COUNTYNAME` column -==== - -=== Question 3 (2 points) - -As we mentioned last question, label encoding can sometimes lead to undesired hierarchies or ordering with the model. 
A different encoding approach that alleviates this potential issue is one-hot encoding. Instead of simply assigning a unique integer value to each label, one-hot encoding creates a new binary column for each category label. The value in the binary column will be `1` if the category label is present in the original column, and `0` otherwise. By doing this, the model will not interpret the encoded values as being related, but rather as completely separate features.

To give an example, let's look at how we would use one-hot encoding for the color example in the previous question:
|===
| Color | Red | Green | Blue | Yellow

| Red | 1 | 0 | 0 | 0

| Green | 0 | 1 | 0 | 0

| Blue | 0 | 0 | 1 | 0

| Yellow | 0 | 0 | 0 | 1
|===
We have four unique colors, so one-hot encoding gives us four new columns to represent these colors.

The `scikit-learn` library also provides a `OneHotEncoder` class that can be used to one-hot encode categorical variables. In this question, you will use this class to one-hot encode the `STATE` column from the dataset.

First, print the dimensions of the X dataset to see how many rows and columns are in the dataset before one-hot encoding.

Run the following code to one-hot encode the input features that need to be encoded. (This code assumes your input features are stored in a variable called `X`.)
[source,python]
----
from sklearn.preprocessing import OneHotEncoder

# create a OneHotEncoder object
encoder = OneHotEncoder()

# create a copy of the input features to separate the encoded columns
X_encoded = X.copy()

# fit and transform the 'STATE' column
# note: OneHotEncoder expects a 2D input, so the column is selected with double brackets
# additionally, convert the output to an array and then cast it to a DataFrame
encoded_columns = pd.DataFrame(encoder.fit_transform(X[['STATE']]).toarray())

# drop the original column from the dataset
X_encoded = X_encoded.drop(['STATE'], axis=1)

# concatenate the encoded columns
X_encoded = pd.concat([X_encoded, encoded_columns], axis=1)
----

Now that you have one-hot encoded the `STATE` column, print the dimensions of the X_encoded dataset to see the changes. You should see the same number of rows as the original dataset, but with a large number of additional columns for the one-hot encoded variables. Are there any concerns with how many columns were created (hint: think about memory size and the curse of dimensionality)?

.Deliverables
====
- How many rows and columns are in the X_encoded dataset after one-hot encoding?
- How many columns were created during one-hot encoding?
- What are some disadvantages of one-hot encoding?
- When would you use one-hot encoding over label encoding?
====

=== Question 4 (2 points)

For this question, let's switch over to the Boston Housing dataset. Load the dataset into a variable called `boston_df`. Print the first 5 rows of the `CRIM`, `CHAS`, `AGE`, and `TAX` columns. Then, write code to find the mean and range of values for each of these columns.

[NOTE]
====
You can use the `max` and `min` functions to find the maximum and minimum values in a column, respectively. For example, `boston_df['AGE'].max()` will return the maximum value in the `AGE` column.
====

Scaling is another important preprocessing step that is often necessary when working with machine learning models. There are many approaches to this; however, the goal is to ensure that all features are on a similar scale. Two common techniques are normalization and standardization. Normalization adjusts features so that all values fall between 0 and 1.
Standardization adjusts features to a set mean (typically 0) and standard deviation (typically 1). This is important because many machine learning models are sensitive to the scale of the input features. If the input features are on different scales, the model may give more weight to features with larger values, which can lead to poor performance.

As you may guess from the previous 2 questions, the `scikit-learn` library provides a `StandardScaler` class that can be used to scale input features. This class standardizes features to a mean of 0 and a standard deviation of 1.

Run the following code to scale the columns in the Boston dataset. (This code assumes your dataframe is stored in a variable called `boston_df`.)

[source,python]
----
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# scale the CRIM, CHAS, AGE, and TAX columns
X_scaled = scaler.fit_transform(boston_df[['CRIM', 'CHAS', 'AGE', 'TAX']])

# convert X_scaled back into a dataframe
X_scaled = pd.DataFrame(X_scaled, index=boston_df.index, columns=['CRIM', 'CHAS', 'AGE', 'TAX'])
----

Now that you have scaled the input features, print the mean and range of values for the 4 columns after scaling. You should see that the range of values for each column is now similar, and the mean is close to 0.

.Deliverables
====
- Mean and range of values for the `CRIM`, `CHAS`, `AGE`, and `TAX` columns before scaling.
- Mean and range of values for the `CRIM`, `CHAS`, `AGE`, and `TAX` columns after scaling.
- How did scaling the input features affect the mean and range of values?
====

=== Question 5 (2 points)

The final step in preprocessing data for machine learning is to split the dataset into training and testing sets. The training set is the data used to train the model, and the testing set is used to evaluate the model's performance after training.

[NOTE]
====
Oftentimes a validation set is also created to help tune the parameters of the model. This is not required for this project, but you may encounter it in other machine learning projects.
====

Again, scikit-learn provides everything we need. The `train_test_split` function can be used to split the dataset into training and testing sets.

This function takes in the input features and target variable(s), along with the test size, and randomly splits the dataset into training and testing sets. The test size is the fraction of the dataset that will be used for testing. We can also set a random state to ensure reproducibility.

If we withhold too much data for testing, the model may not have enough data to learn from. However, if we withhold too little data, the model may become overfit to the training data, and the limited testing data may not be representative of the model's performance. Typically, a test size of 10-30% is used.

Using our `y` dataframe from Question 1, and the `X_encoded` dataframe from Question 3, split the dataset into training and testing sets. Run the following code to split the dataset.

[source,python]
----
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
----

[NOTE]
====
If we wanted to create a validation set, we could use the same function to split the `X_train` and `y_train` datasets into training and validation sets.
====

Now that you have split the dataset, print the number of rows in the training and testing sets to confirm the split was successful.

.Deliverables
====
- Number of rows in the training and testing sets
====

=== Question 6 (2 points)

A common issue with datasets is missing or incomplete data. Perhaps a row is missing information in one column (or in multiple columns, for that matter). This can cause serious issues with our model if it is used for training, so it is important to handle missing data before we train our model.

One way we can deal with missing data is to simply remove the rows that have missing data. This is a very simple approach, but it is effective if the amount of missing data is small.

We can check if a row has a missing value in a specific column using the `isnull()` function. For example,

[source,python]
----
missing_data = df['column_name'].isnull()
----

will return a boolean series with `True` for rows that have missing data, and `False` for rows that do not.

We can also simply use the `dropna` function to remove rows with missing data, and specify to only look in a subset of columns with the `subset` option. For example:

[source,python]
----
df = df.dropna(subset=['column_name'])
----

will remove rows with missing data in the `column_name` column.


For this question, we will modify the Boston dataset to have missing data, and then you will remove the rows with missing data.

First, run the following code to load the dataset and insert missing data:
[source,python]
----
import random
import numpy as np
import pandas as pd

boston_df = pd.read_csv('/anvil/projects/tdm/data/boston_housing/boston.csv')

random.seed(30)
for col in ['CRIM', 'CHAS', 'AGE', 'TAX']:
    # for each row, set the value to NaN with 10% probability
    for i in range(len(boston_df)):
        if random.random() < 0.1:
            boston_df.loc[i, col] = np.nan
----

Now, given what you've learned, write code to answer the deliverables below.

.Deliverables
====
- Number of rows missing data in the `CRIM` column
- Number of rows missing data in the `CHAS` column
- Number of rows missing data in the `AGE` column
- Number of rows missing data in the `TAX` column
- Number of rows left in the dataset after removing missing data
- Is it always a good idea to remove rows with missing data (think curse of dimensionality)? Why or why not? Can you think of other ways to handle missing data?
====

== Submitting your Work

.Items to submit
====
- firstname_lastname_project3.ipynb
====

[WARNING]
====
You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission].

You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.
-==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project4.adoc b/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project4.adoc deleted file mode 100644 index 739917707..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project4.adoc +++ /dev/null @@ -1,203 +0,0 @@ -= TDM 40100: Project 04 - Classifiers - Basics of Classification -:page-mathjax: true - -== Project Objectives - -In this project, we will learn about the basics of classification. We will explore some of the most common classifiers, and their strengths and weaknesses. - - -.Learning Objectives -**** -- Learn about the basics of classification -- Learn classification specific terminology -- Learn how to evaluate the performance of a classifier -**** - -== Supplemental Reading and Resources - -- https://deepai.org/machine-learning-glossary-and-terms/classifier[Machine Learning Glossary - Classifiers] - -== Dataset - -- `/anvil/projects/tdm/data/iris/Iris.csv` - -== Questions - -=== Question 1 (2 points) - -A classifier, as you may remember from Project 2, is a machine learning model that uses input features to classify the data. Classifiers can be used to determine if email is spam or not, or determine what kind of flower a plant is. We can split classifiers into 2 major categories: binary classification and multi-class classification. - -Binary classifiers are used when we want to classify binary outcomes, such as testing if a patient is sick or not. Multi-class classifiers are used when we want to classify more than 2 outcomes, such as a color, or a type of flower. - -[NOTE] -==== -Multi-label classifiers are a special case of multi-class classifiers, where multiple classes can be assigned to a single instance. For example, an image containing both a cat and dog would be classified as both a cat and a dog. These are commonly found in image recognition problems. -==== - -Pennsylvania State University has a great lesson on examples of classification problems. You can read about them https://online.stat.psu.edu/stat508/lessons/Lesson01#classification-problems-in-real-life[in section 1.5 here]. Please read through some of these examples, and then come up with your own real world examples of binary and multi-class classification problems. - -.Deliverables -==== -- What is a real world example of binary classification? -- What is a real world example of multi-class classification? -- Is email spam classification (spam or not spam) a binary or multi-class classification problem? -- Is digit recognition (determining numerical digits that are handwritten) a binary or multi-class classification problem? -==== - -=== Question 2 (2 points) - -There are many different classification models. In this course, we will go more in depth into the K-Nearest Neighbors (KNN) model, the Decision Tree model, and the Random Forest model. Each of these models has its own strengths and weaknesses, and is better suited for different types of data. There are many other classification models, and more methods are being developed all the time. We won't go into detail about these models in this project, but it is important to know that they exist and behave differently. - -GeeksforGeeks has a great article on different classification models and their strengths and weaknesses. You can read about them https://www.geeksforgeeks.org/advantages-and-disadvantages-of-different-classification-models/[here]. Please read through this article and then answer the following questions. 
- -.Deliverables -==== -- Can you name 3 other models that could be used for classification? -- Why is it important to understand the strengths and weaknesses of different classification models? -==== - -=== Question 3 (2 points) - -There are many metrics that can be used to evaluate the performance of a classifier. Some of the most common metrics are accuracy, precision, recall, and F1 score. - -In binary classification, there are 4 possible results from a classifier: - -True Positive: The classifier predicts the presence of a class, when the class is actually present -True Negative: The classifier correctly predicts the absence of a class, when the class is actually absent -False Positive: The classifier predicts the presence of a class, when the class is actually absent -False Negative: The classifier predicts the absence of a class, when the class is actually present - -Accuracy is simply the percentage of correct predictions made by the model. As we learned in Project 3, our data is split into training and testing sets. We can calculate the accuracy of our model by comparing the predicted values on the testing set to the actual values in the testing set. - -$ -\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} -$ - -Precision is a metric that tells us how many of the predictions of a certain class were actually correct. It is calculated by dividing the number of true positives by the number of true positives plus the number of false positives. - -$ -\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} -$ - -Recall is a metric that tells us how many of the actual instances of a certain class were predicted correctly. It is calculated by dividing the number of true positives by the number of true positives plus the number of false negatives. - -$ -\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} -$ - -Finally, the F1 score is the harmonic mean of precision and recall. It is calculated as 2 times the product of precision and recall divided by the sum of precision and recall. This metric is useful when we want to balance precision and recall. - -$ -\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} -$ - -Let's try an example. Given the following table of predictions and actual values, calculate the accuracy, precision, recall, and F1 score. - -[cols="3,3",options="header"] -|=== -|Actual Value |Predicted Value -|Positive |Positive -|Positive |Positive -|Negative |Positive -|Positive |Negative -|Positive |Positive -|Negative |Negative -|Negative |Positive -|Positive |Negative -|Positive |Positive -|Positive |Positive -|=== - -.Deliverables -==== -- Why is accuracy not always the best metric to evaluate the performance of a classifier? -- In your own words, what is the difference between precision and recall? -- Calculate the accuracy, precision, recall, and F1 score for the example above. -==== - -=== Question 4 (2 points) - -There are many applications of classification in the real world. One common application is in the medical field, where classifiers can be used to predict whether a patient has a certain disease based on their symptoms. Another application is in the financial industry, where classifiers can be used to predict whether a transaction is fraudulent or not. - -In more recent years, classifiers have been used in the field of image recognition. 
For example, classifiers can be used to determine whether an image contains a cat or a dog. More advanced classifiers, such as Haar cascades, can be used to detect faces in images by looking for patterns of light and dark pixels. - -In these uses, there often are privacy concerns associated with the data that is being used. If a company wants to develop a classifier to predict whether a transaction is fraudulent, they may need access to sensitive financial data of normal customers. In more recent times, generative image AIs have concerns about what images they were trained on, and if these artists should have their work used to train these models. - -Another issue to consider is bias within these datasets. If a model is trained on data biased towards a certain group, it may make incorrect predictions or reinforce existing biases. If a dataset contains a thousand images of cats and only 5 images of a frog, the classifier may be unable to accurately predict whether an image contains a frog, and may often times incorrectly classify images as cats. Another way bias can be found is in the training itself. A model may wind up relying on a single feature to make predictions, often times creating bias towards that feature (think race, age, income, nationality, etc). - -There are many ways to address bias in classifiers. Typically, the best way to start is to ensure that the training data is very diverse and representative of the real world. Collecting a large amount of data from a variety of sources helps to ensure that the data is not intrinsically biased. Regularization methods can be used to prevent the model from heavily relying on a single or a small number of features. Finally, fairness metrics and bias detection tools such as Google's "What-If" tool or IBM's "AI Fairness 360 (AIF360)" can be used post training to detect and mitigate biases in the model. - -.Deliverables -==== -- Can you think of any areas where there may be ethical concerns with using classifiers? -- Are there any image recognition applications that you interact with, on a daily basis? -==== - -=== Question 5 (2 points) - -Although classifiers are powerful tools, they are not without their limitations. One significant limitation is that classifiers rely heavily on the data they are trained with. If the training data is biased, incomplete, or not representative of the real world, the classifier may make incorrect predictions. - -Class imbalance is a common problem in classification, where one class has significantly more instances than another. This can lead to classifiers that are biased towards the majority class and perform poorly on the minority class. For example, if my dataset contains 99% cats and 1% dogs, a classifier may simply not have enough data to learn how to classify dogs correctly, and may often times incorrectly classify images as cats. - -An easy way to check our class balance is by creating a chart to visualize the distribution of classes in the dataset. To practice, please load the Iris dataset into a dataframe called `iris_df`. Then, run the below code to generate a pie chart displaying the class distribution. - -[source,python] ----- -import matplotlib.pyplot as plt - -# get the counts of the species column -column_counts = iris_df['Species'].value_counts() - -# graph the pie chart -column_counts.plot.pie(autopct='%1.1f%%') ----- - -*Are the classes in the Iris dataset balanced?* - -Feature engineering is another important aspect of machine learning. 
It is the process of manually selecting or transforming the input features in the dataset that are most relevant to the problem at hand. The more irrelevant features a classifier has to work with, the more likely it is to make incorrect predictions.

A related idea is the Pareto Principle (also known as the 80/20 rule): roughly 80% of the effects can be attributed to 20% of the causes. This idea can be observed in a myriad of different situations and fields. In the context of our classification models, the principle suggests that about 20% of our features are responsible for 80% of the predictive power of our model. By identifying which features are important, we can reduce our dataset's dimensionality and make our models significantly more efficient and interpretable.

One example of where features can be removed is in the case of multicollinearity. This occurs when a set of features is highly correlated with each other (i.e., the data for them is redundant). This can lead to overfitting, as the model cannot truly distinguish between the features. In this case, we can remove all but one of the correlated features to reduce our dataset's dimensionality while avoiding the problems of multicollinearity.

We previously looked at encoding categorical variables in Project 3. There are many different ways to encode categorical variables, and the best method depends on the type of data and the model being used. This is an example of feature engineering, as we are transforming the data into a more suitable form for the model.

.Deliverables
====
- Are the classes in the Iris dataset balanced?
- What are some ways to address class imbalance in a dataset?
- Why is feature engineering important in classification?
====

=== Question 6 (2 points)

An important aspect of classification (and machine learning) is the concept of generalization. Generalization is the model's ability to make accurate predictions on new data. This is very important for deploying a model into the real world, as the model will certainly encounter new data that could be wildly different from the training data.

When discussing generalization, two common problems come up: underfitting and overfitting.

Underfitting is when the model is too simple or not flexible enough to capture the underlying patterns or relationships in the dataset. Imagine fitting a straight line of best fit to data that clearly shows a parabolic relationship. The model is underfitting the data and cannot make an accurate prediction.

Overfitting is when the model has fit the training data too closely and is unable to generalize to new data. This can often happen when the model has too many features, when the model is too complex, or if there is significant noise in the data.

There are many ways to help ensure a model generalizes well. One common method is L1 and L2 regularization. Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. L1 regularization adds the absolute value of the coefficients to the loss function, while L2 regularization adds the square of the coefficients to the loss function. Both penalties encourage the model to keep its coefficients small, so that no single feature dominates the predictions; L1 regularization can even shrink some coefficients all the way to zero, effectively performing a form of feature selection.
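To make the L1 and L2 penalties a bit more concrete, here is a minimal sketch (not part of the project deliverables) using scikit-learn's logistic regression, which accepts a `penalty` argument. The `C=0.5` regularization strength is an arbitrary value chosen just for illustration:

[source,python]
----
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# load and scale the Iris features (illustration only)
iris_df = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv')
X = StandardScaler().fit_transform(iris_df.drop(['Id', 'Species'], axis=1))
y = iris_df['Species']

# L1 penalty: some coefficients may be driven all the way to zero
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.5).fit(X, y)

# L2 penalty: coefficients are shrunk toward zero but usually stay nonzero
l2_model = LogisticRegression(penalty='l2', C=0.5).fit(X, y)

print(l1_model.coef_)
print(l2_model.coef_)
----

Comparing the two coefficient matrices, you would typically see more exact zeros in the L1 version, which is the feature-selection effect described above.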
Another technique, called dropout, can be used in neural networks. It works by randomly selecting neurons to be "dropped out" of the network during training, encouraging the network to build a more robust understanding of the relationships between features and to generalize better to new data.

.Deliverables
====
- In your own words, what is underfitting and overfitting?
- What is regularization and how does it help prevent overfitting?
- Can you think of any other ways to help ensure a model generalizes well?
====

== Submitting your Work

.Items to submit
====
- firstname_lastname_project4.ipynb
====

[WARNING]
====
You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission].

You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.
====
diff --git a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project5.adoc b/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project5.adoc
deleted file mode 100644
index 31e898e0c..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project5.adoc
+++ /dev/null
@@ -1,327 +0,0 @@
= TDM 40100: Project 05 - Classifiers - K-Nearest Neighbors (KNN) I
:page-mathjax: true

== Project Objectives

In this project, we will learn about the K-Nearest Neighbors (KNN) machine learning algorithm, develop it without the use of a library, and apply it to a small dataset.

.Learning Objectives
****
- Learn the mathematics behind a KNN
- Create a KNN
- Use KNN to classify data
****

== Dataset

- `/anvil/projects/tdm/data/iris/Iris.csv`

== Questions

=== Question 1 (2 points)

First, let's learn the basics of how a KNN works. A KNN operates by calculating the distance from the input point to every sample in its training data, and then performing a majority vote among the k closest samples to classify the input point. If k=1, it simply chooses the class of the single closest sample. If k=3, it chooses the majority class among the 3 nearest samples. If there is ever a tie, the default behavior is to select a random class from the tied classes.

[NOTE]
====
This random selection during a tie is not ideal, but it is a simple way to handle the case. In the next project, we will explore a way to handle ties in a more sophisticated manner.
====

image::f24-301-p5-1.png[KNN Distance Calculation, width=792, height=500, loading=lazy, title="KNN Distance Calculation"]

Take the above example. Suppose we have some dataset containing 2 classes, represented by blue triangles and orange circles. If we have some unknown point (the green square), we can classify it by finding the k closest points to it and taking a majority vote. In this case, the 5 closest points are shown with dashed lines and labeled in order.

If k=1, what would the unknown point be classified as? If k=3, what would it be classified as? If k=5, what would it be classified as?
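If it helps to see the voting mechanics written out, here is a tiny sketch in plain Python. The points, labels, and unknown point below are made up for illustration; they are not the points from the figure above:

[source,python]
----
from collections import Counter

# made-up 2D training points and their class labels
points = [(1.0, 1.0), (1.2, 0.8), (3.0, 3.1), (3.2, 2.9), (0.9, 1.1)]
labels = ['triangle', 'triangle', 'circle', 'circle', 'triangle']

unknown = (1.1, 0.9)
k = 3

# squared Euclidean distance is enough for ranking neighbors
distances = [((p[0] - unknown[0])**2 + (p[1] - unknown[1])**2, label)
             for p, label in zip(points, labels)]

# sort by distance, keep the labels of the k closest points, and take a majority vote
nearest_labels = [label for _, label in sorted(distances)[:k]]
print(Counter(nearest_labels).most_common(1)[0][0])   # prints 'triangle'
----

The next few questions build this up more carefully (including how ties are handled), but the core steps are always the same: compute distances, sort, and vote.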
To think about this simply, let's look at an example with 2 input features. This dataset uses hue and size to identify fruit.

[cols=4*]
|===
|#|Hue |Size |Output Variable
|1|22|1|Banana
|2|27|0.9|Banana
|3|87|0.05|Grape
|4|84|0.03|Grape
|===

Given this dataset, we want to identify a fruit with Hue=24, Size=0.95.

To find the distance between 2D points, you can use the formula

$
\text{dist} = \sqrt{(X-X_0)^2 + (Y-Y_0)^2}
$

.Deliverables
====
- What class would the green square be classified as if k=1? if k=3? if k=5?
- Which point is our unknown fruit closest to? (put the #)
- What fruit should our unknown fruit be classified as, assuming k=1?
- What would happen if we set k=4?
====

=== Question 2 (2 points)

Now that we understand the basics of how a KNN works, let's create a KNN from scratch in Python.

We will still use pandas to load the dataset and scikit-learn to scale and split the data, but we will not use scikit-learn to create the KNN.

First, let's load the Iris dataset, separate the data into features and labels (hint: the Species column is our target variable), scale the input features, and split the data into training and testing sets (80% training, 20% testing).

[NOTE]
====
Please review your work from Project 1 and Project 3 if you need a refresher on how to import a dataset, and how to scale and split data. If you did not complete project 3, please read the https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler.fit_transform[StandardScaler documentation] and the https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split[train_test_split documentation], or ask a TA for help during office hours.
====

[source,python]
----
# Import libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# load the dataframe into `df`
'''YOUR CODE TO LOAD THE DATAFRAME'''

# separate the data into input features 'X' and output variable 'y'. Be sure to remove the 'Id' column from the input features
'''YOUR CODE TO SEPARATE THE DATA'''

# scale the input features into `X_scaled`
'''YOUR CODE TO SCALE THE INPUT FEATURES'''

# split the data into training 'X_train' and 'y_train' and testing 'X_test' and 'y_test' sets. Use a test size of 0.2 and random state of 42
'''YOUR CODE TO SPLIT THE DATA'''
----

[NOTE]
====
train_test_split returns 4 variables in the order X_train, X_test, y_train, y_test. Although we provided pandas dataframes, the X_train and X_test variables will be numpy arrays. However, the y_train and y_test variables will remain pandas series. This may cause confusion in future code, so it may be helpful to convert the pandas series to numpy arrays using their `.to_numpy()` function. For example, `y_train = y_train.to_numpy()`.
====

*Please print the first 5 rows of the testing input features to confirm whether your data is processed correctly.*

.Deliverables
====
- Output the first 5 rows of the testing input features
====

=== Question 3 (2 points)

Now that we have our data loaded, scaled, and split, let's start working on creating a KNN from scratch.

Over the next 3 questions, we will fill in functions in the KNN class below that are needed to classify new data points and test the model.
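If you want to sanity-check your Question 2 preprocessing before diving into the class, one possible version (essentially the same recap code used at the start of the next project) looks roughly like this:

[source,python]
----
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv')
X = df.drop(['Species', 'Id'], axis=1)
y = df['Species']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

y_train = y_train.to_numpy()
y_test = y_test.to_numpy()
----

With that in place, here is the skeleton of the KNN class we will be filling in.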
- -[source,python] ----- -''' -class : `KNN` -init inputs : `X_train` (list[list[float]]), `y_train` (list[str]) - -description : This class stores the training data and classifies new data points using the KNN algorithm. -''' -class KNN: - def __init__(self, X_train, y_train): - self.features = X_train - self.labels = y_train - - def train(self, X_train, y_train): - self.features = X_train - self.labels = y_train - - def euc_dist(self, point1, point2): - '''YOUR CODE TO CALCULATE THE EUCLIDEAN DISTANCE''' - pass - - def classify(self, new_point, k=1): - '''YOUR CODE TO CLASSIFY A NEW POINT''' - pass - - def test(self, X_test, y_test, k=1): - '''YOUR CODE TO TEST THE MODEL''' - pass ----- - -First, let's fill in the `euc_dist` function that calculates the Euclidean distance between two n-dimensional points. The formula for the Euclidean distance between two points is - -$ -\text{dist} = \sqrt{(X_1-X_2)^2 + (Y_1-Y_2)^2 + ... + (Z_1-Z_2)^2} -$ - -where X, Y, Z, etc. are the n-dimensional coordinates of the two points. - -We can imagine each row in our dataset as a point in n-dimensional space, where n is the number of input features. The Euclidean distance between two points is the straight-line distance between them. It can be difficult to visualize in higher dimensions, but the formula remains the same. - -The inputs for this function are `point1` and `point2`, which are each rows from our dataset. The output should be the float value of the Euclidean distance between the two points. - -[NOTE] -==== -With pandas dataframes, you can perform operations between rows. For example, if you have `row1` and `row2`, you can calculate the difference between them by running `row1 - row2`. This will return a new row with the differences between the two rows. This will be useful for calculating the Euclidean distance between two points. -==== - -One thing that you should learn how to do is test functions that you write. Instead of creating the whole KNN and making sure the code works at the very end, it is important to test each piece of code as we right it. We can create test cases to see if our function is working as expected. Some test cases have been provided to you below. For this function, please create 2-3 test cases of your own to ensure that your function works as expected. - -[NOTE] -==== -In python, we can use the `assert` statement for test cases. If we assert an expression that results in true, the code will continue like nothing happened. However, if the expression results in false, we will receive an `AssertionError`, notifying us that our function is not working as expected. 
-==== - -[source,python] ----- -import numpy as np -# make a knn object -knn = KNN(X_train, y_train) -# test the euc_dist function -assert knn.euc_dist(np.array([1,2,3]), np.array([1,2,3])) == 0 -assert knn.euc_dist(np.array([1,2,3]), np.array([1,2,4])) == 1 -assert knn.euc_dist(np.array([0,0]), np.array([3,4])) == 5 -# your test cases here: - ----- - -*To test that your function works, calculate the Euclidean distance between the first two rows of the training input features by running the code below.* - -[source,python] ----- -# make a knn object -knn = KNN(X_train, y_train) -print(knn.euc_dist(X_train[0], X_train[1])) ----- - -.Deliverables -==== -- Your own test cases for the `euc_dist` function -- Output of calculating the euclidean distance between the first two rows of the training input features -==== - -=== Question 4 (2 points) - -Now that we have a function to calculate the Euclidean distance between two points, let's work on the `classify` function, which will classify a new point using the KNN algorithm. - -To classify a point, we need to calculate the Euclidean distance between the new point and all points in the training data. Then, we can find the `k` closest points and take a majority vote to classify the new point. - -Fill in the `classify` function to classify a new point using the KNN algorithm. If there is a tie, randomly select a class. - -[IMPORTANT] -==== -Since our features and labels are stored in separate variables, it is recommended that you use the `zip` function to iterate over both lists simultaneously. For example, given A=[1,2,3,4] and B=[5,6,7,8], you can use zip(A,B) to create a list [(1,5), (2,6), (3,7), (4,8)]. This will allow you to repackage the features and labels into a single list. -==== - -[NOTE] -==== -To find the `k` closest points, we recommend you to use the `sorted` function with a lambda function as the key. For example, to sort a list in ascending order, you can run `sorted(list, key=lambda x: 'some function involving element x')`. This lambda essentially says for each element x in the list, get a value by running some function and sort based on that value. Another hint is that the 'some function involving element x' should be a function you wrote in the last question... -==== - -Below is some pseudocode to help you get started on the `classify` function. -[source,python] ----- -def classify(self, new_point, k=1): - # combine features and labels into a single list - ### YOUR CODE HERE ### - - # sort the list by the euclidean distance between each point and the new point - ### YOUR CODE HERE ### - - # get the k closest points - ### YOUR CODE HERE ### - - # get the labels of the k closest points - ### YOUR CODE HERE ### - - # find the majority class - ### YOUR CODE HERE ### ----- - - -*To test that your function works, classify the first row of the testing input features using the KNN algorithm with k=3 by running the code below. You should get a classification of `Iris-versicolor`* - -[source,python] ----- -# make a knn object -knn = KNN(X_train, y_train) -print(knn.classify(X_test[0], k=3)) ----- - -.Deliverables -==== -- Classification of the first row of the testing input features using the KNN algorithm with k=3 -==== - -=== Question 5 (2 points) - -Now that we are able to classify a single point, let's work on the `test` function, which will test the model on a dataframe of input features and output variables. 
For this function, we simply need to iterate over all points in our input features, classify each point, and compare each classification to the actual output variable. We can then calculate the accuracy of our model by dividing the number of correct classifications by the total number of classifications.

Below is some pseudocode to help you get started on the `test` function.

[source,python]
----
def test(self, X_test, y_test, k=1):
    # for each point in X_test
    ### YOUR CODE HERE ###
        # classify the point
        ### YOUR CODE HERE ###

        # compare the classification to the actual output variable
        # if the classification is correct, increment a counter
        ### YOUR CODE HERE ###

    # calculate and return the accuracy of the model
    ### YOUR CODE HERE ###
----

*To test that your function works, test the model on the testing input features and output variables using the KNN algorithm with k=1 by running the code below. You should get an accuracy of 0.9666666666666667*

[source,python]
----
# make a knn object
knn = KNN(X_train, y_train)
print(knn.test(X_test, y_test, k=1))
----

.Deliverables
====
- Accuracy of the model on the testing input features and output variables using the KNN algorithm with k=1
====

=== Question 6 (2 points)

Let's check how the KNN performs on a different dataset. Load the white wine quality dataset from `/anvil/projects/tdm/data/wine_quality/winequality-white.csv`.

[NOTE]
====
This dataset is delimited with semicolons, not commas. We can specify this when loading the dataset by setting the `sep` parameter of the `pd.read_csv` function to `;`. Additionally, as the column names are surrounded by quotes, we can set the `quotechar` parameter to `"` to remove the quotes from the column names.
====

With this dataset, we want to classify the `quality` column based on the other columns.

Be sure to scale and split the data as you did in Question 2. Use a test size of 0.15 and a random state of 20 to split the data.

Then, create a KNN object, train the model, and test the model on the testing input features and output variables using the KNN algorithm with k=3. Output the accuracy of the model.

.Deliverables
====
- Accuracy of the model on the white wine quality dataset using the KNN algorithm with k=3
====

=== Question 7 (2 points)

Can you think of any potential problems with the way we are classifying a new point? Can you think of any ways we can modify the algorithm to improve its performance? (Hint: think about feature engineering, how the k nearest neighbors are chosen and weighted, ties, etc.)

.Deliverables
====
- Your response to the above question.
====

== Submitting your Work

.Items to submit
====
- firstname_lastname_project5.ipynb
====

[WARNING]
====
You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission].

You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this.
Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project6.adoc b/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project6.adoc deleted file mode 100644 index ef3e76d4a..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project6.adoc +++ /dev/null @@ -1,332 +0,0 @@ -= TDM 40100: Project 06 - Classifiers - K-Nearest Neighbors (KNN) II - -== Project Objectives - -In this project, we will learn about more advanced techniques for K-Nearest Neighbors (KNN), and continue building our KNN from scratch. - -.Learning Objectives -**** -- Learn about feature engineering -- Learn about better ways to handle ties in KNN -**** - -== Dataset - -- `/anvil/projects/tdm/data/iris/Iris.csv` - -== Questions - -=== Question 1 (2 points) - -In the previous project, we developed a KNN class that is able to classify new data points. If you completed the previous project, you should have a basic understanding of how a KNN works. - -In this question, we will briefly recap last project's code and concepts. Please run the following code to load the Iris dataset, scale the input features, and split the data into training and testing sets. - -[source,python] ----- -import pandas as pd -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import train_test_split - -df = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv') -X = df.drop(['Species','Id'], axis=1) -y = df['Species'] - -scaler = StandardScaler() -X_scaled = scaler.fit_transform(X) - -X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42) - -y_train = y_train.to_numpy() -y_test = y_test.to_numpy() ----- - -Then, please run the following code to import the KNN class. If you did the previous project, please use your own KNN class. If you did not complete the previous project, please use the following code to import the KNN class. -[source,python] ----- -class KNN: - def __init__(self, X_train, y_train): - self.features = X_train - self.labels = y_train - - def train(self, X_train, y_train): - self.features = X_train - self.labels = y_train - - def euc_dist(self, point1, point2): - # short 1 line approach - return sum((point1 - point2) ** 2) ** 0.5 - - def classify(self, new_point, k=1): - # sort the combined list by the distance from the point in the list to the new point - nearest_labels = [x[1] for x in sorted(zip(self.features, self.labels), key=lambda x: self.euc_dist(x[0], new_point))[:k]] - return max(set(nearest_labels), key=nearest_labels.count) - - def test(self, X_test, y_test, k=1): - # short 1 line approach, efficent list comprehension - return sum([1 for p in zip(X_test, y_test) if self.classify(p[0],k=k)==p[1]])/len(X_test) ----- - -To review the concepts of the KNN algorithm, please answer the following questions. - -.Deliverables -==== -- What is the purpose of the `train` function in the KNN class? -- How does a KNN pick which k neighbors to use when classifying a new point? -- How does a KNN handle ties when classifying a new point? -==== - -=== Question 2 (2 points) - -To review, the KNN works entirely by calculating the Euclidean distance between points in n-dimensional space. This means that scaling our input features is very important, as features with larger scales will have a larger impact on the distance calculation. - -However, uniformly scaling our features may not be the best approach. 
If we wanted to identify the difference between a red apple and a green apple, the most important feature would be the color of the apple. Therefore, we would want to scale the color feature more than the size feature. - -[NOTE] -==== -This concept of manually assigning the importance of features is an example of feature engineering. We can use existing knowledge (or often times intuition) to determine how important each feature should be in the model. This can greatly improve our model's performance if done right, and can also often lead to a more interpretable model. -==== - -Let's create a new function inside the KNN class that will calculate the euclidean distance between two points, but will take a list of weights to determine how important each feature is. This will allow us to scale the distance between two points based on the importance of each feature. - -[source,python] ----- -def scaled_distance(self, point1, point2, weights): - # firstly, scale weights so they sum to 1 (each weight should be a fraction of the total 1) - ''' YOUR CODE HERE ''' - - # then, scale the 2 points by the weights (multiply each feature in the point by the corresponding weight) - ''' YOUR CODE HERE ''' - - # finally, calculate and return the euclidean distance between the 2 points (use the existing euc_dist function) - ''' YOUR CODE HERE ''' - pass ----- - -*To ensure your implementation is correct, run the below code to calculate the scaled distance between the first row of the training set and the second row of the testing set. The printed values should be 1.4718716551800963 for euc_dist and 0.15147470763642634 for scaled_distance.* - -[source,python] ----- -knn = KNN(X_train, y_train) -print(knn.euc_dist(X_train[0], X_test[1])) -print(knn.scaled_distance(X_train[0], X_test[1], [1,1,1,10])) ----- - -.Deliverables -==== -- Output of running the sample code to confirm correct implementation of the scaled_distance function -- What does the distance decreasing when we raised the weight of the last feature mean? -==== - -=== Question 3 (2 points) - -Now that we have code to scale the distance between two points based on the importance of each feature, let's write two functions inside the KNN class to classify a point using weights, and to test the model using weights. - -[NOTE] -==== -These functions will be extremely similar to the existing classify and test functions, but use the scaled_distance function instead of the euc_dist function. -==== - -[source,python] ----- -def classify_weighted(self, new_point, k=1, weights=None): - ''' If weights == None, run the existing classify function ''' - - # now, write the classify function using the scaled_distance function - ''' YOUR CODE HERE ''' - -def test_weighted(self, X_test, y_test, k=1, weights=None): - ''' YOUR CODE TO TEST THE MODEL ''' - pass ----- - -*To test that your functions work, please run the below code to calculate the accuracy of the model with different weights. 
Your accuracies should be 0.9666666666666667, 0.9666666666666667, and 0.8333333333333334 respectively.* - -[source,python] ----- -knn = KNN(X_train, y_train) -print(knn.test_weighted(X_test, y_test, k=1, weights=[1,1,1,1])) -print(knn.test_weighted(X_test, y_test, k=1, weights=[1,1,1,10])) -print(knn.test_weighted(X_test, y_test, k=1, weights=[10,1,1,1])) ----- -.Deliverables -==== -- Accuracy of the model on the testing input features and output variables using the KNN algorithm with k=1 and weights=[1,1,1,1] -- Accuracy of the model on the testing input features and output variables using the KNN algorithm with k=1 and weights=[1,1,1,10] -- Accuracy of the model on the testing input features and output variables using the KNN algorithm with k=1 and weights=[10,1,1,1] - -- Does the accuracy of the model change when we change the weights? Why or why not? -==== - -=== Question 4 (2 points) - -One potential limitation of the KNN is that we are simply selecting the class based on the majority of the k nearest neighbors. Suppose we attempt to classify some point with k=3. Suppose this results in finding 2 neighbors of class A and 1 neighbor of class B. In this case, the KNN would classify the point as class A. However, what if the 2 neighbors of class A are very far away from our new point, while the class B neighbor is extremely close? It would probably make more sense to classify the point as class B. - -Additionally, suppose our dataset is unbalanced. We may have hundreds of examples of class A in our dataset, but only a few examples of class B. In this case, it is very likely that the KNN will classify points as class A, even if they are closer to class B neighbors. - -To address this limitation, a common modification to the KNN is to weight the k-nearest neighbors based on their distance to the new point. This means that closer neighbors will have a larger impact on the classification than farther neighbors. Although this is more computationally expensive, it creates a much more robust model. - -Implement a new function inside the KNN class that classifies a new point using weighted neighbors. This function should work similarly to the classify function, but should return the class based on the average distance of each class, as opposed to a simple majority vote. - -[source,python] ----- -def classify_distance(self, new_point, k=1, weights=None): - # follow the same approach as the classify function. however, for each nearest neighbor, we need to save both the label and the distance - # nearest_labels = [(label, distance), ... k times] - ''' YOUR CODE HERE ''' - - # now, we need to select the class based on each distance, not just the label - # we can find the average distance of each class and select the class with the smallest average distance - ''' YOUR CODE HERE ''' - ----- -[NOTE] -==== -It is recommended to use `defaultdict` from the `collections` module to initialize a dictionary with a default value of a list. This will allow you to append to the list without checking if the key exists. -==== - -*To test that your function works properly, we will classify the a test point at different k values. Run the below code to ensure that your function works properly. 
The output should be 'Iris-versicolor', 'Iris-versicolor', and 'Iris-virginica' respectively.* - -[source,python] ----- -knn = KNN(X_train, y_train) -print(knn.classify_distance(X_test[8], k=5, weights=None)) -print(knn.classify_distance(X_test[8], k=7, weights=None)) -print(knn.classify_distance(X_test[8], k=9, weights=None)) ----- - -[NOTE] -==== -If you print some debugging information inside the function, you should see that even though at k=9 there are more 'Iris-versicolor' neighbors, the average distance of the 'Iris-virginica' neighbors is smaller and therefore is selected. -==== - -.Deliverables -==== -- Classification test at k=5, 7, and 9. -- Explanation of why the classification changes when we change the k value -- What do you think happens if we set k to the number of training points? -==== - -=== Question 5 (2 points) - -In this project you have learned about feature engineering, feature importance scaling, and different ways to handle ties in KNN. - -Based on what you have learned about KNNs, please answer the following questions. - -.Deliverables -==== -- What is the purpose of feature engineering in machine learning? -- Why is it important to scale input features in KNN? -- What are the advantages and disadvantages of the two approaches to handling ties in KNN? -- What are limitations of the KNN algorithm? -==== - -=== Question 6 (2 points) - -A change that may be beneficial is to only use the distance based weighting when there is a tie in classification. This would allow the model to be more accurate when there is a tie, but not change the classification when there is not a tie. Please make a new classify function called `classify_weighted_ties` that will use the distance based weighting only when there is a tie in classification. - -[source,python] ----- -def classify_weighted_ties(self, new_point, k=1, weights=None): - # try to classify using the normal method - ''' YOUR CODE HERE ''' - - # if there is a tie in labels, classify using the weighted method - ''' YOUR CODE HERE ''' ----- - -To check that your function works properly, run the below code as a test case. - -[source,python] ----- -knn = KNN(X_train, y_train) -print(knn.classify_weighted_ties(X_test[8], k=1, weights=None)) -print(knn.classify_weighted_ties(X_test[8], k=2, weights=None)) -print(knn.classify_weighted_ties(X_test[8], k=3, weights=None)) -print(knn.classify_weighted_ties(X_test[8], k=4, weights=None)) ----- - -.Deliverables -==== -- Output of classifying X_test[8] at k=1,2,3,4 -==== - -=== Question 7 (2 points) - -Another modification that may be beneficial is checking if there is a class imbalance in our dataset. If ther is, we should use the distance based weighting in order to select the class. This is because in the event of class imbalance in our dataset, it is significantly more likely that our point will be classified as the majority class. - -For this, we will make 2 new functions: `get_class_distribution` and `classify_weighted_imbalance`. The `get_class_distribution` function will return a dictionary of every class and its proportion in the training data. (e.g. If 50% of the labels are `Iris-virginica`, the dictionary should contain `'Iris-virginica': 0.5`). The `classify_weighted_imbalance` function will get the distribution of classes and if it detects a class imbalance, it will use the distance based weighting to classify the point. We can detect a class imbalance if the proportion of any class is more than some threshold times the proportion of any other class. 
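Before filling in the scaffold below, it may help to see the imbalance check on its own. Here is a tiny illustration with made-up proportions (both the numbers and the threshold are invented for the example):

[source,python]
----
# hypothetical class proportions, just for illustration
distribution = {'Iris-setosa': 0.45, 'Iris-versicolor': 0.10, 'Iris-virginica': 0.45}
threshold = 2

# imbalanced if the largest proportion is more than `threshold` times the smallest
imbalanced = max(distribution.values()) > threshold * min(distribution.values())
print(imbalanced)   # True, since 0.45 > 2 * 0.10
----

The scaffold for the two functions follows.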
[source,python]
----
def get_class_distribution(self):
    # get each unique value from self.labels
    # count the number of times each unique value appears
    # return a dictionary of the class and its proportion in the training data
    ''' YOUR CODE HERE '''

def classify_weighted_imbalance(self, new_point, k=1, weights=None, threshold=2):
    # get the class distribution
    ''' YOUR CODE HERE '''

    # check if there is a class imbalance
    ''' YOUR CODE HERE '''
----

To test this function, we will need to make a new dataset with a class imbalance. Run the below code to create a modified Iris dataframe that has a class imbalance (we will remove 25 examples of the 'Iris-versicolor' class).

[source,python]
----
iris_imbalanced = df.copy()

# remove the first 25 rows of the Iris-versicolor class
iris_imbalanced = iris_imbalanced.drop(iris_imbalanced[iris_imbalanced['Species'] == 'Iris-versicolor'].head(25).index)

X_imbalanced = iris_imbalanced.drop(['Species','Id'], axis=1)
y_imbalanced = iris_imbalanced['Species']

scaler = StandardScaler()
X_imbalanced_scaled = scaler.fit_transform(X_imbalanced)

X_train, X_test, y_train, y_test = train_test_split(X_imbalanced_scaled, y_imbalanced, test_size=0.2, random_state=15)

y_train = y_train.to_numpy()
y_test = y_test.to_numpy()
----

Then, we can test the functions by running the below code.

[source,python]
----
knn = KNN(X_train, y_train)
# check the class distribution
print(knn.get_class_distribution())

# classify the 9th row of the testing input features using the original classify function
print(knn.classify(X_test[8], k=5))

# classify the 9th row of the testing input features using the imbalance detection classify function
print(knn.classify_weighted_imbalance(X_test[8], k=5, weights=None, threshold=1.5))
----

.Deliverables
====
- Output of the class distribution
- Output of classifying X_test[8] using the original classify function with k=5
- Output of classifying X_test[8] using the KNN algorithm with k=5 and threshold=1.5
====

== Submitting your Work

.Items to submit
====
- firstname_lastname_project6.ipynb
====

[WARNING]
====
You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission].

You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.
====
diff --git a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project7.adoc b/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project7.adoc
deleted file mode 100644
index 89cfa8730..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project7.adoc
+++ /dev/null
@@ -1,268 +0,0 @@
= 401 Project 07 - Classifiers - Decision Trees
:page-mathjax: true

== Project Objectives

In this project, we will learn about Decision Trees and how they classify data.
We will use the Iris dataset to classify the species of Iris flowers using Decision Trees. - -.Learning Objectives -**** -- Learn how a Decision Tree works -- Implement a Decision Tree classifier using scikit-learn -**** - -== Supplemental Reading and Resources - -- https://scikit-learn.org/stable/modules/tree.html[Scikit-learn Decision Trees Article] - -== Dataset - -- `/anvil/projects/tdm/data/iris/Iris.csv` - -== Questions - - -=== Question 1 (2 points) - -Decision Trees are a supervised learning algorithm that can be used for regression and/or classification problems. They work by splitting the data into subsets depending on the depth of the tree and the features of the data. Its goal in doing this is to create simple decision making rules to help classify the data. Because of this, Decision Trees are very easy to interpret, and often used in problems where interpretability is important. - -These trees can be easily visualized in a format similar to a flowchart. For example, if we want to classify some data point as a dog, horse, or pig, a Decision Tree may look like this: - -image::f24-301-p7-1.PNG[Example Decision Tree, width=792, height=500, loading=lazy, title="Example Decision Tree"] - -In the above example, then we start at the root node. We then follow each condition until we reach a leaf node, which gives us our classification. - -[NOTE] -==== -In the above example, there is only one condition per node. However, in practice, there can be an unlimited number of conditions per node. These are parameters that can be adjusted when creating the Decision Tree. More conditions in one node can make the tree more complex and potentially more accurate, but it may lead to overfitting and will be harder to interpret. -==== - -Suppose we have some dataset: - -[cols="3*"] -|=== -|Temp | Size | Target -|300 | 1 | A -|350 | 1.1 | A -|427 | 90 | A -|1200 | 1.3 | B -|530 | 1.2 | B -|500 | 20 | C -|730 | 2.1 | B -|640 | 14 | C -|830 | 15.4 | C -|=== - -Please fill in the blanks for the Decision Tree below: - -image::f24-301-p7-1-2.PNG[Example Decision Tree, width=792, height=500, loading=lazy, title="Example Decision Tree"] - - -.Deliverables -==== -- Answers for the blanks in the Decision Tree. (Please provide the number corresponding to each blank, shown in the top left corner of each box.) -==== - -=== Question 2 (2 points) - -For this question we will use the Iris dataset. As we normally do for the classification section, please load the dataset, scale it, and split it into training and testing sets using the below code. - -[source,python] ----- -import pandas as pd -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import train_test_split - -df = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv') -X = df.drop(['Species','Id'], axis=1) -y = df['Species'] - -scaler = StandardScaler() -X_scaled = scaler.fit_transform(X) - -X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=20) - -y_train = y_train.to_numpy() -y_test = y_test.to_numpy() ----- - -We can create a Decision Tree classifier using scikit-learn's `DecisionTreeClassifier` class. When constructing the class, there are several parameters that we can set to control their behavior. Some examples include: - -- `criterion`: The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. -- `max_depth`: The maximum depth of the tree. 
If None, then nodes are expanded until all leaves are pure or until all leaves contain less than `min_samples_split` samples. -- `min_samples_split`: The minimum number of samples required to split an internal node. -- `min_samples_leaf`: The minimum number of samples required to be at a leaf node. - -In this project, we will explore how these parameters affect our Decision Tree classifier. To start, let's create a Decision Tree classifier with the default parameters and see how it performs on the Iris dataset. - -[source,python] ----- -from sklearn.tree import DecisionTreeClassifier -from sklearn.metrics import accuracy_score - -parameters = { - "max_depth": None, - "min_samples_split": 2, - "min_samples_leaf": 1 -} - -decision_tree = DecisionTreeClassifier(random_state=20, **parameters) -decision_tree.fit(X_train, y_train) - -y_pred = decision_tree.predict(X_test) -accuracy = accuracy_score(y_test, y_pred) - -print(f'Model is {accuracy*100:.2f}% accurate with parameters {parameters}') ----- - -.Deliverables -==== -- Output of running the above code to get the model's accuracy -==== - -=== Question 3 (2 points) - -Now that we have created our Decision tree, let's look at how we can visualize it. Scikit-learn provides a function called `plot_tree` that can be used to visualize the Decision Tree. This relies on the `matplotlib` library to plot the tree. The following code can be used to plot a given Decision Tree: - -[NOTE] -==== -The `plot_tree` function has several parameters that can be set to control the appearance of the tree. A full list of parameters can be found (here)[https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html#sklearn.tree.plot_tree]. -==== - -[source,python] ----- -from sklearn.tree import plot_tree -import matplotlib.pyplot as plt - -plt.figure(figsize=(20,10)) -plot_tree(decision_tree, feature_names=X.columns, class_names=decision_tree.classes_, filled=True, rounded=True) ----- - -After running this code, a graph should be generated showing how the decision tree makes decisions. Leaf nodes (nodes with no children, ie. the final decision) should have 4 values in them, whereas internal nodes (nodes with children, often called decision nodes) contain 4 values and a condition. These 4 values are as follows: - -- criterion score (in this case, gini): The score of the criterion used to split the node. This is a measure of how well the node separates the data. For gini, a score of 0 means the node contains only one class, and higher scores mean that the potential classes are more mixed. -- samples: The number of samples that fall into that node after following the decision path. -- value: An array representing the number of samples of each class that fall into that node. -- class: The class that the node would predict if it were a leaf node. - -Additionally, you can see that every box has been colored. This is done to help represent the class that the node would predict if it were a leaf node, determined by the 'value' array. As you can see, leaf nodes are a single pure color, while decision nodes may be a mix of colors (see the first decision node and the furthest down decision node). - -.Deliverables -==== -- Output of running the above code -- Based on how the tree is structured, what can we say about how similar each class is to each other? Is there a class that differs significantly from the others? -==== - -=== Question 4 (2 points) - -The first parameter we will investigate is the 'max_depth' parameter. 
This parameter controls how nodes are expanded throughout the tree. A larger max_depth will let the tree make more complex decisions but may lead to overfitting. - -Write a for loop that will iterate through a range of max_depth values from 1 to 10 (inclusive) and store the accuracy of the model for a given max_depth in a list called 'accuracies'. Then, run the code below to plot the accuracies. - -[source,python] ----- -import matplotlib.pyplot as plt - -plt.plot(range(1, 11), accuracies) -plt.xlabel('Max Depth') -plt.ylabel('Accuracy') -plt.title('Accuracy vs Max Depth') ----- - -*As we increase the max_depth, what happens to the accuracy of the model? What is the smallest max_depth that gives the maximum accuracy?* - -For now, let's assume that this smallest max_depth for maximum accuracy is the best parameter to use for our model. Please display the decision trees for a max_depth of 1, this optimal max_depth, and a max_depth of 10. - -.Deliverables -==== -- Code that creates the 'accuracies' list -- Output of running the above code -- As we increase the max_depth, what happens to the accuracy of the model? What is the smallest max_depth that gives the maximum accuracy? -- Decision Trees for max_depth of 1, the optimal max_depth, and a max_depth of 10 -- What can we say about the complexity of the tree as max_depth increases? Does a high max_depth lead to uninterpretable trees, or are they still easy to follow? -==== - -=== Question 5 (2 points) - -In addition to the importance of the 'max_depth' parameter, the 'min_samples_split' and 'min_samples_leaf' parameters also have a profound effect on the Decision Tree. These parameters control, respectively, the minimum number of samples at a node to be allowed to split, and the minimum number of samples that a leaf node must have. When these values are left at their default values (2 and 1, respectively), the Decision Tree is allowed to continue splitting nodes until there is only a single sample in each leaf node. This easily leads to overfitting, as the model has created a path for the exact training data, rather than a general rule for the dataset. - -In this question, we will do something similar to what we did in the previous question, however we will do it for both the 'min_samples_split' and 'min_samples_leaf' parameters. For each parameter, we will iterate through a range of values from 2 to the size of our training data (inclusive) and store the accuracy of the model for a given value in a list called 'split_accuracies' and 'leaf_accuracies' respectively. Leave the value for the other parameter at its default. Then, run the code below to plot the accuracies. - -[source,python] ----- -plt.plot(range(2, len(X_train)), split_accuracies) -plt.plot(range(2, len(X_train)), leaf_accuracies) -plt.xlabel('Parameter Value') -plt.ylabel('Accuracy') -plt.legend(['Min Samples Split', 'Min Samples Leaf']) -plt.title('Accuracy vs Split and Leaf Parameter Values') ----- - -.Deliverables -==== -- Code that creates the 'split_accuracies' and 'leaf_accuracies' lists -- Output of running the above code -- What can we say about the effect of the 'min_samples_split' and 'min_samples_leaf' parameters on the accuracy of the model? What values of these parameters would you recommend using for this model? -==== - -=== Question 6 (2 points) - -To get a bit more technical, we will be looking at the 'criterion' parameter. Previously, we described this as a function used to measure the quality of a split. However, there is a bit more to it than that. 
scikit-learn supports two criteria for Decision Trees: 'gini' and 'entropy'. - -Gini is a function that measures the impurity of a node. Essentially, the decision tree will attempt to minimize the gini impurity of nodes it creates. The mathematical definition of the gini impurity is as follows: - -$ -\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2 -$ - -Where $p_i$ is the probability of class $i$ in the node. This value can range from 0 to 0.5, where 0 is a node with only one class, and 0.5 is a node with an equal number of each class. - -Entropy is a function that measures the information gain of a node. Information gain is a measure of how much we learn about the data through that split. Its formula is defined as follows: - -$ -\text{Entropy} = \sum_{i=1}^{n} - p_i \log_2(p_i) -$ - -Where $p_i$ is the probability of class $i$ in the node. This value can range from 0 to 1, where 0 is a node with only one class, and 1 is a node with an equal number of each class. - -In almost all cases, the two criteria will give very similar results. To understand this better, we will graph the two functions for a range of p values from 1% to 99%. We will assume a binary classification system, so there are only two classes (P(class 1) + P(class 2) = 1). - -Write code to generate lists of gini and entropy values for a range of p_i values from 1% to 99%, in 1% increments. Then, plot the two functions on the same graph using the code below. - -[NOTE] -==== -Remember, the valid range of gini values is from 0 to 0.5, while the range of entropy values is from 0 to 1. For this reason, to validly compare their graphs, you will need to double the gini values so they are on the same scale as the entropy values. -==== - -[source,python] ----- -p_values = np.linspace(0.01, 0.99, 99) -plt.plot(p_values, gini_values) -plt.plot(p_values, entropy_values) -plt.xlabel('P(class 1)') -plt.ylabel('Impurity') -plt.legend(['Gini', 'Entropy']) -plt.title('Gini vs Entropy') ----- - -.Deliverables -==== -- Code that creates the 'gini_values' and 'entropy_values' lists -- Output of running the above code -- What can we say about the differences between the Gini and Entropy functions? Computationally speaking, why do most people use Gini over Entropy? -- In what cases would you recommend using Entropy over Gini? -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project7.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. 
====
diff --git a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project8.adoc b/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project8.adoc
deleted file mode 100644
index 2fd3d7454..000000000
--- a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project8.adoc
+++ /dev/null
@@ -1,327 +0,0 @@
= 401 Project 08 - Classifiers - Decision Tree Ensembles

== Project Objectives

In this project, we will be learning about Extra Trees and Random Forests, two popular ensemble models utilizing Decision Trees.

.Learning Objectives
****
- Learn how Extra Trees and Random Forests work
- Implement Extra Trees and Random Forests in scikit-learn
****

== Supplemental Reading and Resources

- https://scikit-learn.org/stable/modules/ensemble.html[Scikit-learn Ensemble Learning Article]

== Dataset

- `/anvil/projects/tdm/data/iris/Iris.csv`

== Questions

=== Question 1 (2 points)

In the last project, we learned about Decision Trees. As a brief recap, Decision Trees are a type of model that classifies data based on a series of conditions. These conditions are found during training, where the model will attempt to split the data into groups that are as pure as possible (100% pure being a group of datapoints that only contains a single class). As you may remember, one fatal downside of Decision Trees is how prone they are to overfitting.

Extra Trees and Random Forests help address this downside by creating an ensemble of multiple decision trees.

To review how a decision tree works, please classify the following three data points using the below decision tree.

[cols="3,3,3",options="header"]
|===
| Hue | Weight | Texture
| 10 | 150 | Smooth
| 25 | 200 | Rough
| 10 | 150 | Fuzzy
|===

image::f24-301-p8-1.png[Example Decision Tree, width=792, height=500, loading=lazy, title="Example Decision Tree"]

Ensemble methods work by creating multiple models and combining their results, but they all do it slightly differently.

Random Forests work by creating multiple Decision Trees, each trained on a "bootstrapped dataset". This concept of bootstrapping allows the model to turn the original dataset into many slightly different datasets, resulting in many different models. A common and safe method for these bootstrapped datasets is to create a dataset the same size as the original dataset, but to allow the same point to be sampled multiple times (sampling with replacement).

Extra Trees work in a somewhat similar manner. Instead of using the entire dataset to train each tree, however, Extra Trees will only select a random subset of features and data to train each tree. This leads to a more diverse set of trees, which helps reduce overfitting. Additionally, since features may be excluded from some trees, it can help reduce the impact of noisy features and lead to more robust classification splits.

When making a prediction, each tree in the ensemble will make a prediction, and the final prediction will be the majority vote of all the trees (similar to our KNN).

If we had the following Random Forest, what classification would the forest make for the same three data points?

image::f24-301-p8-2.png[Example Random Forest, width=792, height=500, loading=lazy, title="Example Random Forest"]

.Deliverables
====
- Predictions of the 3 data points using the Decision Tree
- Predictions of the 3 data points using the Random Forest
====

=== Question 2 (2 points)

Creating a Random Forest in scikit-learn is very similar to creating a Decision Tree.
The main difference is that you will be using the `RandomForestClassifier` class instead of the `DecisionTreeClassifier` class. - -Please load the Iris dataset, scale it, and split it into training and testing sets using the below code. -[source,python] ----- -import pandas as pd -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import train_test_split - -df = pd.read_csv('/anvil/projects/tdm/data/iris/Iris.csv') -X = df.drop(['Species','Id'], axis=1) -y = df['Species'] - -scaler = StandardScaler() -X_scaled = scaler.fit_transform(X) - -X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=20) - -y_train = y_train.to_numpy() -y_test = y_test.to_numpy() ----- - -Creating a random forest in scikit is just as simple as creating a decision tree. You can create a random forest using the below code. - -[source,python] ----- -from sklearn.ensemble import RandomForestClassifier - -forest = RandomForestClassifier(n_estimators=100, random_state=20) -forest.fit(X_train, y_train) ----- - -Random forests have 1 additional parameter compared to decision trees, `n_estimators`. This parameter simply controls the number of trees in the forest. The more trees you have, typically the more robust your model will be. However, having more trees leads to longer training and prediction times, so you will need to find a balance. - - -Let's see how it performs with 100 n_estimators by running the below code. - -[source,python] ----- -from sklearn.metrics import accuracy_score -y_pred = forest.predict(X_test) -accuracy = accuracy_score(y_test, y_pred) - -print(f'Model is {accuracy*100:.2f}% accurate') ----- - -If you remember from the previous project, one of the benefits of Decision Trees is their interpretability, and the ability to display them to understand how they are working. In a large random forest, this is not quite as easy considering how many trees are in the forest. However, you can still display individual trees in the forest by accessing them in the `forest.estimators_` list. Please run the below code to display the first tree in the forest. - -[source,python] ----- -from sklearn.tree import plot_tree -import matplotlib.pyplot as plt - -plt.figure(figsize=(10,7)) -plot_tree(forest.estimators_[0], filled=True) ----- - -Since we are able to access individual trees in the forest, we can also simply use a single tree in the forest to make predictions. This can be useful if you want to understand how a single tree is making predictions, or if you want to see how a single tree is performing. - -.Deliverables -==== -- Accuracy of the Random Forest model with 100 n_estimators -- Display the first tree in the forest -==== - -=== Question 3 (2 points) - -Similar to investigating the Decision Tree's parameters in project 7, let's investigate how the number of trees in the forest affects the accuracy of the model. Additionally, we will also measure the time it takes to train and test the model. - -Please create random forests with 10 through 1000 trees, in increments of 10, and record the accuracy of each model and time it takes to train/test into lists called `accuracies` and `times`, respectively. Plot the number of trees against the accuracy of the model. Be sure to use a `random_state` of 13 for reproducibility. - -Code to display the accuracy of the model is below. 
-[source,python] ---- -import matplotlib.pyplot as plt - -plt.plot(range(10, 1001, 10), accuracies) -plt.xlabel('N_Estimators') -plt.ylabel('Accuracy') -plt.title('Accuracy vs N_Estimators') ----- - -Code to display the time it takes to train and test the model is below. -[source,python] ---- -import matplotlib.pyplot as plt - -plt.plot(range(10, 1001, 10), times) -plt.xlabel('N_Estimators') -plt.ylabel('Time') -plt.title('Time vs N_Estimators') ----- - -.Deliverables -==== -- Code to generate the data for the plots -- Graph showing the number of trees in the forest against the accuracy of the model -- Graph showing the number of trees in the forest against the time it takes to train and test the model -- What is happening in the first graph? Why do you think this is happening? -- What is the relationship between the number of trees and the time it takes to train and test the model (linear, exponential, etc)? -==== - -=== Question 4 (2 points) - -Now, let's look at our Extra Trees model. Creating an Extra Trees model is the same as creating a Random Forest model, but using the `ExtraTreesClassifier` class instead of the `RandomForestClassifier` class. - -[source,python] ---- -from sklearn.ensemble import ExtraTreesClassifier - -extra_trees = ExtraTreesClassifier(n_estimators=100, random_state=20) -extra_trees.fit(X_train, y_train) ----- - - -Let's see how it performs with 100 n_estimators by running the below code. - -[source,python] ---- -from sklearn.metrics import accuracy_score - -y_pred = extra_trees.predict(X_test) -accuracy = accuracy_score(y_test, y_pred) - -print(f'Model is {accuracy*100:.2f}% accurate') ----- - -.Deliverables -==== -- Accuracy of the Extra Trees model with 100 n_estimators -==== - -=== Question 5 (2 points) - -It would be repetitive to investigate how n_estimators affects the accuracy and time of the Extra Trees model, as it would be the same as the Random Forest model. - -Instead, let's look into the differences between the two models. The primary difference between these two models is how they select the data to train each tree. Random Forests use bootstrapping to create multiple datasets, while Extra Trees use a random subset of features and data to train each tree. - -We can see how important each feature is to the model by looking at the `feature_importances_` attribute of the model. This attribute will show how important each feature is to the model, with higher values being more important. Please run the below code to create new Random Forest and Extra Trees models, and display the feature importance for each. Then, write your own code to calculate the average number of features being used in each tree for both models (one possible approach is sketched just below, before the provided code).
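For the second part of this question, one possible way (a sketch, not the only approach) to count how many distinct features each tree actually uses is to inspect each fitted tree's `tree_.feature` array, where non-negative entries are the indices of features used at split nodes. The sketch assumes the fitted models are named `forest` and `extra_trees`, as in the provided code that follows.

[source,python]
----
import numpy as np

def avg_features_per_tree(ensemble):
    """Average number of distinct features used for splits, per tree in a fitted ensemble."""
    counts = []
    for tree in ensemble.estimators_:
        # tree_.feature stores the feature index used at each node;
        # leaf nodes are marked with a negative value, so filter them out.
        split_features = tree.tree_.feature[tree.tree_.feature >= 0]
        counts.append(len(np.unique(split_features)))
    return np.mean(counts)

# Example usage, once `forest` and `extra_trees` have been fitted below:
# print(avg_features_per_tree(forest))
# print(avg_features_per_tree(extra_trees))
----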
- -[source,python] ---- -import matplotlib.pyplot as plt - -forest = RandomForestClassifier(n_estimators=100, random_state=20, bootstrap=True, max_depth=4) -forest.fit(X_train, y_train) - -extra_trees = ExtraTreesClassifier(n_estimators=100, random_state=20, bootstrap=True, max_depth=4) -extra_trees.fit(X_train, y_train) - -plt.bar(X.columns, forest.feature_importances_) -plt.title('Random Forest Feature Importance') -plt.show() - -plt.bar(X.columns, extra_trees.feature_importances_) -plt.title('Extra Trees Feature Importance') -plt.show() ----- - -.Deliverables -==== -- Code to display the feature importance of the Random Forest and Extra Trees models -- Code to calculate the average number of features being used in each tree for both models -- What are the differences between the feature importances of the Random Forest and Extra Trees models? Why do you think this is? -==== - -=== Question 6 (2 points) - -There is an additional ensemble method we have not yet discussed, known as Boosting. Both the Random Forest and Extra Trees models utilize something called bootstrap aggregation, 'Bagging' for short. Bagging works by creating multiple models in parallel and averaging their results, each given a slightly different dataset generated through bootstrapping the original. Boosting, on the other hand, works by creating multiple models in sequence. After each model is created, its deficiencies are recorded and used to construct the next model to improve upon those deficiencies. Typically, this can lead to a very accurate model. However, since it is creating models in sequence with the goal of improving the accuracy based solely on the training data, it is very prone to overfitting. - -Scikit-learn provides three boosting models: AdaBoost, Gradient Boosting, and HistGradient Boosting. Each of these models follows the general theory of boosting, but they have slightly different ways of improving upon the deficiencies of the previous model. - -AdaBoost: AdaBoost works by creating a model, then creating a second model that focuses on the deficiencies of the first model. Specifically, it will focus on high error points in the training data. This allows the model to rapidly improve upon the deficiencies of the previous model; however, it becomes prone to outliers and noise in the data. - -Gradient Boosting: Gradient Boosting works by creating a model, then creating a second model that focuses on the residuals of the first model. These residuals are the difference between the predicted value and the actual value. This allows the model to focus on the errors of the previous model, and can lead to a very accurate model. - -HistGradient Boosting: HistGradient Boosting is a newer model that is optimized to work with very large datasets (10,000+ samples). It works by creating a histogram of the data, then works the same as Gradient Boosting. - -Because the Iris dataset is relatively small, we will not be using HistGradient Boosting. However, we will be using and comparing AdaBoost and Gradient Boosting. - -Please create an AdaBoost model and a Gradient Boosting model using the below code. - -[NOTE] -==== -A fun thing about the AdaBoost model is that it can be used with any model, not just Decision Trees. One argument in its constructor is `estimator`, which allows you to provide the base model to use. By default, it uses a Decision Tree with a max depth of 1 (so it is generally preferred to change this), but you can use any model you want.
For this project, we will be using a Decision Tree with a max depth of 2. -==== - -[source,python] ---- -from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier -from sklearn.tree import DecisionTreeClassifier - -adaboost = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2), n_estimators=100, random_state=20) -adaboost.fit(X_train, y_train) - -gradient_boost = GradientBoostingClassifier(n_estimators=100, random_state=20, max_depth=2) -gradient_boost.fit(X_train, y_train) - -y_pred = adaboost.predict(X_test) -accuracy = accuracy_score(y_test, y_pred) - -print(f'AdaBoost is {accuracy*100:.2f}% accurate') - -y_pred = gradient_boost.predict(X_test) -accuracy = accuracy_score(y_test, y_pred) - -print(f'Gradient Boosting is {accuracy*100:.2f}% accurate') ----- - -.Deliverables -==== -- Accuracy of the AdaBoost model -- Accuracy of the Gradient Boosting model -==== - -=== Question 7 (2 points) -In question 3, we looked at how the number of trees in the forest affected the accuracy of the model. In this question, we will look at how the number of trees in the AdaBoost and Gradient Boosting models affects the accuracy of the model. - -Please write code to find the accuracy of the AdaBoost and Gradient Boosting models with 10 through 500 trees, in increments of 10. Record the accuracy of each model into lists called `ada_accuracies` and `grad_accuracies`, respectively. Additionally, record the training/testing time of each model into lists called `ada_times` and `grad_times`, respectively. Plot the number of trees against the accuracy of the model for both models. Be sure to use a `random_state` of 7 and `max_depth` of 2 for reproducibility. - -Code to graph the accuracy of the models is below. -[source,python] ---- -import matplotlib.pyplot as plt - -plt.plot(range(10, 501, 10), ada_accuracies) -plt.plot(range(10, 501, 10), grad_accuracies) -plt.legend(['AdaBoost Accuracy', 'GradientBoost Accuracy']) -plt.xlabel('N_Estimators') -plt.ylabel('Accuracy') -plt.title('Accuracy vs N_Estimators') ----- - -Code to graph the time it takes to train and test the models is below. -[source,python] ---- -import matplotlib.pyplot as plt - -plt.plot(range(10, 501, 10), ada_times) -plt.plot(range(10, 501, 10), grad_times) - -plt.legend(['AdaBoost Time', 'GradientBoost Time']) -plt.xlabel('N_Estimators') -plt.ylabel('Time') -plt.title('Time vs N_Estimators') ----- - -.Deliverables -==== -- How do the accuracies of the models compare to each other? -- How do their training/testing times compare to each other? Is there something unexpected in the Time with AdaBoost? Why do you think this is happening? (hint: read the scikit-learn documentation, specifically the `n_estimators` parameter) -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project8.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in Gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission].
- -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project9.adoc b/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project9.adoc deleted file mode 100644 index a885ad5ed..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-project9.adoc +++ /dev/null @@ -1,385 +0,0 @@ -= 401 Project 09 - Regression: Basics -:page-mathjax: true - -== Project Objectives - -In this project, we will learn about the basics of regression. We will explore common regression techniques and how to interpret their results. We will also investigate the strengths and weaknesses of different regression techniques and how to choose the right one for a given problem. - -.Learning Objectives -**** -- Basics of Regression -- Regression specific terminology and metrics -- Popular regression techniques -**** - -== Supplemental Reading and Resources - -== Dataset - -- `/anvil/projects/tdm/data/boston_housing/boston.csv` - -== Questions - -=== Question 1 (2 points) - -The most common regression technique is linear regression. Have you ever generated a trendline in Excel? If so, that is a form of linear regression! There are multiple forms of linear regression, but the most common is called `simple linear regression`. Other forms include `multiple linear regression`, and `polynomial regression`. - -[NOTE] -==== -It may seem counter intuitive that `polynomial regression` is considered a form of linear regression. When the regression model is trained for some polynomial degree, say y= ax^2 + bx + c, the model does not know that x^2 is the square of x. It instead treats x^2 as a separate variable (z = x^2), ie. y = az + bx + c, thus a linear equation. Colinearity between z and x are an issue, which is why regularization techniques, such as lasso and ridge regression, should be used to help prevent overfitting. -==== - -Each of these forms is slightly different, but at their core, they all attempt to model the relationship between one or more independent variables and one or more dependent variable. - -[cols="4,4,4,4",options="header"] -|=== -| Model | Independent | Dependent | Description -| Simple Linear Regression | 1 variable | 1 variable | Models the relationship between one independent variable and one dependent variable. Think of y = ax + b, where y is the dependent variable, x is the independent variable, and a and b are coefficients. -| Multiple Linear Regression | 2+ variables | 1 variable | Models the relationship between two or more independent variables and one dependent variable. Think z = ax + by + c, where z is the dependent variable, x and y are independent variables, and a, b, and c are coefficients. -| Polynomial Regression | 1 variable | 1 variable | Models the relationship between one or more independent variables and one dependent variable using a polynomial function. Think y = ax^2 + bx + c, where y is the dependent variable, x is the independent variable, and a, b, and c are coefficients. 
-| Multiple Polynomial Regression | 2+ variables | 1 variable | Models the relationship between two or more independent variables and one dependent variable using a polynomial function. Think z = ax^2 + by^2 + cx + dy + e, where z is the dependent variable, x and y are independent variables, and a, b, c, d, and e are coefficients. -| Multivariate Linear Regression | 2+ variables | 2+ variables | Models the relationship between two or more independent variables and two or more dependent variables. Can be linear or polynomial. Think Y = AX + B. Where Y and X are matrices, and A and B are matrices of coefficients. This allows predictions of multiple dependent variables at once. -| Multivariate Polynomial Regression | 2+ variables | 2+ variables | Same as Multivariate Linear Regression, but for polynomials. Also generalized to Y = AX + B, however X must have all independent variables and their polynomial terms, and A and B must be much larger matrices to store these coefficients. -|=== - -For this question, please run the following code the load our very simple dataset. - -[source,python] ----- -import numpy as np -import matplotlib.pyplot as plt - -# Data -x = np.array([1, 2, 3, 4, 5]) -y = np.array([5.5, 7, 9.5, 12.5, 13]) ----- - -Using the data above, find the best fit line for the data using simple linear regression. Store the slope and y-intercept in the variables `a` and `b` respectively. -[NOTE] -==== -We can find lines of best fit using the np.polyfit function. Although this function is built for polynomial regression, it can be used for simple linear regression by setting the degree parameter to 1. This function returns an array of coefficients, ordered from highest degree to lowest. For simple linear regression (y = mx + b), the first coefficient is the slope (m) and the second is the y-intercept (b). -==== - -[source,python] ----- -# Find the best fit line - -# YOUR CODE HERE -a, b = np.polyfit(x, y, 1) - -# Plot the data and the best fit line - -print(f'Coefficients Found: {a}, {b}') -y_pred = a * x + b - -plt.scatter(x, y) -plt.plot(x, y_pred, color='red') -plt.xlabel('x') -plt.ylabel('y') -plt.show() ----- - -.Deliverables -==== -- Coefficients found by np.polyfit with degree 1 -- Plot of the data and the best fit line -==== - -=== Question 2 (2 points) - -After finding the best fit line, we should have two variables stored: `y`, and `y_pred`. Now that we have these, we can briefly discuss evaluation metrics for regression models. There are many, many metrics that can be used to evaluate regression models. We will discuss a few of the most common ones here, but we implore you to do further research on your own to learn about more metrics. - -[cols="4,4,4,4",options="header"] -|=== -| Metric | Description | Formula | Range -| Mean Squared Error (MSE) | Average of the squared differences between the predicted and actual values. | $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. | $[0, \infty)$ -| Root Mean Squared Error (RMSE) | Square root of the MSE. | $RMSE = \sqrt{MSE}$ | $[0, \infty)$ -| Mean Absolute Error (MAE) | Average of the absolute differences between the predicted and actual values. | $MAE = \frac{1}{n} \sum_{i=1}^{n} \mid y_i - \hat{y}_i \mid $, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. | $[0, \infty)$ -| R-Squared | Explains the variance of dependent variables that can be explained by the independent variables. 
| $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, where $SS_{res}$ is the sum of squared residuals (actual - prediction) and $SS_{tot}$ is the total sum of squares (actual - mean of actual). | $[0, 1]$ -|=== - - -Using the variables `y` and `y_pred` from the previous question, calculate the following metrics: MSE, RMSE, MAE, and R-Squared. Write functions for `get_mse`, `get_rmse`, `get_mae`, and `get_r_squared`, that each take in actual and predicted values. Call these functions on y and y_pred and store the results in the variables `mse`, `rmse`, `mae`, and `r_squared` respectively. - -[source,python] ----- -# Calculate the evaluation metrics -# YOUR CODE HERE -def get_mse(y, y_pred): - pass - -def get_rmse(y, y_pred): - pass - -def get_mae(y, y_pred): - pass - -def get_r_squared(y, y_pred): - pass - -mse = get_mse(y, y_pred) -rmse = get_rmse(y, y_pred) -mae = get_mae(y, y_pred) -r_squared = get_r_squared(y, y_pred) - - -print(f'MSE: {mse}') -print(f'RMSE: {rmse}') -print(f'MAE: {mae}') -print(f'R-Squared: {r_squared}') ----- - -.Deliverables -==== -- Output of the evaluation metrics -==== - -=== Question 3 (2 points) - -Now that we understand some evaluation metrics, let's see how polynomial regression compares to simple linear regression on our same dataset. We will explore a range of polynomial degrees and see how the evaluation metrics change. Firstly, let's write a function that will take in an x value and an array of coefficients and return the predicted y value using a polynomial function. - -[source,python] ----- -def poly_predict(x, coeffs): - # y_intercept is the last element in the array - y_intercept = None # your code here - - # predicted value can start as the y-intercept - predicted_value = y_intercept - - # The rest of the elements are the coefficients, so we can determine the degree of the polynomial - coeffs = coeffs[:-1] - current_degree = None # your code here - - # Now, we can iterate through the coefficients and make a sum of coefficient * (x^current_degree) - # remember that the first element in the array is the coefficient for the highest degree, and the last element is the coefficient for the lowest degree - for i, coeff in enumerate(coeffs): - # your code here to increment the predicted value - - pass - - return predicted_value ----- - -Once you have created this function, please run the following code to ensure that it works properly. - -[source,python] ----- -assert poly_predict(2, [1, 2, 3]) == 11 -assert poly_predict(4, [1, 2, 3]) == 27 -assert poly_predict(3, [1, 2, 3, 4, 5]) == 179 -assert poly_predict(4, [2.5, 2, 3]) == 51 -print("poly_predict function is working!") ----- - -Now, we will perform the np.polyfit function for degrees ranging from 2 to 5. For each degree, we will get the coefficients, calculate the predicted values, and then calculate the evaluation metrics. Store the results in a dictionary where the key is the degree and the value is a dictionary containing the evaluation metrics. - -[NOTE] -==== -If you correctly implement this, numpy will issue a warning that says "RankWarning: Polyfit may be poorly conditioned". We expect you to run into this and think about what it means. 
You can hide this message by running the code -[source,python] ---- -import warnings -warnings.simplefilter("ignore", np.RankWarning) ----- -==== - -[source,python] ---- -results = dict() -for degree in range(2, 6): - # get the coefficients - coeffs = None # your code here - - # Calculate the predicted values - y_pred = None # your code here - - # Calculate the evaluation metrics - mse = get_mse(y, y_pred) - rmse = get_rmse(y, y_pred) - mae = get_mae(y, y_pred) - r_squared = get_r_squared(y, y_pred) - - # Store the results in a new dictionary that is stored in the results dictionary - # eg, results[2] = {'mse': 0.5, 'rmse': 0.7, 'mae': 0.3, 'r_squared': 0.9} - results[degree] = None # your code here - -results ----- - -.Deliverables -==== -- Function poly_predict -- Output of the evaluation metrics for each degree of polynomial regression -- Which degree of polynomial regression performed the best? Do you think this is the best model for this data? Why or why not? -==== - -=== Question 4 (2 points) - -In question 1, we briefly mentioned that regularization techniques are used to help prevent overfitting. Regularization techniques add a term to the loss function that penalizes the model for having large coefficients. In practice, this helps make sure that the model is fitting to patterns in the data, rather than noise or outliers. The two most common regularization techniques for machine learning are LASSO (L1 Regularization) and Ridge (L2 Regularization). - -LASSO is an acronym for Least Absolute Shrinkage and Selection Operator. Essentially, this regularization technique computes the sum of the absolute values of the coefficients and uses it as the penalty term in the loss function. This helps ensure that the magnitude of coefficients is kept small, and can often lead to some coefficients being set to zero. In effect, this helps the model perform feature selection to improve generalization. - -Ridge regression works in a similar manner, however it uses the sum of each coefficient squared instead of the absolute value. This also helps force the model to use smaller coefficients, but typically does not set any coefficients to zero. This typically helps reduce collinearity between features. - - -Now, our 5th degree polynomial from the previous question had perfect accuracy. However, looking at the data yourself, do you really believe that the data is best represented by a 5th degree polynomial? The linear regression model from question 1 is likely the best model for this data. Using the coefficients from the 5th degree polynomial, print the predicted values for the following x values: - -[source,python] ---- -x_values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) ----- - -Are the predicted y values reasonable for 1 through 5? What about 6 through 10? - -Let's see if we can improve this 5th degree polynomial by using Ridge regression. Ridge regression is implemented in the scikit-learn linear models module under the `Ridge` class. Additionally, to ensure we are using a 5th degree polynomial, we will first need the `PolynomialFeatures` class from the preprocessing module. Finally, we can use scikit-learn pipelines to chain these two models together through the `make_pipeline` function. The code below demonstrates how to use these three classes together (a short sketch for printing the degree-5 predictions on `x_values` comes first).
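First, here is a minimal sketch of one way to print those predictions. It assumes `coeffs` holds the degree-5 output of `np.polyfit` from Question 3 (highest-degree coefficient first); you could equally reuse your own `poly_predict` function.

[source,python]
----
import numpy as np

x_values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# np.polyval evaluates a polynomial whose coefficients are ordered from
# highest degree to lowest (the same ordering np.polyfit returns).
y_values = np.polyval(coeffs, x_values)

for x_val, y_val in zip(x_values, y_values):
    print(f"x = {x_val:2d} -> predicted y = {y_val:.2f}")
----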
- -[source,python] ----- -from sklearn.preprocessing import PolynomialFeatures -from sklearn.linear_model import Ridge -from sklearn.pipeline import make_pipeline - -n_degree = 5 -polyfeatures = PolynomialFeatures(degree=n_degree) - -alpha = 0.1 # regularization term. Higher values = more regularization, 0 = simple linear regression -ridge = Ridge(alpha=alpha) - -model = make_pipeline(polyfeatures, ridge) - -# we need to reshape the data to be a 2D array -x = np.array([1, 2, 3, 4, 5]) -x = x.reshape(-1, 1) -y = np.array([5.5, 7, 9.5, 12.5, 13]) - -# fit the model -model.fit(x, y) - -# predict the values -y_pred = model.predict(x) ----- - -Now that you have a fitted Ridge model, what are the coefficients (you can get them with `model.named_steps['ridge'].coef_` and `model.named_steps['ridge'].intercept_`), and how do they compare to the previous 5th degree polynomial's coefficients? Are these predicted values more reasonable for 1 through 5? what about 6 through 10? - -.Deliverables -==== -- Predicted values for x_values using the 5th degree polynomial -- Are the predicted values reasonable for 1 through 5? what about 6 through 10? -- Code to use Ridge regression on the 5th degree polynomial -- How do the coefficients of the Ridge model compare to the 5th degree polynomial? -- Are the L2 regularization predicted values more reasonable for 1 through 5? what about 6 through 10? -==== - -=== Question 5 (2 points) - -As you see from the previous question, Ridge can help penalize large coefficients to help stop overfitting. However, it can never really fully recover when our baseline model is overfit. LASSO, on the other hand, can help us recover from overfitting by setting some coefficients to zero. Let's see if LASSO can help us recover from our overfit 5th degree polynomial. - -LASSO regression is implemented in the scikit-learn linear models module under the `Lasso` class. We can use the same pipeline as before, but with the Lasso class instead of the Ridge class. - -[NOTE] -==== -The Lasso class has an additional parameter, max_iter, which is the maximum number of iterations to run the optimization. For this question, set max_iter=10000. -==== - -After you have done this, let's see how changing the value of `alpha` affects our coefficients. To give an overall value to the coefficients, we will use the L1 method, which is the sum of the absolute values of the coefficients. For example, the below code will give the L1 value of the LASSO coefficients. - -[source,python] ----- -value = np.sum(np.abs(model.named_steps['lasso'].coef_)) ----- - -For each alpha value from .1 to 1 in increments of .01, fit the LASSO model and Ridge model to the data. Calculate the L1 value of the model's coefficients for each alpha value, and store them in the lists `lasso_values` and `ridge_values` respectively. Then, run the below code to plot the alpha values against the L1 values for both the LASSO and Ridge models. - -[source,python] ----- -plt.plot(np.arange(.1, 1.01, .01), lasso_values, label='LASSO') -plt.plot(np.arange(.1, 1.01, .01), ridge_values, label='Ridge') -plt.xlabel('Alpha') -plt.ylabel('L1 Value') -plt.legend() -plt.show() ----- - -.Deliverables -==== -- How do the LASSO model's coefficients compare to the 5th degree polynomial? -- How do the LASSO model's coefficients compare to the Ridge model's coefficients? -- What is the relationship between the alpha value and the L1 value for both the LASSO and Ridge models? 
-==== - -=== Question 6 (2 points) - -There are many other forms of regression that can be discussed. Complex machine learning models such as Neural Networks and Support Vector Machines can be used for regression. Additionally, our previous classification models can all be adapted to solve regression problems. For example, we can have a KNN calculate the mean of the k nearest neighbors to predict a value. - -As we mentioned in question 1, there are more complex multivariate regression models that are used to solve multiple dependent variable problems. These models can be linear or polynomial, and can be regularized with LASSO or Ridge. - -Let's first implement a Multivariate Linear Regression model. We will use a new dataset this time, as we will need multiple dependent variables. We will use the Boston housing dataset for this question, and we will try to predict both the `MEDV` and `CRIM` columns. Please run the below code to load the dataset. - -[source,python] ---- -import pandas as pd -from sklearn.preprocessing import StandardScaler -from sklearn.model_selection import train_test_split -df = pd.read_csv('/anvil/projects/tdm/data/boston_housing/boston.csv') - -X = df.drop(columns=['MEDV', 'CRIM']) -y = df[['MEDV', 'CRIM']] - -scaler = StandardScaler() -X = scaler.fit_transform(X) - -X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7) ----- - -In order to predict multiple columns at once, we can use the `MultiOutputRegressor` class from the scikit-learn multioutput module. This class takes in another model (linear, polynomial, ridge, etc), and fits it to each dependent variable in the dataset. The below code demonstrates how to use this class with a linear regression model. - -[NOTE] -==== -In the background, the `MultiOutputRegressor` class is fitting a separate model to each dependent variable. This means that the model is not learning the relationship between the dependent variables, but rather the relationship between the independent variables and each dependent variable. This can be useful when the dependent variables are not related, but can be a limitation when they are. -==== - -[source,python] ---- -from sklearn.linear_model import LinearRegression -from sklearn.multioutput import MultiOutputRegressor - -model = LinearRegression() -multi_model = MultiOutputRegressor(model) - -multi_model.fit(X_train, y_train) -y_pred = multi_model.predict(X_test) - -# output the coefficients -print(multi_model.estimators_[0].coef_) ----- - -Given the above code for a Multivariate Linear Regression Model, can you implement a Multivariate LASSO and Ridge model? How do the coefficients of the LASSO and Ridge models compare to the Linear model? (Hint: use an alpha of 0.5) - -Finally, compute the mean squared error and r-squared for each model for each dependent variable. Then, average the results to get a single value for each metric. - -[NOTE] -==== -You can use the `get_mse` and `get_r_squared` functions from question 2 to calculate the evaluation metrics. However, since we have multiple dependent variables, you will need to calculate these metrics for each dependent variable and average them to get a single value for each metric. I recommend you split the y_test and y_pred arrays by column, e.g. `cols = [y_pred[:, 0], y_pred[:, 1]]`, and then iterate through the columns to calculate the metrics; a brief sketch of this loop is shown below.
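For example, a minimal sketch of that loop (assuming `y_pred` comes from one of your fitted multi-output models and that `get_mse` and `get_r_squared` are the functions you wrote in question 2) might look like:

[source,python]
----
import numpy as np

# y_test is a DataFrame with columns ['MEDV', 'CRIM']; convert it to a NumPy
# array so it can be sliced by column the same way as y_pred.
y_test_arr = y_test.to_numpy()

mse_per_col = []
r2_per_col = []
for col in range(y_test_arr.shape[1]):
    mse_per_col.append(get_mse(y_test_arr[:, col], y_pred[:, col]))
    r2_per_col.append(get_r_squared(y_test_arr[:, col], y_pred[:, col]))

# Average across the two dependent variables to get one value per metric.
print(f"Average MSE: {np.mean(mse_per_col):.4f}")
print(f"Average R-squared: {np.mean(r2_per_col):.4f}")
----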
-==== - -.Deliverables -==== -- Code to implement Multivariate LASSO and Ridge models -- How do the coefficients of the LASSO and Ridge models compare to the Linear model? -- Output of the evaluation metrics for each model for each dependent variable -==== - -== Submitting your Work - -.Items to submit -==== -- firstname_lastname_project9.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. **Please** take the time to double check your work. See xref:submissions.adoc[the instructions on how to double check your submission]. - -You **will not** receive full credit if your `.ipynb` file submitted in Gradescope does not **show** all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work. -==== diff --git a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-projects.adoc b/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-projects.adoc deleted file mode 100644 index eb8cdedb4..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/40100/40100-2024-projects.adoc +++ /dev/null @@ -1,51 +0,0 @@ -= TDM 40100 - -== Important Links - -xref:fall2024/logistics/office_hours.adoc[[.custom_button]#Office Hours#] -xref:fall2024/logistics/syllabus.adoc[[.custom_button]#Syllabus#] -https://piazza.com/purdue/fall2024/tdm1010010200202425[[.custom_button]#Piazza#] - -== Assignment Schedule - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. 
-==== - -|=== -| Assignment | Release Date | Due Date -| Syllabus Quiz | Aug 19, 2024 | Aug 30, 2024 -| Academic Integrity Quiz | Aug 19, 2024 | Aug 30, 2024 -| Project 1 - Intro to ML - Using Anvil | Aug 19, 2024 | Aug 30, 2024 -| Project 2 - Intro to ML - Basic Concepts | Aug 22, 2024 | Aug 30, 2024 -| Project 3 - Intro to ML - Data Preprocessing | Aug 29, 2024 | Sep 06, 2024 -| Project 4 - Classifiers - Basics of Classification | Sep 05, 2024 | Sep 13, 2024 -| Outside Event 1 | Aug 19, 2024 | Sep 13, 2024 -| Project 5 - Classifiers - K-Nearest Neighbors (KNN) I | Sep 12, 2024 | Sep 20, 2024 -| Project 6 - Classifiers - K-Nearest Neighbors (KNN) II | Sep 19, 2024 | Sep 27, 2024 -| Project 7 - Classifiers - Decision Trees | Sep 26, 2024 | Oct 04, 2024 -| Outside Event 2 | Aug 19, 2024 | Oct 04, 2024 -| Project 8 - Classifiers - Decision Tree Ensembles | Oct 03, 2024 | Oct 18, 2024 -| Project 9 - Regression: Basics | Oct 17, 2024 | Oct 25, 2024 -| Project 10 - Regression: Perceptrons | Oct 24, 2024 | Nov 01, 2024 -| Project 11 - Regression: Artificial Neural Networks (ANN) - Multilayer Perceptron (MLP) | Oct 31, 2024 | Nov 08, 2024 -| Outside Event 3 | Aug 19, 2024 | Nov 08, 2024 -| Project 12 - Regression: Bayesian Ridge Regression | Nov 7, 2024 | Nov 15, 2024 -| Project 13 - Hyperparameter Tuning | Nov 14, 2024 | Nov 29, 2024 -| Project 14 - Class Survey | Nov 21, 2024 | Dec 06, 2024 -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -// **Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/logistics/office_hours.adoc b/projects-appendix/modules/ROOT/pages/fall2024/logistics/office_hours.adoc deleted file mode 100644 index b0184d02c..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/logistics/office_hours.adoc +++ /dev/null @@ -1,34 +0,0 @@ -= Fall 2024 Office Hours Schedule -:page-aliases: projects:logistics:office_hours.adoc - -[IMPORTANT] -==== -Office hours after 5 PM will be held exclusively virtually, whereas office hours prior to 5 will be offered both in-person in the lobby of Hillenbrand Hall and remotely. - -Office Hours Zoom Link: https://purdue-edu.zoom.us/s/97774213087 - -Checklist for the Zoom Link: - -* When joining office hours, please include your Data Mine level in front of your name. For example, if you are in TDM 101, your name should be entered as “101 - [Your First Name] [Your Last Name]”. - -* After joining the Zoom call, please stay in the main room until a TA invites you to a specific breakout room. - -* We will continue to follow the office hours schedule as posted on the Examples Book. 
(https://the-examples-book.com/projects/fall2024/logistics/office_hours) -==== - -[NOTE] -==== -The below calendars represent regularly occurring office hours. Please check your class' Piazza page to view the latest information about any upcoming changes or cancellations for office hours prior to attending. -==== - -== TDM 10100 -image::f24-101-OH.png[10100 Office Hours Schedule, width=1267, height=800, loading=lazy, title="10100 Office Hours Schedule"] - -== TDM 20100 -image::f24-201-OH.png[20100 Office Hours Schedule, width=1267, height=800, loading=lazy, title="20100 Office Hours Schedule"] - -== TDM 30100 -image::f24-301-OH.png[30100 Office Hours Schedule, width=1267, height=800, loading=lazy, title="30100 Office Hours Schedule"] - -== TDM 40100 -image::f24-401-OH.png[40100 Office Hours Schedule, width=1267, height=800, loading=lazy, title="40100 Office Hours Schedule"] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/fall2024/logistics/syllabus.adoc b/projects-appendix/modules/ROOT/pages/fall2024/logistics/syllabus.adoc deleted file mode 100644 index 5a0647c27..000000000 --- a/projects-appendix/modules/ROOT/pages/fall2024/logistics/syllabus.adoc +++ /dev/null @@ -1,276 +0,0 @@ -= Fall 2024 Syllabus - The Data Mine Seminar - -== Course Information - -[%header,format=csv,stripes=even] -|=== -Course Number and Title, CRN -TDM 10100 - The Data Mine I, possible CRNs 12067 or 12072 or 12073 or 12071 or 24448 or 28162 or 28160 or 28161 -TDM 20100 - The Data Mine III, possible CRNs 12117 or 12106 or 12113 or 12118 or 24450 or 28174 or 28166 or 28171 -TDM 30100 - The Data Mine V, possible CRNs 12104 or 12112 or 12115 or 12120 or 24451 or 28173 or 28165 or 28170 -TDM 40100 - The Data Mine VII, possible CRNs 12103 or 12111 or 12114 or 12119 or 24449 or 28172 or 28163 or 28167 -TDM 50100 - The Data Mine Seminar, possible CRNs 15644 or 30617 or 30618 or 30619 or 28177 or 28184 or 28175 -|=== - -*Course credit hours:* -1 credit hour, so you should expect to spend about 3 hours per week doing work for the class - -*Prerequisites:* -TDM 10100 and TDM 10200 can be taken in either order. Both of these courses are introductory. TDM 10100 is an introduction to data analysis in R. TDM 10200 is an introduction to data analysis in Python. - -For all of the remaining TDM seminar courses, students are expected to take the courses in order (with a passing grade), namely, TDM 20100, 20200, 30100, 30200, 40100, 40200. The topics in these courses build on the knowledge from the previous courses. All students, regardless of background are welcome. TDM 50100 is geared toward graduate students and can be taken repeatedly; TDM 50100 meets concurrently with the other courses, at whichever level is appropriate for the graduate students in the course. We can make adjustments on an individual basis if needed. - - -=== Course Web Pages - -- link:https://the-examples-book.com/[*The Examples Book*] - All information will be posted within these pages! -- link:https://www.gradescope.com/[*Gradescope*] - All projects and outside events will be submitted on Gradescope -- link:https://purdue.brightspace.com/[*Brightspace*] - Grades will be posted in Brightspace. 
Students will also take the quizzes at the beginning of the semester on Brightspace -- link:https://piazza.com[*Piazza*] - Online Q/A Forum -- link:https://datamine.purdue.edu[*The Data Mine's website*] - Helpful resource -- link:https://ondemand.anvil.rcac.purdue.edu/[*Jupyter Lab via the On Demand Gateway on Anvil*] - -=== Meeting Times -There are officially 4 Monday class times: 8:30 am, 9:30 am, 10:30 am (all in the Hillenbrand Dining Court atrium—no meal swipe required), and 4:30 pm (https://purdue-edu.zoom.us/my/mdward[synchronous online], recorded and posted later; This online meeting is also available to students participating in Seminar from other universities outside of Purdue). There is also an asynchronous class section. All the information you need to work on the projects each week will be provided online on the Thursday of the previous week, and we encourage you to get a head start on the projects before class time. Dr. Ward does not lecture during the class meetings. Instead, the seminar time is a good time to ask questions and get help from Dr. Ward, the T.A.s, and your classmates. Attendance is not required. The T.A.s will have many daytime and evening office hours throughout the week. - -=== Course Description - -The Data Mine is a supportive environment for students in any major and from any background who want to learn some data science skills. Students will have hands-on experience with computational tools for representing, extracting, manipulating, interpreting, transforming, and visualizing data, especially big data sets, and in effectively communicating insights about data. Topics include: the R environment, Python, visualizing data, UNIX, bash, regular expressions, SQL, XML and scraping data from the internet, as well as selected advanced topics, as time permits. - -=== Learning Outcomes - -By the end of the course, you will be able to: - -. Discover data science and professional development opportunities in order to prepare for a career. -. Explain the difference between research computing and basic personal computing data science capabilities in order to know which system is appropriate for a data science project. -. Design efficient search strategies in order to acquire new data science skills. -. Devise the most appropriate data science strategy in order to answer a research question. -. Apply data science techniques in order to answer a research question about a big data set. - -=== Mapping to Foundational Learning Outcome (FLO) = Information Literacy - -Note: The Data Mine has applied for the course seminar to satisfy the information literacy outcome, but this request is still under review by the university. This request has not yet been approved. - -. *Identify a line of inquiry that requires information, including formulating questions and determining the scope of the investigation.* In each of the 14 weekly projects, the scope is described at a high level at the very top of the project. Students are expected to tie their analysis on the individual weekly questions back to the stated scope. As an example of the stated scope in a project: Understanding how to use Pandas and be able to develop functions allows for a systematic approach to analyzing data. In this project, students will already be familiar with Pandas but will not (yet) know at the outset how to "develop functions" and take a "systematic approach" to solving the questions. 
Students are expected to comment on each question about how their "line of inquiry" and "formulation of the question" ties back to the stated scope of the project. As the seminar progresses past the first few weeks, and the students are being asked to tackle more complex problems, they need to identify which Python, SQL, R, and UNIX tools to use, and which statements and queries to run (this is "formulating the questions"), in order to get to analyze the data, derive the results, and summary the results in writing and visualizations ("determining the scope of the investigation"). -. *Locate information using effective search strategies and relevant information sources.* The Data Mine seminar progresses by increasing the complexity of the problems. The students are being asked to solve complex problems using data science tools. Students need to "locate information" within technical documentation, API documentation, online manuals, online discussions such as Stack Overflow, etc. Within these online resources, they need to determine the "relevant information sources" and apply these sources to solve the data analysis problem at hand. They need to understand the context, motivation, technical notation, nomenclature of the tools, etc. We enable students to practice this skill on every weekly project during the semester, and we provide additional resources, such as Piazza (an online discussion platform to interact with peers, teaching assistants, and the instructor), office hours throughout the week, and attending in-person or virtual seminar, for interaction directly with the instructor. -. *Evaluate the credibility of information. The students work together this objective in several ways.* They need evaluate and analyze the "credibility of information" and data from a wide array of resources, e.g., from the federal government, from Kaggle, from online repositories and archives, etc. Each project during the semester focuses attention on a large data repository, and the students need to understand the credible data, the missing data, the inaccurate data, the data that are outliers, etc. Some of the projects for students involve data cleansing efforts, data imputation, data standardization, etc. Students also need to validate, verify, determine any missing data, understand variables, correlation, contextual information, and produce models and data visualizations from the data under consideration. -. *Synthesize and organize information from different sources in order to communicate.* This is a key aspect of The Data Mine. In many of the student projects, they need to assimilate geospatial data, categorical and numerical data, textual data, and visualizations, in order to have a comprehensive data analysis of a system or a model. The students can use help from Piazza, office hours, the videos from the instructor and seminar live sessions to synthesize and organize the information they are learning about, in each project. The students often need to also understand many different types of tools and aspects of data analysis, sometimes in the same project, e.g., APIs, data dictionaries, functions, concepts from software engineering such as scoping, encapsulation, containerization, and concepts from spatial and temporal analysis. Synthesizing many "different sources" to derive and "communicate" the analysis is a key aspect of the projects. -. 
*Attribute original ideas of others through proper citing, referencing, paraphrasing, summarizing, and quoting.* In every project, students need to use "citations to sources" (online and written), "referencing" forums and blogs where their cutting-edge concepts are "documented", proper methods of "quotation" and "citation", documentation of any teamwork, etc. The students have a template for their project submissions in which they are required to provide the proper citation of any sources, collaborations, reference materials, etc., in each and every project that they submit every week. -. *Recognize relevant cultural and other contextual factors when using information.* Students weekly project include data and information on data about (all types of genders), political data, geospatial questions, online forums and rating schema, textual data, information about books, music, online repositories, etc. Students need to understand not only the data analysis but also the "context" in which the data is provided, the data sources, the potential usage of the analysis and its "cultural" implications, etc. Students also complete professional development, attending several professional development and outside-the-classroom events each semester. The meet with alumni, business professionals, data practitioners, data engineers, managers, scientists from national labs, etc. They attend events about the "culture related to data science", and "multicultural events". Students are required to respond in writing to every such event, and their writing is graded and incorporated into the grades for the course. -. *Observe ethical and legal guidelines and requirements for the use of published, confidential, and/or proprietary information.* Students complete an academic integrity quiz at the beginning of each semester that sets the stage of these "ethical and legal guidelines and requirements". They have documentation about proper data handling and data management techniques. They learn about the context of data usage, including (for instance) copyrights, the difference between open source and proprietary data, different types of software licenses, the need for confidentiality with Corporate Partners projects, etc. - -=== Assessment of Foundational Learning Outcome (FLO) = Information Literacy - -Note: The Data Mine has applied for the course seminar to satisfy the information literacy outcome, but this request is still under review by the university. This request has not yet been approved. - -. *Assessment method for this course.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. -. *Identify a line of inquiry that requires information, including formulating questions and determining the scope of the investigation.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. Students identify which R and Python statements and queries to run (this is formulating the questions), in order to get to the results they think they are looking for (determining the scope of the investigation). -. 
*Locate information using effective search strategies and relevant information sources.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. The students are being asked to solve complex problems using data science tools. They need to figure out what they are looking to figure out, and to do that they need to figure out what to ask. -. *Evaluate the credibility of information. Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning.* Each week, these projects are graded by teaching assistants based on solutions provided. Some of the projects that students complete in the course involve data cleansing efforts including validation, verification, missing data, and modeling and students must evaluate the credibility as they move through the project. -. *Synthesize and organize information from different sources in order to communicate.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. Information on how to complete the projects is learned through many sources and student utilize an experiential learning model. -. *Attribute original ideas of others through proper citing, referencing, paraphrasing, summarizing, and quoting.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided set and then questions about the data set that engage the student in experiential learning. At the beginning of each project there is a question regarding citations for the project. -. *Recognize relevant cultural and other contextual factors when using information.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. For professional development event assessment – students are required to attend three approved events and then write a guided summary of the event. -. *Observe ethical and legal guidelines and requirements for the use of published, confidential, and/or proprietary information.* Students complete an academic integrity quiz at the beginning of each semester, and they are also graded on their proper documentation and usage of data throughout the semester, on every weekly project. - -=== Required Materials - -* A laptop so that you can easily work with others. Having audio/video capabilities is useful. -* Access to Brightspace, Gradescope, and Piazza course pages. -* Access to Jupyter Lab at the On Demand Gateway on Anvil: -https://ondemand.anvil.rcac.purdue.edu/ -* "The Examples Book": https://the-examples-book.com -* Good internet connection. - -=== Attendance Policy - -When conflicts or absences can be anticipated, such as for many University-sponsored activities and religious observations, the student should inform the instructor of the situation as far in advance as possible. 
- -For unanticipated or emergency absences when advance notification to the instructor is not possible, the student should contact the instructor as soon as possible by email or phone. When the student is unable to make direct contact with the instructor and is unable to leave word with the instructor’s department because of circumstances beyond the student’s control, and in cases falling under excused absence regulations, the student or the student’s representative should contact or go to the Office of the Dean of Students website to complete appropriate forms for instructor notification. Under academic regulations, excused absences may be granted for cases of grief/bereavement, military service, jury duty, parenting leave, and medical excuse. For details, see the link:https://catalog.purdue.edu/content.php?catoid=13&navoid=15965#a-attendance[Academic Regulations & Student Conduct section] of the University Catalog website. - -== How to succeed in this course - -If you would like to be a successful Data Mine student: - -* Start on the weekly projects on or before Mondays so that you have plenty of time to get help from your classmates, TAs, and Data Mine staff. Don’t wait until the due date to start! -* Be excited to challenge yourself and learn impressive new skills. Don’t get discouraged if something is difficult—you’re here because you want to learn, not because you already know everything! -* Remember that Data Mine staff and TAs are excited to work with you! Take advantage of us as resources. -* Network! Get to know your classmates, even if you don’t see them in an actual classroom. You are all part of The Data Mine because you share interests and goals. You have over 800 potential new friends! -* Use "The Examples Book" with lots of explanations and examples to get you started. Google, Stack Overflow, etc. are all great, but "The Examples Book" has been carefully put together to be the most useful to you. https://the-examples-book.com[the-examples-book.com] -* Expect to spend approximately 3 hours per week on the projects. Some might take less time, and occasionally some might take more. -* Don’t forget about the syllabus quiz, academic integrity quiz, and outside event reflections. They all contribute to your grade and are part of the course for a reason. -* If you get behind or feel overwhelmed about this course or anything else, please talk to us! -* Stay on top of deadlines. Announcements will also be sent out every Monday morning, but you should keep a copy of the course schedule where you see it easily. -* Read your emails! 
- - -== Information about the Instructors - -=== The Data Mine Staff - -[%header,format=csv] -|=== -Name, Title -Shared email we all read, datamine-help@purdue.edu -Kevin Amstutz, Senior Data Scientist -Donald Barnes, Guest Relations Administrator -Maggie Betz, Managing Director of The Data Mine at Indianapolis -Kimmie Casale, ASL Tutor -Bryce Castle, Corporate Partners Technical Specialist -Cai Chen, Corporate Partners Technical Specialist -Doug Crabill, Senior Data Scientist -Stacey Dunderman, Program Administration Specialist -Jessica Gerlach, Corporate Partners Technical Specialist -Dan Hirleman, Regional Director of The Data Mine of the Rockies -Jessica Jud, Senior Manager of Expansion Operations -Kali Lacy, Associate Research Engineer -Gloria Lenfestey, Senior Financial Analyst -Nicholas Lenfestey, Interim Managing Director of Corporate Partners -Naomi Mersinger, ASL Interpreter / Strategic Initiatives Coordinator -Kim Rechkemmer, Senior Program Administration Specialist -Katie Sanders, Chief Operating Officer -Betsy Satchell, Senior Administrative Assistant -Diva Sharma, Corporate Partners Technical Specialist -Dr. Mark Daniel Ward, Executive Director -|=== - -The Data Mine Team uses a shared email which functions as a ticketing system. Using a shared email helps the team manage the influx of questions, better distribute questions across the team, and send out faster responses. -You can use the https://piazza.com[Piazza forum] to get in touch. In particular, Dr. Ward responds to questions on Piazza faster than by email. - -=== Communication Guidance - -* *For questions about how to do the homework, use Piazza or visit office hours*. You will receive the fastest response by using Piazza versus emailing us. -* For general Data Mine questions, email datamine-help@purdue.edu -* For regrade requests, use Gradescope's regrade feature within Brightspace. Regrades should be -requested within 1 week of the grade being posted. - - -=== Office Hours - -The xref:fall2024/logistics/office_hours.adoc[office hours schedule is posted here.] - -Office hours are held in person in Hillenbrand lobby and on Zoom. Check the schedule to see the available times. - -=== Piazza - -Piazza is an online discussion board where students can post questions at any time, and Data Mine staff or T.A.s will respond. Piazza is available through Brightspace. There are private and public postings. Last year we had over 11,000 interactions on Piazza, and the typical response time was around 5-10 minutes. - -== Assignments and Grades - -=== Course Schedule & Due Dates - -Click below to view the Fall 2024 Course Schedule: - -xref:fall2024/10100/10100-2024-projects.adoc[TDM 10100] - -xref:fall2024/20100/20100-2024-projects.adoc[TDM 20100] - -https://the-examples-book.com/projects/fall2024/30100/30100-2024-projects[TDM 30100] - -xref:fall2024/40100/40100-2024-projects.adoc[TDM 40100] - -See the schedule and later parts of the syllabus for more details, but here is an overview of how the course works: - -In the first week of the beginning of the semester, you will have some "housekeeping" tasks to do, which include taking the Syllabus quiz and Academic Integrity quiz. - -Generally, every week from the very beginning of the semester, you will have your new projects released on a Thursday, and they are due 8 days later on the following Friday at 11:55 pm Purdue West Lafayette (Eastern) time. This semester, there are 14 weekly projects, but we only count your best 10. 
This means you could miss up to 4 projects due to illness or other reasons, and it won’t hurt your grade. - -We suggest trying to do as many projects as possible so that you can keep up with the material. The projects are much less stressful if they aren’t done at the last minute, and it is possible that our systems will be stressed if you wait until Friday night causing unexpected behavior and long wait times. Try to start your projects on or before Monday each week to leave yourself time to ask questions. - -Outside of projects, you will also complete 3 Outside Event reflections. More information about these is in the "Outside Event Reflections" section below. -The Data Mine does not conduct or collect an assessment during the final exam period. Therefore, TDM Courses are not required to follow the Quiet Period in the https://catalog.purdue.edu/content.php?catoid=16&navoid=20089[Academic Calendar]. - -=== Projects - -* The projects will help you achieve Learning Outcomes #2-5. -* Each weekly programming project is worth 10 points. -* There will be 14 projects available over the semester, and your best 10 will count. -* The 4 project grades that are dropped could be from illnesses, absences, travel, family emergencies, or simply low scores. No excuses necessary. -* No late work will be accepted, even if you are having technical difficulties, so do not work at the last minute. -* There are many opportunities to get help throughout the week, either through Piazza or office hours. We’re waiting for you! Ask questions! -* Follow the instructions for how to submit your projects properly through Gradescope in Brightspace. -* It is ok to get help from others or online, although it is important to document this help in the comment sections of your project submission. You need to say who helped you and how they helped you. -* Each week, the project will be posted on the Thursday before the seminar, the project will be the topic of the seminar and any office hours that week, and then the project will be due by 11:55 pm Eastern time on the following Friday. See the schedule for specific dates. -* If you need to request a regrade on any part of your project, use the regrade request feature inside Gradescope. The regrade request needs to be submitted within one week of the grade being posted (we send an announcement about this). - -=== Outside Event Reflections - -* The Outside Event reflections will help you achieve Learning Outcome #1. They are an opportunity for you to learn more about data science applications, career development, and diversity. -* Throughout the semester, The Data Mine will have many special events and speakers, typically happening in person so you can interact with the presenter, but some may be online and possibly recorded. -* These eligible opportunities will be posted on The Data Mine’s website (https://datamine.purdue.edu/events/[datamine.purdue.edu/events/]) and updated frequently. Feel free to suggest good events that you hear about, too. -* You are required to attend 3 of these over the semester, with 1 due each month. See the schedule for specific due dates. -* You are welcome to do all 3 reflections early. For example, you could submit all 3 reflections in September. -* You must submit your outside event reflection within 1 week of attending the event or watching the recording. -* Follow the instructions on Brightspace for writing and submitting these reflections. -* At least one of these events should be on the topic of Professional Development. 
These events will be designated by "PD" next to the event on the schedule. -* This semester you will answer questions directly in Gradescope including the name of the event and speaker, the time and date of the event, what was discussed at the event, what you learned from it, what new ideas you would like to explore as a result of what you learned at the event, and what question(s) you would like to ask the presenter if you met them at an after-presentation reception. This should not be just a list of notes you took from the event; it is a reflection. -* We read every single reflection! We care about what you write! We have used these connections to provide new opportunities for you, to thank our speakers, and to learn more about what interests you. - - -=== Late Work Policy - -We generally do NOT accept late work. For the projects, we count only your best 10 out of 14, so that gives you a lot of flexibility. We need to be able to post answer keys for the rest of the class in a timely manner, and we can't do this if we are waiting for other students to turn their work in. - -=== Grade Distribution - -[cols="4,1"] -|=== - -|Projects (best 10 out of Projects #1-14) |86% -|Outside event reflections (3 total) |12% -|Academic Integrity Quiz |1% -|Syllabus Quiz |1% -|*Total* |*100%* - -|=== - - -=== Grading Scale - -In this class grades reflect your achievement throughout the semester in the various course components listed above. Your grades will be maintained in Brightspace. This course will follow the 90-80-70-60 grading scale for A, B, C, D cut-offs. If you earn a 90.000 in the class, for example, that is a solid A. +/- grades will be given at the instructor's discretion below these cut-offs. If you earn an 89.11 in the class, for example, this may be an A- or a B. -* A: 100.000% - 90.000% -* B: 89.999% - 80.000% -* C: 79.999% - 70.000% -* D: 69.999% - 60.000% -* F: 59.999% - 0.000% - - - -=== Academic Integrity - -Academic integrity is one of the highest values that Purdue University holds. Individuals are encouraged to alert university officials to potential breaches of this value by either link:mailto:integrity@purdue.edu[emailing] or by calling 765-494-8778. While information may be submitted anonymously, the more information that is submitted, the greater the opportunity for the university to investigate the concern. - -In TDM 10100/20100/30100/40100/50100, we encourage students to work together. However, there is a difference between good collaboration and academic misconduct. We expect you to read over this list, and you will be held responsible for violating these rules. We are serious about protecting the hard-working students in this course. We want a grade for The Data Mine seminar to have value for everyone and to represent what you truly know. We may punish both the student who cheats and the student who allows or enables another student to cheat. Punishment could include receiving a 0 on a project, receiving an F for the course, and having the incident of academic misconduct reported to the Office of the Dean of Students. - - -*Good Collaboration:* - -* First try the project yourself, on your own. -* After trying the project yourself, then get together with a small group of other students who have also tried the project themselves to discuss ideas for how to do the more difficult problems. Document in the comments section any suggestions you took from your classmates or your TA.
-* Finish the project on your own so that what you turn in truly represents your own understanding of the material. -* Look up potential solutions for how to do part of the project online, but document in the comments section where you found the information. -* If the assignment involves writing a long, worded explanation, you may proofread somebody's completed written work and allow them to proofread your work. Do this only after you have both completed your own assignments, though. - -*Academic Misconduct:* - -* Divide up the problems among a group. (You do #1, I'll do #2, and he'll do #3: then we'll share our work to get the assignment done more quickly.) -* Attend a group work session without having first worked all of the problems yourself. -* Allowing your partners to do all of the work while you copy answers down, or allowing an unprepared partner to copy your answers. -* Letting another student copy your work or doing the work for them. -* Sharing files or typing on somebody else's computer or in their computing account. -* Getting help from a classmate or a TA without documenting that help in the comments section. -* Looking up a potential solution online without documenting that help in the comments section. -* Reading someone else's answers before you have completed your work. -* Having a tutor or TA work through all (or some) of your problems for you. -* Uploading, downloading, or using old course materials from Course Hero, Chegg, or similar sites. -* Using the same outside event reflection (or parts of it) more than once. Using an outside event reflection from a previous semester. -* Using somebody else's outside event reflection rather than attending the event yourself. - - -The link:https://www.purdue.edu/odos/osrr/honor-pledge/about.html[Purdue Honor Pledge] reads: "As a boilermaker pursuing academic excellence, I pledge to be honest and true in all that I do. Accountable together - we are Purdue." - -Please refer to the link:https://www.purdue.edu/odos/osrr/academic-integrity/index.html[student guide for academic integrity] for more details. - -=== xref:fall2023/logistics/syllabus_purdue_policies.adoc[Purdue Policies & Resources] - -=== Disclaimer -This syllabus is subject to small changes. All questions and feedback are always welcome! diff --git a/projects-appendix/modules/ROOT/pages/index.adoc b/projects-appendix/modules/ROOT/pages/index.adoc deleted file mode 100644 index b3e1fa588..000000000 --- a/projects-appendix/modules/ROOT/pages/index.adoc +++ /dev/null @@ -1,61 +0,0 @@ -= TDM Course Overview - -This page provides a high-level overview of the TDM 100, 200, 300, and 400 level courses. Together, these courses make up The Data Mine's seminar offering for students. - -[IMPORTANT] -==== -The Data Mine is always incorporating new feedback and changing our offerings to fit students' needs. - -As part of that, the focus of the courses below may change in the future. -==== - -== TDM 101/102 - -The 100 level courses serve as an introduction to two of the core coding languages in analytics, Python and R. Students will learn about the basic implementation of the coding languages as well as how to apply them to core skills in analytics and data science. - -The course serves as a great introduction to coding languages and core data analytics topics.
- -Potential topics include: - -* Introduction to core packages (Pandas, Matplotlib, Numpy, R-Shiny) -* Defining and working with functions -* Processes for data manipulation -* Introduction to data analysis - -== TDM 201/202 - -Building on the coding skills learned in the 100 level courses, the 200 level takes Python and R and discusses how they are used in a high-performance computing (HPC) environment. Students will leverage their core coding skills to learn about topics like code optimization, web scraping, and utilizing GPUs. - -This course is great to build experience working in Python and R in a research environment. - -Potential topics include: - -* Web scraping -* Data visualization -* Code optimization -* Containerization - -== TDM 301/302 - -TDM 300 allows students to take a deeper dive into a specific type of predictive model. The Data Mine is starting with a deep dive into neural nets, but we plan to build out other modeling topics in the future. - -The 300-level course is a great opportunity for students to spend an extended amount of time taking a deep dive into a specific technique. - -*Experience:* This course assumes that you have a background in Python, R, or both coding languages. - -Potential topics include: - -* Introduction to neural networks -* Hyperparameter tuning for deep learning -* CNN's and computer vision -* Ethics in neural networks - -== TDM 401/402 - -The Data Mine's highest current course offering, TDM 400 is an extension of the 300-level course with additional questions focused on understanding and applying the technique of focus. - -TDM 400 is an opportunity for advanced students to drive their own research into the application of predictive algorithms. - -*Experience:* This course assumes that you have a background in Python, R, or both coding languages. - -The current implementation of the 400-level course is a deeper dive of the content in the 300-level course. \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/kernels.adoc b/projects-appendix/modules/ROOT/pages/kernels.adoc deleted file mode 100644 index ea634991a..000000000 --- a/projects-appendix/modules/ROOT/pages/kernels.adoc +++ /dev/null @@ -1,126 +0,0 @@ -= Kernels - -Most of the time, Jupyter Lab will be used with the `seminar` or `seminar-r` kernel. By default, the `seminar` kernel runs Python code and `seminar-r` kernel runs R code. To run other types of code, see below. Any format or template related questions should be asked in Piazza. - -== Running `Python` code using the `seminar` kernel - -[source,python] ----- -import pandas as pd -myDF = pd.read_csv("/anvil/projects/tdm/data/flights/subset/airports.csv") -myDF.head() ----- - - -++++ - -++++ - -== Running `R` code using the `seminar` kernel or the `seminar-r` kernel - -Using the `seminar` kernel with R, it is necessary to use the `%%R` cell magic: - -[source,R] ----- -%%R - -myDF <- read.csv("/anvil/projects/tdm/data/flights/subset/airports.csv") -head(myDF) ----- - -Using the `seminar-r` kernel with R, it is NOT necessary to use the `%%R` cell magic: - -[source,R] ----- -myDF <- read.csv("/anvil/projects/tdm/data/flights/subset/airports.csv") -head(myDF) ----- - -++++ - -++++ - -As you can see, any cell that begins with `%%R` and uses the `seminar` kernel will run the R code in that cell. Alternatively, using the `seminar-r` kernel, it is possible to run R code without using the `%%R` cell magic. - -== Running SQL queries using the `seminar` kernel - -. 
First, you need to establish a connection with the database. If this is a sqlite database, you can use the following command. -+ -[source,ipython] ----- -%sql sqlite:///my_db.db -# or -%sql sqlite:////absolute/path/to/my_db.db -# like this -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- -+ -Otherwise, if this is a mysql database, you can use the following command. -+ -[source,ipython] ----- -%sql mariadb+pymysql://username:password@my_url.com/my_database ----- -+ -. Next, we can run SQL queries, in a new cell, as shown with the following example, in which we show the first 5 lines of the `titles` table. -+ -[source,ipython] ----- -%%sql - -SELECT * FROM titles LIMIT 5; ----- - -++++ - -++++ - -As you can see, any cell that begins with `%%sql` will run the SQL query in that cell. If a cell does not begin with `%%sql`, it will be assumed that the code is Python code, and run accordingly. - -== Running `bash` code using the `seminar` kernel - -To run `bash` code, in a new cell, run the following. - -[source,bash] ----- -%%bash - -head /anvil/projects/tdm/data/flights/subset/airports.csv ----- - -++++ - -++++ - -As you can see, any cell that begins with `%%bash` will run the `bash` code in that cell. If a cell does not begin with `%%bash`, it will be assumed that the code is Python code, and run accordingly. - -[TIP] -==== -Code cells that start with `%` or `%%` are sometimes referred to as magic cells. To see a list of available magics, run `%lsmagic` in a cell. - -The commands listed in the "cell" section are run with a double `%%` and apply to the entire cell, rather than just a single line. For example, `%%bash` is an example of a cell magic. - -You can read more about some of the available magics in the https://ipython.readthedocs.io/en/stable/interactive/magics.html#[official documentation]. -==== - -== Including an image in your notebook - -To include an image in your notebook, use the following Python code. - -[source,python] ----- -from IPython import display -display.Image("/anvil/projects/tdm/data/images/woodstock.png") ----- - -Here, `/anvil/projects/tdm/data/images/woodstock.png` is the path to the image you would like to include. - -++++ - -++++ - - -[IMPORTANT] -==== -If you choose to include an image using a Markdown cell, and the `![](...)` syntax, please note that while the notebook will render properly in our https://ondemand.anvil.rcac.purdue.edu environment, it will _not_ load properly in any other environment where that image is not available. For this reason it is critical to include images using the method shown here. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/progress-table.adoc b/projects-appendix/modules/ROOT/pages/progress-table.adoc deleted file mode 100644 index f47fa6e5c..000000000 --- a/projects-appendix/modules/ROOT/pages/progress-table.adoc +++ /dev/null @@ -1,80 +0,0 @@ -// copy/paste these for project status as needed -// Incomplete {set:cellbgcolor:#e03b24} -// Team Review {set:cellbgcolor:#ffcc00} -// Final Review {set:cellbgcolor:#64a338} - -## TDM 101 -|=== -| Project Name {set:cellbgcolor:} | Completion Deadline | Status -| Project 1 {set:cellbgcolor:} | 2024-05-08 | Final Review {set:cellbgcolor:#64a338} -| Project 2 {set:cellbgcolor:} | 2024-05-10 | Final Review {set:cellbgcolor:#64a338} -| Project 3 {set:cellbgcolor:} | 2024-05-14 | Final Review {set:cellbgcolor:#64a338} -| Project 4 {set:cellbgcolor:} | 2024-05-16 | Final Review {set:cellbgcolor:#64a338} -| Project 5 {set:cellbgcolor:} | 2024-05-20 | Final Review {set:cellbgcolor:#64a338} -| Project 6 {set:cellbgcolor:} | 2024-05-22 | Final Review {set:cellbgcolor:#64a338} -| Project 7 {set:cellbgcolor:} | 2024-05-24 | Final Review {set:cellbgcolor:#64a338} -| Project 8 {set:cellbgcolor:} | 2024-05-30 | Final Review {set:cellbgcolor:#64a338} -| Project 9 {set:cellbgcolor:} | 2024-06-03 | Final Review {set:cellbgcolor:#64a338} -| Project 10 {set:cellbgcolor:} | 2024-06-05 | Final Review {set:cellbgcolor:#64a338} -| Project 11 {set:cellbgcolor:} | 2024-06-07 | Final Review {set:cellbgcolor:#64a338} -| Project 12 {set:cellbgcolor:} | 2024-06-11 | Final Review {set:cellbgcolor:#64a338} -| Project 13 {set:cellbgcolor:} | 2024-06-13 | Final Review {set:cellbgcolor:#64a338} -| Project 14 {set:cellbgcolor:} | 2024-06-18 | Final Review {set:cellbgcolor:#64a338} -|=== - -## TDM 201 -|=== -| Project Name {set:cellbgcolor:} | Completion Deadline | Status -| Project 1 {set:cellbgcolor:} | 2024-07-08 | Final Review {set:cellbgcolor:#64a338} -| Project 2 {set:cellbgcolor:} | 2024-07-11 | Final Review {set:cellbgcolor:#64a338} -| Project 3 {set:cellbgcolor:} | 2024-07-16 | Final Review {set:cellbgcolor:#64a338} -| Project 4 {set:cellbgcolor:} | 2024-07-19 | Final Review {set:cellbgcolor:#64a338} -| Project 5 {set:cellbgcolor:} | 2024-07-22 | Team Review {set:cellbgcolor:#ffcc00} -| Project 6 {set:cellbgcolor:} | 2024-07-23 | Incomplete {set:cellbgcolor:#e03b24} -| Project 7 {set:cellbgcolor:} | 2024-07-26 | Incomplete {set:cellbgcolor:#e03b24} -| Project 8 {set:cellbgcolor:} | 2024-07-30 | Incomplete {set:cellbgcolor:#e03b24} -| Project 9 {set:cellbgcolor:} | 2024-08-07 | Incomplete {set:cellbgcolor:#e03b24} -| Project 10 {set:cellbgcolor:} | 2024-08-12 | Incomplete {set:cellbgcolor:#e03b24} -| Project 11 {set:cellbgcolor:} | 2024-08-15 | Incomplete {set:cellbgcolor:#e03b24} -| Project 12 {set:cellbgcolor:} | 2024-08-20 | Incomplete {set:cellbgcolor:#e03b24} -| Project 13 {set:cellbgcolor:} | 2024-08-23 | Incomplete {set:cellbgcolor:#e03b24} -| Project 14 {set:cellbgcolor:} | 2024-08-23 | Incomplete {set:cellbgcolor:#e03b24} -|=== - -## TDM 301 -|=== -| Project Name {set:cellbgcolor:} | Completion Deadline | Status -| Project 1 {set:cellbgcolor:} | 2024-07-08 | Final Review {set:cellbgcolor:#64a338} -| Project 2 {set:cellbgcolor:} | 2024-07-08 | Final Review {set:cellbgcolor:#64a338} -| Project 3 {set:cellbgcolor:} | 2024-07-15 | Final Review {set:cellbgcolor:#64a338} -| Project 4 {set:cellbgcolor:} | 2024-07-15 | Final Review {set:cellbgcolor:#64a338} -| Project 5 {set:cellbgcolor:} | 2024-07-22 | Final Review 
{set:cellbgcolor:#64a338} -| Project 6 {set:cellbgcolor:} | 2024-07-22 | Final Review {set:cellbgcolor:#64a338} -| Project 7 {set:cellbgcolor:} | 2024-07-29 | Final Review {set:cellbgcolor:#64a338} -| Project 8 {set:cellbgcolor:} | 2024-07-29 | Final Review {set:cellbgcolor:#64a338} -| Project 9 {set:cellbgcolor:} | 2024-08-05 | Team Review {set:cellbgcolor:#ffcc00} -| Project 10 {set:cellbgcolor:} | 2024-08-05 | Incomplete {set:cellbgcolor:#e03b24} -| Project 11 {set:cellbgcolor:} | 2024-08-12 | Incomplete {set:cellbgcolor:#e03b24} -| Project 12 {set:cellbgcolor:} | 2024-08-12 | Incomplete {set:cellbgcolor:#e03b24} -| Project 13 {set:cellbgcolor:} | 2024-08-19 | Incomplete {set:cellbgcolor:#e03b24} -| Project 14 {set:cellbgcolor:} | 2024-08-19 | Incomplete {set:cellbgcolor:#e03b24} -|=== - -## TDM 401 -|=== -| Project Name {set:cellbgcolor:} | Completion Deadline | Status -| Project 1 {set:cellbgcolor:} | 2024-07-08 | Final Review {set:cellbgcolor:#64a338} -| Project 2 {set:cellbgcolor:} | 2024-07-08 | Final Review {set:cellbgcolor:#64a338} -| Project 3 {set:cellbgcolor:} | 2024-07-15 | Final Review {set:cellbgcolor:#64a338} -| Project 4 {set:cellbgcolor:} | 2024-07-15 | Final Review {set:cellbgcolor:#64a338} -| Project 5 {set:cellbgcolor:} | 2024-07-22 | Final Review {set:cellbgcolor:#64a338} -| Project 6 {set:cellbgcolor:} | 2024-07-22 | Final Review {set:cellbgcolor:#64a338} -| Project 7 {set:cellbgcolor:} | 2024-07-29 | Final Review {set:cellbgcolor:#64a338} -| Project 8 {set:cellbgcolor:} | 2024-07-29 | Final Review {set:cellbgcolor:#64a338} -| Project 9 {set:cellbgcolor:} | 2024-08-05 | Team Review {set:cellbgcolor:#ffcc00} -| Project 10 {set:cellbgcolor:} | 2024-08-05 | Incomplete {set:cellbgcolor:#e03b24} -| Project 11 {set:cellbgcolor:} | 2024-08-12 | Incomplete {set:cellbgcolor:#e03b24} -| Project 12 {set:cellbgcolor:} | 2024-08-12 | Incomplete {set:cellbgcolor:#e03b24} -| Project 13 {set:cellbgcolor:} | 2024-08-19 | Incomplete {set:cellbgcolor:#e03b24} -| Project 14 {set:cellbgcolor:} | 2024-08-19 | Incomplete {set:cellbgcolor:#e03b24} -|=== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project01.adoc b/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project01.adoc deleted file mode 100644 index 5b0e9279c..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project01.adoc +++ /dev/null @@ -1,203 +0,0 @@ -= STAT 19000: Project 1 -- Spring 2021 - -**Motivation:** In this course we require the majority of project submissions to include a compiled PDF, a .Rmd file based off of https://raw.githubusercontent.com/TheDataMine/the-examples-book/master/files/project_template.Rmd[our template], and a code file (a .R file if the project is in R, a .py file if the project is in Python). Although RStudio makes it easy to work with both Python and R, there are occasions where working out a Python problem in a Jupyter Notebook could be convenient. For that reason, we will introduce Jupyter Notebook in this project. - -**Context:** This is the first in a series of projects that will introduce Python and its tooling to students. - -**Scope:** jupyter notebooks, rstudio, python - -.Learning objectives -**** -- Use Jupyter Notebook to run Python code and create Markdown text. -- Use RStudio to run Python code and compile your final PDF. -- Gain exposure to Python control flow and reading external data. 
-**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/open_food_facts/openfoodfacts.tsv` - -== Questions - -=== Question 1 - -Navigate to https://notebook.scholar.rcac.purdue.edu/ and sign in with your Purdue credentials (_without_ BoilerKey). This is an instance of Jupyter Notebook. The main screen will show a series of files and folders that are in your `$HOME` directory. Create a new notebook by clicking on `New > f2020-s2021`. - -Change the name of your notebook to "LASTNAME_FIRSTNAME_project01" where "LASTNAME" is your family name, and "FIRSTNAME" is your given name. Try to export your notebook (using the `File` dropdown menu, choosing the option `Download as`), what format options (for example, `.pdf`) are available to you? - -[IMPORTANT] -==== -`f2020-s2021` is the name of our course notebook kernel. A notebook kernel is an engine that runs code in a notebook. ipython kernels run Python code. `f2020-s2021` is an ipython kernel that we've created for our course Python environment, which contains a variety of compatible, pre-installed packages for you to use. When you select `f2020-s2021` as your kernel, all of the packages in our course environment are automatically made available to you. -==== - -++++ - -++++ - -If the kernel `f2020-s2021` does not appear in Jupyter Notebooks, you can make it appear as follows: - -* Login to https://rstudio.scholar.rcac.purdue.edu -* Click on `Tools > Shell...` (in the menu) -* In the shell (terminal looking thing that should say something like: `bash-4.2$`), type the following followed by Enter/Return: `/class/datamine/apps/runme` -* Then click on `Session > Restart R` (in the menu) -You should now have access to the course kernel named `f2020-s2021` in https://notebook.scholar.rcac.purdue.edu - -.Items to submit -==== -- A list of export format options. -==== - -=== Question 2 - -Each "box" in a Jupyter Notebook is called a _cell_. There are two primary types of cells: code, and markdown. By default, a cell will be a code cell. Place the following Python code inside the first cell, and run the cell. What is the output? - -[source,python] ----- -from thedatamine import hello_datamine -hello_datamine() ----- - -[TIP] -==== -You can run the code in the currently selected cell by using the GUI (the buttons), as well as by pressing `Ctrl+Return/Enter`. -==== - -.Items to submit -==== -- Output from running the provided code. -==== - -=== Question 3 - -Jupyter Notebooks allow you to easily pull up documentation, similar to `?function` in R. To do so, use the `help` function, like this: `help(my_function)`. What is the output from running the help function on `hello_datamine`? Can you modify the code from question (2) to print a customized message? Create a new _markdown_ cell and explain what you did to the code from question (2) to make the message customized. - -[IMPORTANT] -==== -Some Jupyter-only methods to do this are: - -- Click on the function of interest and type `Shift+Tab` or `Shift+Tab+Tab`. -- Run `function?`, for example, `print?`. -==== - -[IMPORTANT] -==== -You can also see the source code of a function in a Jupyter Notebook by typing `function??`, for example, `print??`. -==== - -.Items to submit -==== -- Output from running the `help` function on `hello_datamine`. 
-- Modified code from question (2) that prints a customized message. -==== - -=== Question 4 - -At this point in time, you've now got the basics of running Python code in Jupyter Notebooks. There is really not a whole lot more to it. For this class, however, we will continue to create RMarkdown documents in addition to the compiled PDFs. You are welcome to use Jupyter Notebooks for personal projects or for testing things out, however, we will still require an RMarkdown file (.Rmd), PDF (generated from the RMarkdown file), and .py file (containing your python code). For example, please move your solutions from Questions 1, 2, 3 from Jupyter Notebooks over to RMarkdown (we discuss RMarkdown below). Let's learn how to run Python code chunks in RMarkdown. - -Sign in to https://rstudio.scholar.rcac.purdue.edu (_with_ BoilerKey). Projects in The Data Mine should all be submitted using our template found https://raw.githubusercontent.com/TheDataMine/the-examples-book/master/files/project_template.Rmd[here] or on Scholar (`/class/datamine/apps/templates/project_template.Rmd`). - -Open the project template and save it into your home directory, in a new RMarkdown file named `project01.Rmd`. Prior to running any Python code, run `datamine_py()` in the R console, just like you did at the beginning of every project from the first semester. - -Code chunks are parts of the RMarkdown file that contains code. You can identify what type of code a code chunk contains by looking at the _engine_ in the curly braces "{" and "}". As you can see, it is possible to mix and match different languages just by changing the engine. Move the solutions for questions 1-3 to your `project01.Rmd`. Make sure to place all Python code in `python` code chunks. Run the `python` code chunks to ensure you get the same results as you got when running the Python code in a Jupyter Notebook. - -[IMPORTANT] -==== -Make sure to run `datamine_py()` in the R console prior to attempting to run any Python code. -==== - -[TIP] -==== -The end result of the `project01.Rmd` should look _similar_ to https://raw.githubusercontent.com/TheDataMine/the-examples-book/master/files/example02.Rmd[this]. -==== - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- `project01.Rmd` with the solutions from questions 1-3 (including any Python code in `python` code chunks). -==== - -=== Question 5 - -It is not a Data Mine project without data! [Here] (#p-csv-pkg) are some examples of reading in data line by line using the `csv` package. How many columns are in the following dataset: `/class/datamine/data/open_food_facts/openfoodfacts.tsv`? Print the first row, the number of columns, and then exit the loop after the first iteration using the `break` keyword. - -[TIP] -==== -You can get the number of elements in a list by using the `len` method. For example: `len(my_list)`. -==== - -[TIP] -==== -You can use the `break` keyword to exit a loop. As soon as `break` is executed, the loop is exited and the code immediately following the loop is run. -==== - -[source,python] ----- -for my_row in my_csv_reader: - print(my_row) - break -print("Exited loop as soon as 'break' was run.") ----- - -[TIP] -==== -`'\t'` represents a tab in Python. -==== - -++++ - -++++ - -[IMPORTANT] -==== -If you get a Dtype warning, feel free to just ignore it. -==== - -.Items to submit -==== -- Python code used to solve this problem. -- The first row printed, and the number of columns printed. 
-==== - -=== Question 6 (OPTIONAL) - -Unlike in R, where many of the tools you need are built-in (`read.csv`, data.frames, etc.), in Python, you will need to rely on packages like `numpy` and `pandas` to do the bulk of your data science work. {#p1-06} - -In R it would be really easy to find the mean of the 151st column, `caffeine_100g`: - -[source,r] ----- -myDF <- read.csv("/class/datamine/data/open_food_facts/openfoodfacts.tsv", sep="\t", quote="") -mean(myDF$caffeine_100g, na.rm=T) # 2.075503 ----- - -If you were to try to modify our loop from question (5) to do the same thing, you will run into a myriad of issues, just to try and get the mean of a column. Luckily, it is easy to do using `pandas`: - -[source,python] ----- -import pandas as pd -myDF = pd.read_csv("/class/datamine/data/open_food_facts/openfoodfacts.tsv", sep="\t") -myDF["caffeine_100g"].mean() # 2.0755028571428573 ----- - -Take a look at some of the methods you can perform using pandas https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats]. Perform an interesting calculation in R, and replicate your work using `pandas`. Which did you prefer, Python or R? - -++++ - -++++ - -.Items to submit -==== -- R code used to solve the problem. -- Python code used to solve the problem. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project02.adoc b/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project02.adoc deleted file mode 100644 index 0007c1737..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project02.adoc +++ /dev/null @@ -1,152 +0,0 @@ -= STAT 19000: Project 2 -- Spring 2021 - -**Motivation:** In Python it is very important to understand some of the data types in a little bit more depth than you would in R. Many of the data types in Python will seem very familiar. A `character` in R is similar to a `str` in Python. An `integer` in R is an `int` in Python. A `numeric` in R is similar to a `float` in Python. A `logical` in R is similar to a `bool` in Python. In addition to all of that, there are some very popular classes that packages like `numpy` and `pandas` introduces. On the other hand, there are some data types in Python like `tuple`s, `list`s, `set`s, and `dict`s that diverge from R a little bit more. It is integral to understand some basic concepts before jumping too far into everything. - -**Context:** This is the second project introducing some basic data types, and demonstrating some familiar control flow concepts, all while digging right into a dataset. - -**Scope:** tuples, lists, if statements, opening files - -.Learning Objectives -**** -- List the differences between lists & tuples and when to use each. -- Gain familiarity with string methods, list methods, and tuple methods. -- Demonstrate the ability to read and write data of various formats using various packages. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/craigslist/vehicles.csv` - -== Questions - -=== Question 1 - -Read in the dataset `/class/datamine/data/craigslist/vehicles.csv` into a `pandas` DataFrame called `myDF`. `pandas` is an integral tool for various data science tasks in Python. 
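For orientation, here is a minimal sketch of loading the file and checking its size, assuming the same course dataset path used in this project (note that the on-disk size of the file and the in-memory size of the DataFrame are different numbers, so both are shown):

[source,python]
----
import os
import pandas as pd

# load the dataset into a DataFrame
myDF = pd.read_csv("/class/datamine/data/craigslist/vehicles.csv")

# size of the file on disk, in megabytes
print(os.path.getsize("/class/datamine/data/craigslist/vehicles.csv") / 1024**2)

# approximate size of the DataFrame in memory, in megabytes
print(myDF.memory_usage(deep=True).sum() / 1024**2)
----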
You can read a quick intro https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html[here]. We will be slowly introducing bits and pieces of this package throughout the semester. Similarly, we will try to introduce byte-sized (ha!) portions of plotting packages to slowly build up your skills. - -*How big is the dataset (in Mb or Gb)?* - -https://mediaspace.itap.purdue.edu/id/1_1bhwhkt2[Click here for video] - -[NOTE] -==== -If you didn't do [optional question 6 in project 1](#p1-06), we would recommend taking a look. -==== - -[TIP] -==== -Remember to check out a question's _relevant topics_. We try very hard to link you to content and examples that will get you up and running as _quickly_ as possible. -==== - -.Items to submit -==== -- Python code used to solve the problem. -==== - -=== Question 2 - -In Question 1, we read our data into a `pandas` DataFrame. Use one of the `pandas` DataFrame https://pandas.pydata.org/docs/reference/frame.html#attributes-and-underlying-data[attributes] to get the number of columns and rows of our dataset. How many columns and rows are there? Use f-strings to print a message, for example: - -```` -There are 123 columns in the DataFrame! -There are 321 rows in the DataFrame! -```` - -In project 1, we learned how to read a csv file in, line-by-line, and print values. Use the `csv` package to print _just_ the first row, which should contain the names of the columns, OR instead of using the `csv` package, use one of the `pandas` attributes from `myDF` (to print the column names). - -https://mediaspace.itap.purdue.edu/id/1_cifzobbk[Click here for video] - -.Items to submit -==== -- The output from printing the f-strings. -- Python code used to solve the problem. -==== - -=== Question 3 - -Use the `csv` or `pandas` package to get a xref:programming-languages:python:lists.adoc[list] called `our_columns` that contains the column names. Add a string, "extra", to the end of `our_columns`. Print the second value in the list. Without using a loop, print the 1st, 3rd, 5th, etc. elements of the list. Print the last four elements of the list ( "state", "lat", "long", and "extra") by accessing their negative index. - -"extra" doesn't belong in our list, you can easily remove this value from our list by doing the following... - -[source,python] ----- -our_columns.pop(25) -# or even this, as pop removes the last value by default -our_columns.pop() ----- - -BUT the problem with this solution is that you must know the index of the value you'd like to remove, and sometimes you do not know the index of the value. Instead, please show how to use a list method to remove "extra" by _value_ rather than by _index_. - -https://mediaspace.itap.purdue.edu/id/1_1z6kxfn1[Click here for video] - -.Items to submit -==== -- Python code used to solve the problem. -- The output from running your code. -==== - -=== Question 4 - -`matplotlib` is one of the primary plotting packages in Python. You are provided with the following code: - -[source,python] ----- -my_values = tuple(myDF.loc[:, 'odometer'].dropna().to_list()) ----- - -The result is a _tuple_ containing the odometer readings from all of the vehicles in our dataset. Create a lineplot of the odometer readings. - -Well, that plot doesn't seem too informative. Let's first sort the values in our tuple: - -[source,python] ----- -my_values.sort() ----- - -What happened? A tuple is immutable. What this means is that once the contents of a tuple are declared they cannot be modified. 
For example: - -[source,python] ----- -# This will fail because tuples are immutable -my_values[0] = 100 ----- - -You can read a good article about this http://www.compciv.org/guides/python/fundamentals/tuples-immutable/[here]. In addition, https://stackoverflow.com/questions/1708510/list-vs-tuple-when-to-use-each[here] is a great post that gives you an idea when using a tuple might be a good idea. Okay, so let's go back to our problem. We know that lists _are_ mutable (and therefore sortable), so convert `my_values` to a list and then sort, and re-plot. - -It looks like there are some (potential) outliers that are making our plot look a little wonky. For the sake of seeing how the plot would look, use negative indexing to plot the sorted values _minus_ the last 50 values (the 50 highest values). New new plot may not look _that_ different, that is okay. - -[TIP] -==== -To prevent plotting values on the same plot, close your plot with the `close` method, for example: -==== - -[source,python] ----- -import matplotlib.pyplot as plt -my_values = [1,2,3,4,5] -plt.plot(my_values) -plt.show() -plt.close() ----- - -.Items to submit -==== -- Python code used to solve the problem. -- The output from running your code. -==== - -=== Question 5 - -We've covered a lot in this project! Use what you've learned so far to do one (or more) of the following tasks: - -- Create a cool graphic using `matplotlib`, that summarizes some data from our dataset. -- Use `pandas` and your investigative skills to sift through the dataset and glean an interesting factoid. -- Create some commented coding examples that highlight the differences between lists and tuples. Include at least 3 examples. - -.Items to submit -==== -- Python code used to solve the problem. -- The output from running your code. diff --git a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project03.adoc b/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project03.adoc deleted file mode 100644 index f22de333b..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project03.adoc +++ /dev/null @@ -1,298 +0,0 @@ -= STAT 19000: Project 3 -- Spring 2021 - -**Motivation:** A dictionary (referred to as a `dict`) is one of the most useful data structures in Python. You can think about them as a data structure containing _key_: _value_ pairs. Under the hood, a `dict` is essentially a data structure called a _hash table_. https://en.wikipedia.org/wiki/Hash_table[Hash tables] are a data structure with a useful set of properties. The time needed for searching, inserting, or removing a piece of data has a constant average lookup time, meaning that no matter how big your hash table grows to be, inserting, searching, or deleting a piece of data will _usually_ take about the same amount of time. (The worst case time increases linearly.) Dictionaries (`dict`) are used a lot, so it is worthwhile to understand them. Although not used quite as often, another important data type called a `set`, is also worthwhile learning about. - -Dictionaries, often referred to as dicts, are really powerful. There are two primary ways to "get" information from a dict. One is to use the `get` method, the other is to use square brackets and strings. 
Test out the following to understand the differences between the two: - -[source,python] ----- -my_dict = {"fruits": ["apple", "orange", "pear"], "person": "John", "vegetables": ["carrots", "peas"]} -# If "person" is indeed a key, they will function the same way -my_dict["person"] -my_dict.get("person") -# If the key does not exist, like below, they will not -# function the same way. -my_dict.get("height") # Returns None when key doesn't exist -print(my_dict.get("height")) # By printing, we can see None in this case -my_dict["height"] # Throws a KeyError exception because the key, "height" doesn't exist ----- - -**Context:** In our third project, we introduce some basic data types, and we demonstrate some familiar control flow concepts, all while digging right into a dataset. Throughout the course, we will slowly introduce concepts from `pandas`, and popular plotting packages. - -**Scope:** dicts, sets, if/else statements, opening files, tuples, lists - -.Learning objectives -**** -- Explain what is a `dict` is and why it is useful. -- Understand how a `set` works and when it could be useful. -- List the differences between lists & tuples and when to use each. -- Gain familiarity with string methods, list methods, and tuple methods. -- Gain familiarity with dict methods. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/craigslist/vehicles.csv` - -== Questions - -=== Question 1 - -In project 2 we learned how to read in data using `pandas`. Read in the (`/class/datamine/data/craigslist/vehicles.csv`) dataset into a DataFrame called `myDF` using `pandas`. In R we can get a sneak peek at the data by doing something like: - -[source,r] ----- -head(myDF) # where myDF is a data.frame ----- - -There is a very similar (and aptly named method) in `pandas` that allows us to do the exact same thing with a `pandas` DataFrame. Get the `head` of `myDF`, and take a moment to consider how much time it would take to get this information if we didn't have this nice `head` method. - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- The `head` of the dataset. -==== - -=== Question 2 - -Dictionaries, often referred to as dicts, are really powerful. There are two primary ways to "get" information from a dict. One is to use the `get` method, the other is to use square brackets and strings. Test out the following to understand the differences between the two: - -[source,python] ----- -my_dict = {"fruits": ["apple", "orange", "pear"], "person": "John", "vegetables": ["carrots", "peas"]} -# If "person" is indeed a key, they will function the same way -my_dict["person"] -my_dict.get("person") -# If the key does not exist, like below, they will not -# function the same way. -my_dict.get("height") # Returns None when key doesn't exist -print(my_dict.get("height")) # By printing, we can see None in this case -my_dict["height"] # Throws a KeyError exception because the key, "height" doesn't exist ----- - -Look at the dataset. Create a dict called `my_dict` that contains key:value pairs where the keys are years, and the values are a single int representing the number of vehicles from that year on craigslist. Use the `year` column, a loop, and a dict to accomplish this. Print the dictionary. You can use the following code to extract the `year` column as a list. 
In the next project we will learn how to loop over `pandas` DataFrames. - -[TIP] -==== -If you get a `KeyError`, remember, you must declare each key value pair just like any other variable. Use the following code to initialize each `year` key to the value 0. - -[source,python] ----- -myyears = myDF['year'].dropna().to_list() -# get a list containing each unique year -unique_years = list(set(myyears)) -# for each year (key), initialize the value (value) to 0 -my_dict = {} -for year in unique_years: - my_dict[year] = 0 ----- - -Here are some of the results you should get: - -[source,python] ----- -print(my_dict[1912]) # 5 -print(my_dict[1982]) # 185 -print(my_dict[2014]) # 31703 ----- -==== - -[NOTE] -==== -There is a special kind of `dict` called a `defaultdict`, that allows you to give default values to a `dict`, giving you the ability to "skip" initialization. We will show you this when we release the solutions to this project! It is not required, but it is interesting to know about! -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- `my_dict` printed. -==== - -=== Question 3 - -After completing question (2) you can easily access the number of vehicles from a given year. For example, to get the number of vehicles on craigslist from 1912, just run: - -[source,python] ----- -my_dict[1912] -# or -my_dict.get(1912) ----- - -A `dict` stores its data in key:value pairs. Identify a "key" from `my_dict`, as well as the associated "value". As you can imagine, having data in this format can be very beneficial. One benefit is the ability to easily create a graphic using `matplotlib`. Use `matplotlib` to create a bar graph with the year on the x-axis, and the number of vehicles from that year on the y-axis. - -[IMPORTANT] -==== -If when you end up seeing something like ``, you should probably end the code chunk with `plt.show()` instead. What is happening is Python is trying to `print` the plot object. That text is the result. To instead display the plot you need to call `plt.show()`. -==== - -[TIP] -==== -To use `matplotlib`, first import it: - -[source,python] ----- -import matplotlib.pyplot as plt -# now you can use it, for example -plt.plot([1,2,3,1]) -plt.show() -plt.close() ----- -==== - -[TIP] -==== -The `keys` method and `values` method from `dict` could be useful here. -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- The resulting plot. -- A sentence giving an example of a "key" and associated "value" from `my_dict` (e.g., a sentence explaining the 1912 example above). -==== - -=== Question 4 - -In the hint in question (2), we used a `set` to quickly get a list of unique years in a list. Some other common uses of a `set` are when you want to get a list of values that are in one list but not another, or get a list of values that are present in both lists. Examine the following code. You'll notice that we are looping over many values. Replace the code for each of the three examples below with code that uses *no* loops whatsoever. - -[source,python] ----- -listA = [1, 2, 3, 4, 5, 6, 12, 12] -listB = [2, 1, 7, 7, 7, 2, 8, 9, 10, 11, 12, 13] -# 1. values in list A but not list B -# values in list A but not list B -onlyA = [] -for valA in listA: - if valA not in listB and valA not in onlyA: - onlyA.append(valA) -print(onlyA) # [3, 4, 5, 6] -# 2. 
values in listB but not list A -onlyB = [] -for valB in listB: - if valB not in listA and valB not in onlyB: - onlyB.append(valB) -print(onlyB) # [7, 8, 9, 10, 11, 13] -# 3. values in both lists -# values in both lists -in_both_lists = [] -for valA in listA: - if valA in listB and valA not in in_both_lists: - in_both_lists.append(valA) -print(in_both_lists) # [1,2,12] ----- - -[TIP] -==== -You should use a `set`. -==== - -[NOTE] -==== -In addition to being easier to read, using a `set` is _much_ faster than loops! -==== - -[NOTE] -==== -A set is a group of values that are unordered, unchangeable, and no duplicate values are allowed. While they aren't used a _lot_, they can be useful for a few common tasks like: removing duplicate values efficiently, efficiently finding values in one group of values that are not in another group of values, etc. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- The output from running the code. -==== - -=== Question 5 - -The value of a dictionary does not have to be a single value (like we've shown so far). It can be _anything_. Observe that there is latitude and longitude data for each row in our DataFrame (`lat` and `long`, respectively). Wouldn't it be useful to be able to quickly "get" pairs of latitude and longitude data for a given state? - -First, run the following code to get a list of tuples where the first value is the `state`, the second value is the `lat`, and the third value is the `long`. - -[source,python] ----- -states_list = list(myDF.loc[:, ["state", "lat", "long"]].dropna().to_records(index=False)) -states_list[0:3] # [('az', 34.4554, -114.269), ('or', 46.1837, -123.824), ('sc', 34.9352, -81.9654)] -# to get the first tuple -states_list[0] # ('az', 34.4554, -114.269) -# to get the first value in the first tuple -states_list[0][0] # az -# to get the second tuple -states_list[1] # ('or', 46.1837, -123.824) -# to get the first value in the second tuple -states_list[1][0] # or ----- - -[TIP] -==== -If you have an issue where you cannot append values to a specific key, make sure to first initialize the specific key to an empty list so the append method is available to use. -==== - -Now, organize the latitude and longitude data in a dictionary called `geoDict` such that each state from the `state` column is a key, and the respective value is a list of tuples, where the first value in each tuple is the latitude (`lat`) and the second value is the longitude (`long`). For example, the first 2 (lat,long) pairs in Indiana (`"in"`) are: - -[source,python] ----- -geoDict.get("in")[0:2] # [(39.0295, -86.8675), (38.8585, -86.4806)] -len(geoDict.get("in")) # 5687 ----- - -++++ - -++++ - -Now that you can easily access latitude and longitude pairs for a given state, run the following code to plot the points for Texas (the `state` value is `"tx"`). Include the the graphic produced below in your solution, but feel free to experiment with other states. - -[NOTE] -==== -You do NOT need to include this portion of Question 5 in your Markdown `.Rmd` file. We cannot get this portion to build in Markdown, but please do include it in your Python `.py` file. 
- -[source,python] ----- -from shapely.geometry import Point -import geopandas as gpd -from geopandas import GeoDataFrame -usa = gpd.read_file('/anvil/projects/tdm/data/boundaries/cb_2018_us_state_20m.shp') -usa.crs = {'init': 'epsg:4269'} -pts = [Point(y,x) for x, y in geoDict.get("tx")] -gdf = gpd.GeoDataFrame(geometry=pts, crs = 4269) -fig, gax = plt.subplots(1, figsize=(10,10)) -base = usa[usa['NAME'].isin(['Hawaii', 'Alaska', 'Puerto Rico']) == False].plot(ax=gax, color='white', edgecolor='black') -gdf.plot(ax=base, color='darkred', marker="*", markersize=10) -plt.show() -plt.close() -# to save to jpg: -plt.savefig('q5.jpg') ----- -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Graphic file (`q5.jpg`) produced for the given state. -==== - -=== Question 6 - -Use your new skills to extract some sort of information from our dataset and create a graphic. This can be as simple or complicated as you are comfortable with! - -.Items to submit -==== -- Python code used to solve the problem. -- The graphic produced using the code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project04.adoc b/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project04.adoc deleted file mode 100644 index 0b2368b2a..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project04.adoc +++ /dev/null @@ -1,294 +0,0 @@ -= STAT 19000: Project 4 -- Spring 2021 - -**Motivation:** We've now been introduced to a variety of core Python data structures. Along the way we've touched on a bit of `pandas`, `matplotlib`, and have utilized some control flow features like for loops and if statements. We will continue to touch on `pandas` and `matplotlib`, but we will take a deeper dive in this project and learn more about control flow, all while digging into the data! - -**Context:** We just finished a project where we were able to see the power of dictionaries and sets. In this project we will take a step back and make sure we are able to really grasp control flow (if/else statements, loops, etc.) in Python. - -**Scope:** python, dicts, lists, if/else statements, for loops - -.Learning objectives -**** -- List the differences between lists & tuples and when to use each. -- Explain what is a dict and why it is useful. -- Demonstrate a working knowledge of control flow in python: if/else statements, while loops, for loops, etc. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/craigslist/vehicles.csv` - -== Questions - -=== Question 1 - -Unlike in R, where traditional loops are rare and typically accomplished via one of the apply functions, in Python, loops are extremely common and important to understand. In Python, any iterator can be looped over. Some common iterators are: tuples, lists, dicts, sets, `pandas` Series, and `pandas` DataFrames. In the previous project we had some examples of looping over lists, let's learn how to loop over `pandas` Series and Dataframes! - -Load up our dataset `/class/datamine/data/craigslist/vehicles.csv` into a DataFrame called `myDF`. 
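For reference, a minimal sketch of looping over DataFrame rows with `iterrows` and `itertuples` is shown below; it assumes `myDF` has been loaded as described above and uses the `state`, `lat`, and `long` columns from this dataset (this is only to illustrate the mechanics, not the required approach for the question):

[source,python]
----
# iterrows yields (index, row) pairs, where each row is a pandas Series
for idx, row in myDF.head(10).iterrows():
    print(row["state"], row["lat"], row["long"])

# itertuples is usually faster; each row is a namedtuple with column attributes
for row in myDF.head(10).itertuples(index=False):
    print(row.state, row.lat, row.long)
----

Keep in mind that even these iterators are slow on a DataFrame with hundreds of thousands of rows, which is part of the point this question makes.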
In project (3), we organized the latitude and longitude data in a dictionary called `geoDict` such that each state from the `state` column is a key, and the respective value is a list of tuples, where the first value in each tuple is the latitude (`lat`) and the second value is the longitude (`long`). Repeat this question, but **do not** use lists, instead use `pandas` to accomplish this. - -[TIP] -==== -The data frame has 435,849 rows, and it takes forever to accomplish this with `pandas`. We just want you to do this one time, to see how slow this is. Try it first with only 10 rows, and then with 100 rows, and once you are sure it is working, try it with (say) 20,000 rows. You do not need to do this with the entire data frame. It takes too long! -==== - -++++ - -++++ - -Here is a video about the new feature to reset your RStudio session if you make a big mistake or if your session is very slow: - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 2 - -Wow! The solution to question (1) was _slow_. In general, you'll want to avoid looping over large DataFrames. https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758[Here] is a pretty good explanation of why, as well as a good system on what to try when computing something. In this case, we could have used indexing to get latitude and longitude values for each state, and would have no need to build this dict. - -The method we learned in Project 3::Question 5 is faster and easier! Just in case you did not solve Project 3::Question 5, here is a fast way to build `geoDict`: - -[source,python] ----- -import pandas as pd -myDF = pd.read_csv("/class/datamine/data/craigslist/vehicles.csv") -states_list = list(myDF.loc[:, ["state", "lat", "long"]].dropna().to_records(index=False)) -geoDict = {} -for mytriple in states_list: - geoDict[mytriple[0]] = [] -for mytriple in states_list: - geoDict[mytriple[0]].append( (mytriple[1],mytriple[2]) ) ----- - -Now we will practice iterating over a dictionary, list, _and_ tuple, all at once! Loop through `geoDict` and use f-strings to print the state abbreviation. Print the first latitude and longitude pair, as well as every 5000th latitude and longitude pair for each state. Round values to the hundreths place. For example, if the state was "pu", and it had 12000 latitude and longitude pairs, we would print the following: - ----- -pu: -Lat: 41.41, Long: 41.41 -Lat: 22.21, Long: 21.21 -Lat: 11.11, Long: 10.22 ----- - -In the above example, `Lat: 41.41, Long: 41.41` would be the 0th pair, `Lat: 22.21, Long: 21.21` would be the 5000th pair, and `Lat: 11.11, Long: 10.22` would be the 10000th pair. Make sure to use f-strings to round the latitude and longitude values to two decimal places. - -There are several ways to solve this question. You can use whatever method is easiest for you, but please be sure (as always) to add comments to explain your method of solution. - -++++ - -++++ - -[TIP] -==== -`Enumerate` is a useful function that adds an index to our loop in Python. -==== - -[TIP] -==== -Using an if statement and the https://www.jquery-az.com/python-modulo/[modulo operator] could be useful. -==== - -[NOTE] -==== -Whenever we have a loop _within_ another loop, the "inner" loop is called a "nested" loop, as it is "nested" inside of the other. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. 
-==== - -=== Question 3 - -We are curious about how the year of the car (`year`) effects the price (`price`). In R, we could get the median price by year easily, using `tapply`: - -[source,r] ----- -tapply(myDF$price, myDF$year, median, na.rm=T) ----- - -Using `pandas`, we would do this: - -[source,python] ----- -res = myDF.groupby(['year'], dropna=True).median() ----- - -These are very convenient functions that do a lot of work for you. If we were to take a look at the median price of cars by year, it would look like: - -[source,python] ----- -import matplotlib.pyplot as plt -res = myDF.groupby(['year'], dropna=True).median()["price"] -plt.bar(res.index, res.values) ----- - -Using the content of the variable `my_list` provided in the code below, calculate the median car price per year without using the `median` function and without using a `sort` function. Use only dictionaries, for loops and if statements. Replicate the plot generated by running the code above (you can use the plot to make sure it looks right). - -[source,python] ----- -my_list = list(myDF.loc[:, ["year", "price",]].dropna().to_records(index=False)) ----- - -++++ - -++++ - -[TIP] -==== -If you do not want to write your own median function to find the median, then it is OK to just use the `getMid` function [found here](#p-median) or to use a median function from elsewhere on the web. Just be sure to cite your source, if you do use a median function that someone else provides or that you use from the internet. There are many small variations on median functions, especially when it comes to (for instance) lists with even length. -==== - -[TIP] -==== -It is also OK to use: `import statistics` and the function `statistics.median` -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- The barplot. -==== - -=== Question 4 - -Now calculate the mean `price` by `year`(still not using pandas code), and create a barplot with the `price` on the y-axis and `year` on the x-axis. Whoa! Something is odd here. Explain what is happening. Modify your code to use an if statement to "weed out" the likely erroneous value. Re-plot your values. - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -[TIP] -==== -It is also OK to use a built-in `mean` function, for instace: `import statistics` and the function `statistics.mean` -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- The barplot. -==== - -=== Question 5 - -List comprehensions are a neat feature of Python that allows for a more concise syntax for smaller loops. While at first they may seem difficult and more confusing, eventually they grow on you. For example, say you wanted to capitalize every `state` in a list full of states: - -[source,python] ----- -my_states = myDF['state'].to_list() -my_states = [state.upper() for state in my_states] ----- - -Or, maybe you wanted to find the average price of cars in "excellent" condition (without `pandas`): - -[source,python] ----- -my_list = list(myDF.loc[:, ["condition", "price",]].dropna().to_records(index=False)) -my_list = [price for (condition, price) in my_list if condition == "excellent"] -sum(my_list)/len(my_list) ----- - -Do the following using list comprehensions, and the provided code: - -[source,python] ----- -my_list = list(myDF.loc[:, ["state", "price",]].dropna().to_records(index=False)) ----- - -- Calculate the average price of vehicles from Indiana (`in`). 
-- Calculate the average price of vehicles from Indiana (`in`), Michigan (`mi`), and Illinois (`il`) combined. - -[source,python] ----- -my_list = list(myDF.loc[:, ["manufacturer", "year", "price",]].dropna().to_records(index=False)) ----- - -- Calculate the average price of a "honda" (`manufacturer`) that is 2010 or newer (`year`). - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 6 - -Let's use a package called `spacy` to try and parse phone numbers out of the `description` column. First, simply loop through and print the text and the label. What is the label of the majority of the phone numbers you can see? - -[source,python] ----- -import spacy -# get list of descriptions -my_list = list(myDF.loc[:, ["description",]].dropna().to_records(index=False)) -my_list = [m[0] for m in my_list] -# load the pre-built spacy model -nlp = spacy.load("en_core_web_lg") -# apply the model to a description -doc = nlp(my_list[0]) -# print the text and label of each "entity" -for entity in doc.ents: - print(entity.text, entity.label_) ----- - -Use an if statement to filter out all entities that are not the label you see. Loop through again and see what our printed data looks like. There is still a lot of data there that we _don't_ want to capture, right? Phone numbers in the US are _usually_ 7 (5555555), 8 (555-5555), 10 (5555555555), 11 (15555555555), 12 (555-555-5555), or 14 (1-555-555-5555) digits. In addition to your first "filter", add another "filter" that keeps only text where the text is one of those lengths. - -That is starting to look better, but there are still some erroneous values. Come up with another "filter", and loop through our data again. Explain what your filter does and make sure that it does a better job on the first 10 documents than when we don't use your filter. - -[NOTE] -==== -If you get an error when trying to knit that talks about "unicode" characters, this is caused by trying to print special characters (non-ascii). An easy fix is just to remove all non-ascii text. You can do this with the `encode` string method. For example: -==== - -Instead of: - -[source,python] ----- -for entity in doc.ents: - print(entity.text, entity.label_) ----- - -Do: - -[source,python] ----- -for entity in doc.ents: - print(entity.text.encode('ascii', errors='ignore'), entity.label_) ----- - -++++ - -++++ - -[NOTE] -==== -It can be fun to utilize machine learning and natural language processing, but that doesn't mean it is always the best solution! We could get rid of all of our filters and use regular expressions with much better results! We will demonstrate this in our solution. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- 1-2 sentences explaining what your filter does. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project05.adoc b/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project05.adoc deleted file mode 100644 index cd00e58a9..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project05.adoc +++ /dev/null @@ -1,190 +0,0 @@ -= STAT 19000: Project 5 -- Spring 2021 - -**Motivation:** Up until this point we've utilized bits and pieces of the `pandas` library to perform various tasks. In this project we will formally introduce `pandas` and `numpy`, and utilize their capabilities to solve data-driven problems. 
- -**Context:** By now you'll have had some limited exposure to `pandas`. This is the first in a three project series that covers some of the main components of both the `numpy` and `pandas` libraries. We will take a two project intermission to learn about functions, and then continue. - -**Scope:** python, pandas, numpy, DataFrames, Series, ndarrays, indexing - -.Learning objectives -**** -- Distinguish the differences between numpy, pandas, DataFrames, Series, and ndarrays. -- Use numpy, scipy, and pandas to solve a variety of data-driven problems. -- Demonstrate the ability to read and write data of various formats using various packages. -- View and access data inside DataFrames, Series, and ndarrays. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/stackoverflow/unprocessed/2018.csv` - -`/class/datamine/data/stackoverflow/unprocessed/2018.parquet` - -`/class/datamine/data/stackoverflow/unprocessed/2018.feather` - -== Questions - -=== Question 1 - -Take a look at the https://pandas.pydata.org/docs/reference/io.html[`pandas` docs]. There are a _lot_ of formats that `pandas` has the ability to read. The most popular formats in this course are: csv (with commas or some other separator), excel, json, or some database. CSV is very prevalent, but it was not designed to work well with large amounts of data. Newer formats like parquet and feather are designed from the ground up to be efficient, and take advantage of special processor instruction set called SIMD. The benefits of using these formats can be significant. Let's do some experiments! - -How much space do each of the following files take up on Scholar: `2018.csv`, `2018.parquet`, and `2018.feather`? How much smaller (as a percentage) is the parquet file than the csv? How much smaller (as a percentage) is the feather file than the csv? Use f-strings to format the percentages. - -Time reading in the following files: `2018.csv`, `2018.parquet`, and `2018.feather`. How much faster (as a percentage) is reading the parquet file than the csv? How much faster (as a percentage) is reading the feather file than the csv? Use f-strings to format the percentages. - -To time a piece of code, you can use the `block-timer` package: - -```{python, eval=F} -from block_timer.timer import Timer -with Timer(title="Using dict to declare a dict") as t1: - my_dict = dict() -with Timer(title="Using {} to declare a dict") as t2: - my_dict = {} -# or if you need more fine-tuned values -print(t1.elapsed) -print(t2.elapsed) -``` - -Read the `2018.csv` file into a `pandas` DataFrame called `my2018`. Time writing the contents of `my2018` to the following files: `2018.csv`, `2018.parquet`, and `2018.feather`. Write the files to your scratch directory: `/scratch/scholar/`, where `` is your username. How much faster (as a percentage) is writing the parquet file than the csv? How much faster (as a percentage) is writing the feather file than the csv? Use f-strings to format the percentages. - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 2 - -A _method_ is just a function associated with an object or class. 
For example, `mean` is just a method of the `pandas` DataFrame: - -[source,python] ----- -# myDF is an object of class DataFrame -# mean is a method of the DataFrame class -myDF.mean() ----- - -In `pandas` there are two main methods used for indexing: https://pandas.pydata.org/docs/user_guide/indexing.html#different-choices-for-indexing[`loc` and `iloc`]. Use the column `Student` and indexing in `pandas` to calculate what percentage of respondents are students and not students. Consider the respondent to be a student if the `Student` column is anything but "No". Create a new DataFrame called `not_students` that is a subset of the original dataset _without_ students. - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 3 - -In `pandas`, if you were to isolate a single column using indexing, like this: - -[source,python] ----- -myDF.loc[:, "Student"] ----- - -The result would be a `pandas` Series. A Series is the 1-dimensional equivalent of a DataFrame. - -[source,python] ----- -type(myDF.loc[:, "Student"]) # pandas.core.series.Series ----- - -`pandas` and `numpy` make it very easy to convert between a Series, ndarray, and list. https://miro.medium.com/max/1400/1*rv1JADavAhDKN4-3iM7phQ.png[Here] is a very useful graphic to highlight how to do this. Look at the `DevType` column in `not_students`. As you can see, a single value may contain a list of semi-colon-separated professions. Create a list with a unique group of all the possible professions. Consider each semi-colon-separated value a profession. How many professions are there? - -It looks like somehow the profession "Student" got in there even though we filtered by the `Student` column. Use `not_students` to get a subset of our data for which the respondents replied "No" to `Student`, yet put "Student" as one of many possible `DevType`s. How many respondents are in that subset? - -[TIP] -==== -If you have a column containing strings in `pandas`, and would like to use string methods on every string in the column, you can use `.str`. For example: - -[source,python] ----- -# this would use the `strip` string method on each value in myColumn, and compare them to '' -# `contains` is another useful string method... -myDF.loc[myDF.loc[:, "myColumn"].str.strip() == '', :] ----- -==== - -[TIP] -==== -See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing[here]. -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- The number of professions there are. -- The number of respondents that replied "No" to `Student`, yet put "Student" as the `DevType`. -==== - -=== Question 4 - -As you can see, while perhaps a bit more strict, indexing in `pandas` is not that much more difficult than indexing in R. While not always necessary, remembering to put ":" to indicate "all columns" or "all rows" makes life easier. In addition, remembering to put parentheses around logical groupings is also a good thing. Practice makes perfect! Randomly select 100 females and 100 males. How many of each sample is in each `Age` category? (_Do not_ use the `sample` method yet, but instead use numeric indexing and `random`) - -```{python} -import random -print(f"A random integer between 1 and 100 is {random.randint(1, 101)}") -``` - -It would be nice to visualize these results. `pandas` Series have some built in methods to create plots. 
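For example, a possible sketch (assuming `females` is the DataFrame holding your 100 randomly selected female respondents, and similarly for `males`) is shown below; `value_counts` is the same method highlighted in the tips that follow.

[source,python]
----
import matplotlib.pyplot as plt

# count the sampled females in each Age category, then draw a bar plot
# using the Series plot method; repeat the same two lines for males
females.loc[:, "Age"].value_counts().plot(kind="bar")
plt.show()
----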
Use this method to generate a bar plot for both females and males. How do they compare? - -[TIP] -==== -You may need to import `matplotlib` in order to display the graphic: - -[source,python] ----- -import matplotlib.pyplot as plt -# female barplot code here -plt.show() -# male barplot code here -plt.show() ----- -==== - -++++ - -++++ - -[TIP] -==== -Once you have your female and male DataFrames, the `value_counts` method found https://pandas.pydata.org/docs/reference/series.html#computations-descriptive-stats[here] may be particularly useful. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -`pandas` really helps out when it comes to working with data in Python. This is a really cool dataset, use your newfound skills to do a mini-analysis. Your mini-analysis should include 1 or more graphics, along with some interesting observation you made while exploring the data. - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- A graphic. -- 1-2 sentences explaining your interesting observation and graphic. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project06.adoc b/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project06.adoc deleted file mode 100644 index 09547611a..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project06.adoc +++ /dev/null @@ -1,101 +0,0 @@ -= STAT 19000: Project 6 -- Spring 2021 - -**Motivation:** Being able to analyze and create good visualizations is a skill that is invaluable in _many_ fields. It can be pretty fun too! In this project, we are going to take a small hiatus from the regular stream of projects to do some data visualizations. - -**Context:** We've been working hard all semester and learning valuable skills. In this project we are going to ask you to examine some plots, write a little bit, and use your creative energies to create good visualizations about the flight data. - -**Scope:** python, r, visualizing data - -.Learning objectives -**** -- Demostrate the ability to create basic graphs with default settings. -- Demonstrate the ability to modify axes labels and titles. -- Demonstrate the ability to customize a plot (color, shape/linetype). -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/flights/*.csv` (all csv files) - -== Questions - -=== Question 1 - -http://stat-computing.org/dataexpo/2009/posters/[Here] are the results from the 2009 Data Expo poster competition. The object of the competition was to visualize interesting information from the flights dataset. Examine all 8 posters and write a single sentence for each poster with your first impression(s). An example of an impression that will not get full credit would be: "My first impression is that this poster is bad and doesn't look organized." An example of an impression that will get full credit would be: "My first impression is that the author had a good visualization-to-text ratio and it seems easy to follow along." - -++++ - -++++ - -.Items to submit -==== -- 8 bullets, each containing a sentence with the first impression of the 8 visualizations. 
Order should be "first place", to "honourable mention", followed by "other posters" in the given order. Or, label which graphic each sentence is about. -==== - -=== Question 2 - -https://www.amazon.com/dp/0985911123/[Creating More Effective Graphs] by Dr. Naomi Robbins and https://www.amazon.com/Elements-Graphing-Data-William-Cleveland/dp/0963488414/ref=sr_1_1?dchild=1&keywords=elements+of+graphing+data&qid=1614013761&sr=8-1[The Elements of Graphing Data] by Dr. William Cleveland at Purdue University, are two excellent books about data visualization. Read the following excerpts from the books (respectively), and list 2 things you learned, or found interesting from _each_ book. - -- https://thedatamine.github.io/the-examples-book/files/CreatingMoreEffectiveGraphs.pdf[Excerpt 1] -- https://thedatamine.github.io/the-examples-book/files/ElementsOfGraphingData.pdf[Excerpt 2] - -++++ - -++++ - -.Items to submit -==== -- Two bullets for each book with items you learned or found interesting. -==== - -=== Question 3 - -Of the 7 posters with at least 3 plots and/or maps, choose 1 poster that you think you could improve upon or "out plot". Create 4 plots/maps that either: - -1. Improve upon a plot from the poster you chose, or -2. Show a completely different plot that does a good job of getting an idea or observation across, or -3. Ruin a plot. Purposefully break the best practices you've learned about in order to make the visualization misleading. (limited to 1 of the 4 plots) - -For each plot/map where you choose to do (1), include 1-2 sentences explaining what exactly you improved upon and how. Point out some of the best practices from the 2 provided texts that you followed. For each plot/map where you choose to do (2), include 1-2 sentences explaining your graphic and outlining the best practices from the 2 texts that you followed. For each plot/map where you choose to do (3), include 1-2 sentences explaining what you changed, what principle it broke, and how it made the plot misleading or worse. - -While we are not asking you to create a poster, please use RMarkdown to keep your plots, code, and text nicely formatted and organized. The more like a story your project reads, the better. You are free to use either R or Python or both to complete this project. Please note that it would be unadvisable to use an interactive plotting package like `plotly`, as these packages will not render plots from within RMarkdown in RStudio. - -Some useful R packages: - -- Base R Plotting: bar, plot, lines, etc. -- https://thedatamine.github.io/the-examples-book/r.html#r-plot_usmap[usmap] -- https://uc-r.github.io/ggplot_intro[ggplot] - -Some useful Python packages: - -- https://thedatamine.github.io/the-examples-book/python.html#p-matplotlib[matplotlib] -- https://plotnine.readthedocs.io/en/stable/#[plotnine] - -++++ - -++++ - -.Items to submit -==== -- All associated R/Python code you used to wrangling the data and create your graphics. -- 4 plots, with at least 4 associated RMarkdown code chunks. -- 1-2 sentences per plot explaining what exactly you improved upon, what best practices from the texts you used, and how. If it is a brand new visualization, describe and explain your graphic, outlining the best practices from the 2 texts that you followed. If it is the ruined plot you chose, explain what you changed, what principle it broke, and how it made the plot misleading or worse. 
-==== - -=== Question 4 - -Now that you've been exploring data visualization, copy, paste, and update your first impressions from question (1) with your updated impressions. Which impression changed the most, and why? - -++++ - -++++ - -.Items to submit -==== -- 8 bullets with updated impressions (still just a sentence or two) from question (1). -- A sentence explaining which impression changed the most and why. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project07.adoc b/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project07.adoc deleted file mode 100644 index 68f461486..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project07.adoc +++ /dev/null @@ -1,288 +0,0 @@ -= STAT 19000: Project 7 -- Spring 2021 - -**Motivation:** There is one pretty major topic that we have yet to explore in Python -- functions! A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code. - -**Context:** We are taking a small hiatus from our `pandas` and `numpy` focused series to learn about and write our own functions in Python! - -**Scope:** python, functions, pandas - -.Learning objectives -**** -- Comprehend what a function is, and the components of a function in Python. -- Differentiate between positional and keyword arguments. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/yelp/data/parquet` - -== Questions - -=== Question 1 - -You've been given a path to a folder for a dataset. Explore the files. Give a brief description of the files and what each file contains. - -[NOTE] -==== -Take a look at the size of each of the files. If you are interested in experimenting, try using `pandas` `read_json` function to read the `yelp_academic_dataset_user.json` file in the json folder `/class/datamine/data/yelp/data/json/yelp_academic_dataset_user.json`. Even with the large amount of memory available to you, this should fail. In order to make it work you would need to use the `chunksize` option to read the data in bit by bit. Now consider that the `reviews.parquet` file is .3gb _larger_ than the `yelp_academic_dataset_user.json` file, but can be read in with no problem. That is seriously impressive! -==== - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- The name of each dataset and a brief summary of each dataset. No more than 1-2 sentences about each dataset. -==== - -=== Question 2 - -Read the `businesses.parquet` file into a `pandas` DataFrame called `businesses`. Take a look to the `hours` and `attributes` columns. If you look closely, you'll observe that both columns contain a lot more than a single feature. In fact, the `attributes` column contains 39 features and the `hours` column contains 7! - -[source,python] ----- -len(businesses.loc[:, "attributes"].iloc[0].keys()) # 39 -len(businesses.loc[:, "hours"].iloc[0].keys()) # 7 ----- - -Let's start by writing a simple function. 
Create a function called `has_attributes` that takes a `business_id` as an argument, and returns `True` if the business has any `attributes` and `False` otherwise. Test it with the following code: - -[source,python] ----- -print(has_attributes('f9NumwFMBDn751xgFiRbNA')) # True -print(has_attributes('XNoUzKckATkOD1hP6vghZg')) # False -print(has_attributes('Yzvjg0SayhoZgCljUJRF9Q')) # True -print(has_attributes('7uYJJpwORUbCirC1mz8n9Q')) # False ----- - -While this is useful to get whether or not a single business has any attributes, if you wanted to apply this function to the entire `attributes` column/Series, you would just use the `notna` method: - -[source,python] ----- -businesses.loc[:, "attributes"].notna() ----- - -[IMPORTANT] -==== -Make sure your return value is of type `bool`. To check this: - -[source,python] ----- -type(True) # bool -type("True") # str ----- -==== - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running the provided "test" code. -==== - -=== Question 3 - -Take a look at the `attributes` of the first business: - -[source,python] ----- -businesses.loc[:, "attributes"].iloc[0] ----- - -What is the type of the value? Let's assume the company you work for gets data formatted like `businesses` each week, but your boss wants the 39 features in `attributes` and the 7 features in `hours` to become their own columns. Write a function called `fix_businesses_data` that accepts an argument called `data_path` (of type `str`) that is a full path to a parquet file that is in the exact same format as `businesses.parquet`. In addition to the `data_path` argument, `fix_businesses_data` should accept another argument called `output_dir` (of type `str`). `output_dir` should contain the path where you want your "fixed" parquet file to output. `fix_businesses_data` should return `None`. - -The result of your function, should be a new file called `new_businesses.parquet` saved in the `output_dir`, the data in this file should no longer contain either the `attributes` or `hours` columns. Instead, each row should contain 39+7 new columns. Test your function out: - -[source,python] ----- -from pathlib import Path -my_username = "kamstut" # replace "kamstut" with YOUR username -fix_businesses_data(data_path="/class/datamine/data/yelp/data/parquet/businesses.parquet", output_dir=f"/scratch/scholar/{my_username}") -# see if output exists -p = Path(f"/scratch/scholar/{my_username}").glob('**/*') -files = [x for x in p if x.is_file()] -print(files) ----- - -[IMPORTANT] -==== -Make sure that either `/scratch/scholar/{my_username}` or `/scratch/scholar/{my_username}/` will work as arguments to `output_dir`. If you use the `pathlib` library, as shown in the provided function "skeleton" below, both will work automatically! -==== - -[source,python] ----- -from pathlib import Path -def fix_businesses_data(data_path: str, output_dir: str) -> None: - """ - fix_data accepts a parquet file that contains data in a specific format. - fix_data "explodes" the attributes and hours columns into 39+7=46 new - columns. - Args: - data_path (str): Full path to a file in the same format as businesses.parquet. - output_dir (str): Path to a directory where new_businesses.parquet should be output. 
- """ - # read in original parquet file - businesses = pd.read_parquet(data_path) - - # unnest the attributes column - - # unnest the hours column - - # output new file - businesses.to_parquet(str(Path(f"{output_dir}").joinpath("new_businesses.parquet"))) - - return None ----- - -++++ - -++++ - -[TIP] -==== -Check out the code below, notice how using `pathlib` handles whether or not we have the trailing `/`. -==== - -[source,python] ----- -from pathlib import Path -print(Path("/class/datamine/data/").joinpath("my_file.txt")) -print(Path("/class/datamine/data").joinpath("my_file.txt")) ----- - -[TIP] -==== -You can test out your function on `/class/datamine/data/yelp/data/parquet/businesses_sample.parquet` to not waste as much time. -==== - -[TIP] -==== -If we were using R and the `tidyverse` package, this sort of behavior is called "unnesting". You can read more about it https://tidyr.tidyverse.org/reference/nest.html[here]. -==== - -[TIP] -==== -https://stackoverflow.com/questions/38231591/splitting-dictionary-list-inside-a-pandas-column-into-separate-columns[This] stackoverflow post should be _very_ useful! Specifically, run this code and take a look at the output: - -[source,python] ----- -businesses -businesses.loc[0:4, "attributes"].apply(pd.Series) ----- -==== - -Notice that some rows have json, and others have `None`: - -[source,python] ----- -businesses.loc[0, "attributes"] # has json -businesses.loc[2, "attributes"] # has None ----- - -This method allows us to handle both cases. If the row has json it converts the values, if it has `None` it just puts each column with a value of `None`. - -[TIP] -==== -https://stackoverflow.com/questions/44723377/pandas-combining-two-dataframes-horizontally[Here] is an example that shows you how to concatenate (combine) dataframes. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 4 - -That's a pretty powerful function, and could definitely be useful. What if, instead of working on just our specifically formatted parquet file, we wrote a function that worked for _any_ `pandas` DataFrame? Write a function called `unnest` that accepts a `pandas` DataFrame as an argument (let's call this argument `myDF`), and a list of columns (let's call this argument `columns`), and returns a DataFrame where the provided columns are unnested. - -++++ - -++++ - -[IMPORTANT] -==== -You may write `unnest` so that the resulting dataframe contains the original dataframe _and_ the unnested columns, or you may return just the unnested columns -- both will be accepted solutions. -==== - -[TIP] -==== -The following should work: - -[source,python] ----- -businesses = pd.read_parquet("/class/datamine/data/yelp/data/parquet/businesses.parquet") -new_businesses_df = unnest(businesses, ["attributes", ]) -new_businesses_df.shape # (209393, 39) -new_businesses_df.head() -new_businesses_df = unnest(businesses, ["attributes", "hours"]) -new_businesses_df.shape # (209393, 46) -new_businesses_df.head() ----- -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running the provided code. -==== - -=== Question 5 - -Try out the code below. If a provided column isn't already nested, the column name is ruined and the data is changed. If the column doesn't already exist, a KeyError is thrown. Modify our function from question (4) to skip unnesting if the column doesn't exist. In addition, modify the function from question (4) to skip the column if the column isn't nested. 
Let's consider a column nested if the value of the column is a `dict`, and not nested otherwise. - -[source,python] ----- -businesses = pd.read_parquet("/class/datamine/data/yelp/data/parquet/businesses.parquet") -new_businesses_df = unnest(businesses, ["doesntexist",]) # KeyError -new_businesses_df = unnest(businesses, ["postal_code",]) # not nested ----- - -To test your code, run the following. The result should be a DataFrame where `attributes` has been unnested, and that is it. - -[source,python] ----- -businesses = pd.read_parquet("/class/datamine/data/yelp/data/parquet/businesses.parquet") -results = unnest(businesses, ["doesntexist", "postal_code", "attributes"]) -results.shape # (209393, 39) -results.head() ----- - -++++ - -++++ - -[TIP] -==== -To see if a variable is a `dict` you could use `type`: - -[source,python] ----- -my_variable = {'key': 'value'} -type(my_variable) ----- -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running the provided code. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project08.adoc b/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project08.adoc deleted file mode 100644 index a2445aeb8..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project08.adoc +++ /dev/null @@ -1,203 +0,0 @@ -= STAT 19000: Project 8 -- Spring 2021 - -**Motivation:** A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code. There are some pretty powerful features of functions that we have yet to explore that aren't necessarily present in R. In this project we will continue to learn about and harness the power of functions to solve data-driven problems. - -**Context:** We are taking a small hiatus from our `pandas` and `numpy` focused series to learn about and write our own functions in Python! - -**Scope:** python, functions, pandas - -.Learning objectives -**** -- Comprehend what a function is, and the components of a function in Python. -- Differentiate between positional and keyword arguments. -- Learn about packing and unpacking variables and arguments. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/yelp/data/parquet` - -== Questions - -=== Question 1 - -The company you work for is assigning you to the task of building out some functions for the new API they've built. Please load these two `pandas` DataFrames: - -[source,python] ----- -users = pd.read_parquet("/class/datamine/data/yelp/data/parquet/users.parquet") -reviews = pd.read_parquet("/class/datamine/data/yelp/data/parquet/reviews.parquet") ----- - -You do **not** need these four DataFrames in this project. 
- -[source,python] ----- -photos = pd.read_parquet("/class/datamine/data/yelp/data/parquet/photos.parquet") -businesses = pd.read_parquet("/class/datamine/data/yelp/data/parquet/businesses.parquet") -checkins = pd.read_parquet("/class/datamine/data/yelp/data/parquet/checkins.parquet") -tips = pd.read_parquet("/class/datamine/data/yelp/data/parquet/tips.parquet") ----- - -You would expect that friends may have a similar taste in restaurants or businesses. Write a function called `get_friends_data` that accepts a `user_id` as an argument, and returns a `pandas` DataFrame with the information in the `users` DataFrame for each friend of `user_id`. Look at the solutions from the previous project, as well as https://docs.python.org/3.8/library/typing.html[this] page. Add type hints for your function. You should have a type hint for our argument, `user_id`, as well as a type hint for the returned data. In addition to type hints, make sure to document your function with a docstring. - -[TIP] -==== -Every function in the solutions for last week's projects has a docstring. You can use this as a reference. -==== - -[TIP] -==== -You should get the same number of friends for the following code: -==== - -++++ - -++++ - -++++ - -++++ - -[source,python] ----- -print(get_friends_data("ntlvfPzc8eglqvk92iDIAw").shape) # (13,22) -print(get_friends_data("AY-laIws3S7YXNl_f_D6rQ").shape) # (1, 22) -print(get_friends_data("xvu8G900tezTzbbfqmTKvA").shape) # (193,22) ----- - -[NOTE] -==== -It is sufficient to just load the first of these three examples, when you Knit your project (to save time during Knitting). -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 2 - -Write a function called `calculate_avg_business_stars` that accepts a `business_id` and returns the average number of stars that business received in reviews. Like in question (1) make sure to add type hints and docstrings. In addition, add comments when and if they are necessary. - -There is a really cool method that gives us the same "powers" that `tapply` gives us in R. Use the `groupby` method from `pandas` to calculate the average stars for all businesses. Index the result to confirm that your `calculate_avg_business_stars` function worked properly. - -[TIP] -==== -You should get the same average number of start value for the following code: - -[source,python] ----- -print(calculate_avg_business_stars("f9NumwFMBDn751xgFiRbNA")) # 3.1025641025641026 ----- -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 3 - -Write a function called `visualize_stars_over_time` that accepts a `business_id` and returns a line plot that shows the average number of stars for each year the business has review data. Like in previous questions, make sure to add type hints and docstrings. In addition, add comments when (and if) necessary. You can test your function with some of these: - -[source,python] ----- -visualize_stars_over_time('RESDUcs7fIiihp38-d6_6g') ----- - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 4 - -Modify question (3), and add an argument called `granularity` that dictates whether the plot will show the average rating over years, or months. `granularity` should accept one of two strings: "years", or "months". By default, if `granularity` isn't specified, it should be "years". 
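One way to sketch this (signature only -- the plotting logic is the same as in question (3)) is to give `granularity` a default value and validate it:

[source,python]
----
def visualize_stars_over_time(business_id: str, granularity: str = "years") -> None:
    """Plot average review stars for a business, grouped by year or by month."""
    if granularity not in ("years", "months"):
        raise ValueError('granularity must be "years" or "months"')
    # group the business's reviews by year or by month here, then plot as in question (3)
----

With a default in place, calls like the one below can either pass `granularity` explicitly or omit it entirely.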
- -[source,python] ----- -visualize_stars_over_time('RESDUcs7fIiihp38-d6_6g', "months") ----- - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -Modify question (4) to accept multiple business_id's, and create a line for each id. Each of the following should work: - -[source,python] ----- -visualize_stars_over_time("RESDUcs7fIiihp38-d6_6g", "4JNXUYY8wbaaDmk3BPzlWw", "months") -visualize_stars_over_time("RESDUcs7fIiihp38-d6_6g", "4JNXUYY8wbaaDmk3BPzlWw", "K7lWdNUhCbcnEvI0NhGewg", "months") -visualize_stars_over_time("RESDUcs7fIiihp38-d6_6g", "4JNXUYY8wbaaDmk3BPzlWw", "K7lWdNUhCbcnEvI0NhGewg", granularity="years") ----- - -[TIP] -==== -Use `plt.show` to decide when to show your complete plot and start anew. -==== - -++++ - -++++ - -[NOTE] -==== -It is sufficient to just load the first of these three examples, when you Knit your project (to save time during Knitting). -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 6 - -After some thought, your boss decided that using the function from question (5) would get pretty tedious when there are a lot of businesses to include in the plot. You disagree. You think there is a way to pass a list of `business_id`s _without_ modifying your function, but rather how you pass the arguments to the function. Demonstrate how to do this with the list provided: - -[source,python] ----- -our_businesses = ["RESDUcs7fIiihp38-d6_6g", "4JNXUYY8wbaaDmk3BPzlWw", "K7lWdNUhCbcnEvI0NhGewg"] -# modify something below to make this work: -visualize_stars_over_time(our_businesses, granularity="years") ----- - -[TIP] -==== -Google "python packing unpacking arguments". -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project09.adoc b/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project09.adoc deleted file mode 100644 index 20c2fbf8e..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project09.adoc +++ /dev/null @@ -1,174 +0,0 @@ -= STAT 19000: Project 9 -- Spring 2021 - -**Motivation:** We've covered a lot of material in a very short amount of time. At this point in time, you have so many powerful tools at your disposal. Last semester in project 14 we used our new skills to build a beer recommendation system. It is pretty generous to call what we built a recommendation system. In the next couple of projects, we will use our Python skills to build a real beer recommendation system! - -**Context:** At this point in the semester we have a solid grasp on Python basics, and are looking to build our skills using the `pandas` and `numpy` packages to build a data-driven recommendation system for beers. - -**Scope:** python, pandas, numpy - -.Learning objectives -**** -- Distinguish the differences between numpy, pandas, DataFrames, Series, and ndarrays. -- Use numpy, scipy, and pandas to solve a variety of data-driven problems. -- Demonstrate the ability to read and write data of various formats using various packages. -- View and access data inside DataFrames, Series, and ndarrays. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/beer` - -Load the following datasets up and assume they are always available: - -[source,python] ----- -beers = pd.read_parquet("/class/datamine/data/beer/beers.parquet") -breweries = pd.read_parquet("/class/datamine/data/beer/breweries.parquet") -reviews = pd.read_parquet("/class/datamine/data/beer/reviews.parquet") ----- - -== Questions - -=== Question 1 - -Write a function called `prepare_data` that accepts an argument called `myDF` that is a `pandas` DataFrame. In addition, `prepare_data` should accept an argument called `min_num_reviews` that is an integer representing the minimum amount of reviews that the user and the beer must have, to be included in the data. The function `prepare_data` should return a `pandas` DataFrame with the following properties: - -First remove all rows where `score` or `username` or `beer_id` is missing, like this: - -[source,python] ----- - myDF = myDF.loc[myDF.loc[:, "score"].notna(), :] - myDF = myDF.loc[myDF.loc[:, "username"].notna(), :] - myDF = myDF.loc[myDF.loc[:, "beer_id"].notna(), :] - myDF.reset_index(drop=True) ----- - -Among the remaining rows, choose the rows of `myDF` that have a user (`username`) and a `beer_id` that each occur at least `min_num_reviews` times in `myDF`. - -[source,python] ----- -train = prepare_data(reviews, 1000) -print(train.shape) # (952105, 10) ----- - -[TIP] -==== -We added two examples of how to do this with the election data (instead of the beer review data) in the book: https://thedatamine.github.io/the-examples-book/python.html#p-reading-and-writing-data[cleaning and filtering data] -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 2 - -Run the function in question (1). Use `train=prepare_data(reviews, 1000)`. The basis of our recommendation system will be to "match" a user to another user will similar taste in beer. Different users will have different means and variances in their scores. If we are going to compare users' scores, we should _standardize_ users' scores. Update the `train` DataFrame with 1 additional column: `standardized_score`. To calculate the `standardized_score`, take each individual score, and subtract off the user's average score and divide that result by the user's score's standard deviation. - -In R, we have the following code: - -[source,r] ----- -myDF <- data.frame(a=c(1,2,3,1,2,3), b=c(6,5,4,5,5,5), c=c(9,9,9,8,8,8)) -myMean = tapply(myDF$b + myDF$c, myDF$a, mean) -myMeanDF = data.frame(a=as.numeric(names(myMean)), mean=myMean) -myDF = merge(myDF, myMeanDF, by='a') ----- - -Or you could also use a _very_ handy package called tidyverse in R to do the same thing: - -[source,r] ----- -library(tidyverse) -myDF <- data.frame(a=c(1,2,3,1,2,3), b=c(6,5,4,5,5,5), c=c(9,9,9,8,8,8)) -myDF %>% - group_by(a) %>% - mutate(d=mean(b+c)) ----- - -Unfortunately, there isn't a _great_ way to do this in Python: - -[source,python] ----- -def summer(data): - data['d'] = (data['b']+data['c']).mean() - return data -myDF = myDF.groupby(["a"]).apply(summer) ----- - -Create a new column `standardized_score`. Calculate the `standardized_score` by taking the score and subtracting the average score, then divide by the standard deviation. 
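Following the same `groupby` pattern shown above, one possible sketch is the function below; it is essentially the `summer` example adapted to scores.

[source,python]
----
def standardize(data):
    # within one user's reviews: subtract that user's mean score and
    # divide by that user's standard deviation
    data['standardized_score'] = (data['score'] - data['score'].mean()) / data['score'].std()
    return data

train = train.groupby(["username"]).apply(standardize)
----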
As it may take a minute or two to create this new column, feel free to test it on a small sample of the reviews DataFrame: - -[source,python] ----- -import pandas as pd -testDF = pd.read_parquet('/class/datamine/data/beer/reviews_sample.parquet') ----- - -[TIP] -==== -Don't forget about the `pandas` DataFrame `std` and `mean` methods. -==== - -[NOTE] -==== -If you are worried about getting `NA`s, do not worry. The only way we would get `NA`s would be if there is only a single review for the user (which we took care of by limiting to users with at least 1000 reviews), or if there is no variance in a user's scores (which doesn't happen). -==== - -[NOTE] -==== -We added an example about how to do this with the election data in the book: https://thedatamine.github.io/the-examples-book/python.html#p-reading-and-writing-data[standardizing data example] -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 3 - -Use the `pivot_table` method from `pandas` to put your `train` data into "wide" format. What this means is that each row in the new DataFrame will be a `username`, and each column will be a `beer_id`. Each cell will contain the `standardized_score` for the given `username` and `beer` combination. Call the resulting DataFrame `score_matrix`. - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output the `head` and `shape` of `score_matrix`. -==== - -=== Question 4 - -The result from question (3) should be a sparse matrix (lots of missing data!). Let's fill in the missing data. For now, let's fill in a beer_id's missing data by filling in every missing value with the average score for the beer. - -[TIP] -==== -The `fillna` method in `pandas` will be very helpful! -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output the `head` of `score_matrix`. -==== - -**Congratulations! Next week, we will complete our recommendation system!** \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project10.adoc b/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project10.adoc deleted file mode 100644 index 579649934..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project10.adoc +++ /dev/null @@ -1,207 +0,0 @@ -= STAT 19000: Project 10 -- Spring 2021 - -**Motivation:** We've covered a lot of material in a very short amount of time. At this point in time, you have so many powerful tools at your disposal. Last semester in project 14 we used our new skills to build a beer recommendation system. It is pretty generous to call what we built a recommendation system. In the next couple of projects, we will use our Python skills to build a real beer recommendation system! - -**Context:** This is the third project in a series of projects designed to learn about the `pandas` and `numpy` packages. In this project we build on to our previous project to finalize our beer recommendation system. - -**Scope:** python, numpy, pandas - -.Learning objectives -**** -- Distinguish the differences between numpy, pandas, DataFrames, Series, and ndarrays. -- Use numpy, scipy, and pandas to solve a variety of data-driven problems. -- Demonstrate the ability to read and write data of various formats using various packages. -- View and access data inside DataFrames, Series, and ndarrays. 
-**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/beer` - -Load the following datasets up and assume they are always available: - -[source,python] ----- -beers = pd.read_parquet("/class/datamine/data/beer/beers.parquet") -breweries = pd.read_parquet("/class/datamine/data/beer/breweries.parquet") -reviews = pd.read_parquet("/class/datamine/data/beer/reviews.parquet") ----- - -=== Project 09 Solution - -Below is the solution for the previous projects, as we'll be using its methods and don't want to leave anybody behind: - -[source,python] ----- -def prepare_data(myDF, min_num_reviews): - - # remove rows where score is na - myDF = myDF.loc[myDF.loc[:, "score"].notna(), :] - # get a list of usernames that have at least min_num_reviews - usernames = myDF.loc[:, "username"].value_counts() >= min_num_reviews - usernames = usernames.loc[usernames].index.values.tolist() - # get a list of beer_ids that have at least min_num_reviews - beerids = myDF.loc[:, "beer_id"].value_counts() >= min_num_reviews - beerids = beerids.loc[beerids].index.values.tolist() - # first remove all rows where the username has less than min_num_reviews - myDF = myDF.loc[myDF.loc[:, "username"].isin(usernames), :] - - # remove rows where the beer_id has less than min_num_reviews - myDF = myDF.loc[myDF.loc[:, "beer_id"].isin(beerids), :] - - return myDF -train = prepare_data(reviews, 1000) ----- - -[source,python] ----- -def mutate_std_score(data: pd.DataFrame) -> pd.DataFrame: - """ - mutate_std_score is a function to use in conjunction with - pd.apply and pd.groupby to create a new column that is - the standardized score. - Args: - data (pd.DataFrame): A pandas DataFrame. - Returns: - pd.DataFrame: A modified pandas DataFrame. - """ - data['standardized_score'] = (data['score'] - data['score'].mean())/data['score'].std() - return data -train = train.groupby(["username"]).apply(mutate_std_score) ----- - -[source,python] ----- -score_matrix = pd.pivot_table(train, values='standardized_score', index='username', columns='beer_id') -print(score_matrix.shape) -score_matrix.head() ----- - -[source,python] ----- -score_matrix = score_matrix.fillna(score_matrix.mean(axis=0)) -score_matrix.head() ----- - -== Questions - -=== Question 1 - -If you struggled or did not do the previous project, or would like to start fresh, please see the solutions to the previous project (will be posted Saturday morning) and feel free to use them as your own. Cosine similarity is a measure of similarity between two non-zero vectors. It is used in a variety of ways in data science. https://towardsdatascience.com/understanding-cosine-similarity-and-its-application-fd42f585296a[Here] is a pretty good article that tries to give some intuition into it. `sklearn` provides us with a function that calculates cosine similarity: - -[source,python] ----- -from sklearn.metrics.pairwise import cosine_similarity ----- - -Use the `cosine_similarity` function on our `score_matrix`. The result will be a `numpy` array. Use the `fill_diagonal` method from `numpy` to fill the diagonals with 0. Convert the array back to a `pandas` DataFrame. Make sure to manually assign the indexes of the new DataFrame to be equal to `score_matrix.index`. Lastly, manually assign the columns to be `score_matrix.index` as well. 
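Putting those steps together, a sketch (assuming `score_matrix` from the previous project is already built) could look like this:

[source,python]
----
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# pairwise cosine similarity between every pair of users (rows of score_matrix)
similarities = cosine_similarity(score_matrix)

# a user is trivially 100% similar to themselves, so zero out the diagonal
np.fill_diagonal(similarities, 0)

# back to a DataFrame, with usernames as both the index and the columns
cosine_similarity_matrix = pd.DataFrame(similarities, index=score_matrix.index, columns=score_matrix.index)
----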
The end result should be a matrix with usernames on both the x and y axes. Each value in the cell represents how "close" one user is to another. Normally the values in the diagonals would be 1 because the same user is 100% similar. To prevent this we forced the diagonals to be 0. Name the final result `cosine_similarity_matrix`. - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- `head` of `cosine_similarity_matrix`. -==== - -=== Question 2 - -Write a function called `get_knn` that accepts the `cosine_similarity_matrix`, a `username`, and a value, `k`. The function `get_knn` should return a `pandas` Series or list containing the usernames of the `k` most similar users to the input `username`. - -[TIP] -==== -This may _sound_ difficult, but it is not. It really only involves sorting some values and grabbing the first `k`. -==== - -Test it on the following; we demonstrate the output if you return a list: - -[source,python] ----- -k_similar=get_knn(cosine_similarity_matrix,"2GOOFY",4) -print(k_similar) # ['Phil-Fresh', 'mishi_d', 'SlightlyGrey', 'MI_beerdrinker'] ----- - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 3 - -Let's test `get_knn` to see if the results make sense. Pick out a user, and the most similar other user. First, get a DataFrame (let's call it `aux`) containing just their reviews. The result should be a DataFrame that looks just like the `reviews` DataFrame, but just contains your users' reviews. - -Next, look at `aux`. Wouldn't it be nice to get a DataFrame where the `beer_id` is the row index, the first column contains the scores for the first user, and the second column contains the scores for the second user? Use the `pivot_table` method to accomplish this, and save the result as `aux`. - -Lastly, use the `dropna` method to remove all rows where at least one of the users has an `NA` value. Sort the values in `aux` using the `sort_values` method. Take a look at the result and write 1-2 sentences explaining whether or not you think the users rated the beers similarly. - -[TIP] -==== -You could also create a scatter plot using the resulting DataFrame. If it is a good match the plot should look like a positive sloping line. -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- 1-2 sentences explaining whether or not you think the users rated the beers similarly. -==== - -=== Question 4 - -We are so close, and things are looking good! The next step for our system, is to write a function that finds recommendations for a given user. Write a function called `recommend_beers`, that accepts three arguments: the `train` DataFrame, a `username`, a `cosine_similarity_matrix`, and `k` (how many neighbors to use). The function `recommend_beers` should return the top 5 recommendations. - -Calculate the recommendations by: - -1. Finding the `k` nearest neighbors of the input `username`. -2. Get a DataFrame with all of the reviews from `train` for every neighbor. Let's call this `aux`. -3. Get a list of all `beer_id` that the user with `username` has reviewed. -4. Remove all beers from `aux` that have already been reviewed by the user with `username`. -5. Group by `beer_id` and calculate the mean `standardized_score`. -6. Sort the results in descending order, and return the top 5 `beer_id`s. 
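For instance, a minimal sketch along those lines (one reasonable implementation among several) is:

[source,python]
----
import pandas as pd

def get_knn(cosine_similarity_matrix: pd.DataFrame, username: str, k: int) -> list:
    """Return the usernames of the k users most similar to username."""
    # take this user's row of similarities, sort from most to least similar,
    # and keep the first k usernames
    row = cosine_similarity_matrix.loc[username, :]
    return row.sort_values(ascending=False).head(k).index.tolist()
----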
- -Test it on the following: - -[source,python] ----- -recommend_beers(train, "22Blue", cosine_similarity_matrix, 30) # [40057, 69522, 22172, 59672, 86487] ----- - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -(optional, 0 pts) Improve our recommendation system! Below are some suggestions, don't feel limited by them: - -- Instead of returning a list of `beer_id`, return the beer info from the `beers` dataset. -- Remove all retired beers. -- Somehow add a cool plot. -- Etc. - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project11.adoc b/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project11.adoc deleted file mode 100644 index 6058b10a3..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project11.adoc +++ /dev/null @@ -1,98 +0,0 @@ -= STAT 19000: Project 11 -- Spring 2021 - -**Motivation:** We've had a pretty intense series of projects recently, and, although you may not have digested everything fully, you _may_ be surprised at how far you've come! What better way to realize this but to take a look at some familiar questions that you've solved in the past in R, and solve them in Python instead? You will (a) have the solutions in R to be able to compare and contrast what you come up with in Python, and (b) be able to fill in any gaps you find you have along the way. - -**Context:** We've just finished a two project series where we built a beer recommendation system using Python. In this project, we are going to take a (hopefully restful) step back and tackle some familiar data wrangling tasks, but in Python instead of R. - -**Scope:** python, r - -.Learning objectives -**** -- Use numpy, scipy, and pandas to solve a variety of data-driven problems. -- Demonstrate the ability to read and write data of various formats using various packages. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/fars` - -== Questions - -=== Question 1 - -The `fars` dataset contains a series of folders labeled by year. In each year folder there is (at least) the files `ACCIDENT.CSV`, `PERSON.CSV`, and `VEHICLE.CSV`. If you take a peek at any `ACCIDENT.CSV` file in any year, you'll notice that the column `YEAR` only contains the last two digits of the year. Add a new `YEAR` column that contains the full year. Use the `pd.concat` function to create a DataFrame called `accidents` that combines the `ACCIDENT.CSV` files from the years 1975 through 1981 (inclusive) into one big dataset. After (or before) creating that `accidents` DataFrame, change the values in the `YEAR` column from two digits to four digits (i.e., paste a 19 onto each year value). - -[TIP] -==== -One way to append strings to every value in a column is to first convert the column to `str` using `astype` and then use the `+` operator, like normal: - -[source,python] ----- -myDF["myCol"].astype(str) + "appending_this_string" ----- -==== - -.Items to submit -==== -- Python code used to solve the problem. -- `head` of the `accidents` dataframe. 
-==== - -=== Question 2 - -Using the new `accidents` data frame that you created in (1), how many accidents are there in which 1 or more drunk drivers were involved in an accident with a school bus? - -[TIP] -==== -Look at the variables `DRUNK_DR` and `SCH_BUS`. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 3 - -Again using the `accidents` data frame: For accidents involving 1 or more drunk drivers and a school bus, how many happened in each of the 7 years? Which year had the largest number of these types of accidents? - -[IMPORTANT] -==== -Does the `groupby` method seem familiar to you? It should! It is extremely similar to `tapply` in R. Typically functions that behave like `tapply` are called something like "groupby" -- R is the oddball this time. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 4 - -Again using the `accidents` data frame: Calculate the mean number of motorists involved in an accident (column `PERSONS`) with `i` drunk drivers (column `DRUNK_DR`), where `i` takes the values from 0 through 6. - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -Break the day into portions, as follows: midnight to 6AM, 6AM to 12 noon, 12 noon to 6PM, 6PM to midnight, other. Find the total number of fatalities that occur during each of these time intervals. Also, find the average number of fatalities per crash that occurs during each of these time intervals. - -[TIP] -==== -You'll want to pay special attention to the `include_lowest` option of `pandas.cut` (similarly to R's `cut`). -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project12.adoc b/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project12.adoc deleted file mode 100644 index 9e69d5ced..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project12.adoc +++ /dev/null @@ -1,214 +0,0 @@ -= STAT 19000: Project 12 -- Spring 2021 - -**Motivation:** We'd be remiss spending almost an entire semester solving data driven problems in python without covering the basics of classes. Whether or not you will ever choose to use this feature in your work, it is best to at least understand some of the basics so you can navigate libraries and other code that does use it. - -**Context:** We've spent nearly the entire semester solving data driven problems in python, and now we are going to learn about one of the primary features in python: classes. Python is an object oriented programming language, and as such, much of python, and the libraries you use in python are objects which have properties and methods. In this project we will explore some of the terminology and syntax relating to classes. - -**Scope:** python - -.Learning objectives -**** -- Explain the basics of object oriented programming, and what a class is. -- Use classes to solve a data-driven problem. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -In this project, and the next, we will learn about classes by simulating a simplified version of Blackjack! 
Don't worry, while this may sound intimidating, there is very little coding involved, and much of the code will be provided to you to use! - -== Questions - -=== Question 1 - -Create a `Card` class with a `number` and a `suit`. The `number` should be any number between 2-10 or any of a 'J', 'Q', 'K', or 'A' (for Jack, Queen, King, or Ace). The `suit` should be any of: "Clubs", "Hearts", "Spades", or "Diamonds". You should initialize a `Card` by first providing the `number` then the `suit`. Make sure that any provided `number` is one of our valid values, if it isn't, throw an exception (that is, stop the function and return a message): - -Here are some examples to test: - -[source,python] ----- -my_card = Card(11, "Hearts") # Exception: Number wasn't 2-10 or J, Q, K, or A. -my_card = Card(10, "Stars") # Suit wasn't one of: clubs, hearts, spades, or diamonds. -my_card = Card("10", "Spades") -my_card = Card("2", "clubs") -my_card = Card("2", "club") # Suit wasn't one of: clubs, hearts, spades, or diamonds. ----- - -[TIP] -==== -To raise an exception, you can do: - -[source,python] ----- -raise Exception("Suit wasn't one of: clubs, hearts, spades, or diamonds.") ----- -==== - -[TIP] -==== -Here is some starter code to fill in: - -[source,python] ----- -class Card: - _value_dict = {"2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8":8, "9":9, "10": 10, "j": 11, "q": 12, "k": 13, "a": 14} - def __init__(self, number, suit): - # if number is not a valid 2-10 or j, q, k, or a - # raise Exception - - # else set the value for self.number to str(self.number) - # if the suit.lower() isn't a valid suit: clubs hearts diamonds spades - # raise Exception - - # else, set the value for self.suit to suit.lower() ----- -==== - -[IMPORTANT] -==== -Accept both upper and lowercase variants for both `suit` and `number`. To do this, convert any input to lowercase prior to processing/saving. For `number`, you can do `str(num).lower()`. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 2 - -Usually when we talk about a particular card, we say it is a "Four of Spades" or "King of Hearts", etc. Right now, if you `print(my_card)` you will get something like `<__main__.Card object at 0x7fccd0523208>`. Not very useful. We have a https://docs.python.org/3/reference/datamodel.html?highlight=\\__str__#object.\\__str__[dunder method] for that! - -Implement the `\\__str__` dunder method to work like this: - -[source,python] ----- -print(Card("10", "Spades")) # 10 of spades -print(Card("2", "clubs")) # 2 of clubs ----- - -Another, closely related dunder method is called https://docs.python.org/3/reference/datamodel.html?highlight=\\__str__#object.\\__repr__[`\\__repr__`] -- short for representation. This is similar to `\\__str__` in that it should print the code used to create the object being printed. So for our examples: - -[source,python] ----- -repr(Card("10", "Spades")) # Card(str(10), "spades") -repr(Card("2", "clubs")) # Card(str(2), "clubs") ----- - -Implement both dunder methods to function as exemplified. - -[TIP] -==== -https://medium.com/python-features/magic-methods-demystified-3c9e93144bf7[This] article has examples of both `\\__str__` and `\\__repr__`. -==== - - - - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 3 - -It is natural that we should be able to compare cards, after all thats necessary to play nearly any game. 
Typically there are two ways to "sort" cards. Ace high, or ace low. Ace high is when the Ace represents the highest card. - -Implement the https://docs.python.org/3/reference/datamodel.html?highlight=\\__str__#object.\\__lt__[following] dunder methods to enable comparison of cards where ace is high: - -- `\\__eq__` -- `\\__lt__` -- `\\__gt__` - -Make sure the following examples work: - -```{python, eval=F} -card1 = Card(2, "spades") -card2 = Card(3, "hearts") -card3 = Card(3, "diamonds") -card4 = Card(3, "Hearts") -card5 = Card("A", "Spades") -card6 = Card("A", "Hearts") -card7 = Card("K", "Diamonds") -print(card1 < card2) # True -print(card1 < card3) # True -print(card2 == card3) # True -print(card2 == card4) # True -print(card3 < card4) # False -print(card4 < card3) # False -print(card5 > card4) # True -print(card5 > card6) # False -print(card5 == card6) # True -print(card7 < card5) # True -print(card7 > card1) # True -``` - -[IMPORTANT] -==== -Two cards are deemed equal if they have the same number, regardless of their suits. -==== - -[TIP] -==== -There are many ways to deal with comparing the "JKQA" against other numbers. One possibility is to have a dict that maps the value of the card to it's numeric value. -==== - -[TIP] -==== -https://www.tutorialspoint.com/How-to-implement-Python-lt-gt-custom-overloaded-operators[This] example shows a short example of how to implement these dunder methods. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 4 - -We've provided you with the code below: - -[source,python] ----- -class Deck: - _suits = ["clubs", "hearts", "diamonds", "spades"] - _numbers = [str(num) for num in range(2, 11)] + list("jqka") - def __init__(self): - self.cards = [Card(number, suit) for suit in self._suits for number in self._numbers] ----- - -As you can see, we are working on building a `Deck` class. Use the code provided and create an instance of a new `Deck` called `lucky_deck`. Print the cards out to make sure it looks right. Make sure that the `Deck` has the correct number of cards, print the `len` of the `Deck`. What happens? Instead of trying to find the length, try to access and print a single card: `print(lucky_deck[10])`. What happens? - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- 1-2 sentences explaining what happens when you try doing what we ask. -==== - -=== Question 5 - -As it turns out, we can fix both of the issues we ran into in question (4). To fix the issue with `len`, implement the https://docs.python.org/3/reference/datamodel.html#object.\\__len__[`\\__len__`] dunder method. Does it work now? - -To fix the indexing issue, implement the https://docs.python.org/3/reference/datamodel.html#object.\\__getitem__[`\\__getitem__`] dunder method. Test out (but don't forget to re-run to get an updated `lucky_deck`): - -[source,python] ----- -# make sure to re-create your Deck below this line -# these should both work now -len(lucky_deck) # 52 -print(lucky_deck[10]) # q of clubs ----- - -[TIP] -==== -https://medium.com/python-features/magic-methods-demystified-3c9e93144bf7[This] article has examples of both `\\__len__` and `\\__getitem__`. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project13.adoc b/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project13.adoc deleted file mode 100644 index aa9e597a2..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project13.adoc +++ /dev/null @@ -1,430 +0,0 @@ -= STAT 19000: Project 13 -- Spring 2021 - -**Motivation:** We'd be remiss spending almost an entire semester solving data driven problems in python without covering the basics of classes. Whether or not you will ever choose to use this feature in your work, it is best to at least understand some of the basics so you can navigate libraries and other code that does use it. - -**Context:** We've spent nearly the entire semester solving data driven problems in python, and now we are going to learn about one of the primary features in python: classes. Python is an object oriented programming language, and as such, much of python, and the libraries you use in python are objects which have properties and methods. In this project we will explore some of the terminology and syntax relating to classes. - -**Scope:** python - -.Learning objectives -**** -- Explain the basics of object oriented programming, and what a class is. -- Use classes to solve a data-driven problem. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -This is the continuation of the previous project. In this project we will learn about classes by simulating a simplified version of Blackjack! Don't worry, while this may sound intimidating, there is very little coding involved, and much of the code will be provided to you to use! - -== Questions - -=== Question 1 - -In the previous project, we built a `Deck` and `Card` class. What is one other very common task that people do with decks of cards? Shuffle! There is a function in Python called `shuffle`. It can be used like: - -[source,python] ----- -from random import shuffle -my_list = [1,2,3] -print(my_list) -shuffle(my_list) -print(my_list) ----- - -Run the `Deck` and `Card` code from the previous project. Create a `Deck` and try to shuffle it. What happens? - -To fix this, we can implement the `__setitem__` dunder method. This dunder method allows us to "set" a value, much in the same way `__getitem__` allows us to "get" a value. Re-run your `Deck` class and try to shuffle again and print out the first couple of cards to ensure it is truly shuffled. - -.Items to submit -==== -- 1-2 sentences explaining what happens. -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 2 - -Let's take one last look at the `Card` class. In Blackjack, one thing you need to be able to do is count the value of the cards in your hand. Wouldn't it be convenient if we were able to add `Card` objects together like the following? - -[source,python] ----- -print(Card("2", "clubs") + Card("k", "diamonds")) ----- - -In order to do this, implement the `\\__add__` dunder method, and test out the following. - - - -[source,python] ----- -print(Card("2", "clubs") + Card("k", "diamonds")) # 15 -print(Card("k", "hearts") + Card("q", "hearts")) # 25 -print(Card("k", "diamonds") + Card("a", "spades") + Card("5", "hearts")) # what happens with this last example ----- - -What happens with the last example? 
The reason this happens is that the first 2 cards are added together without any issue. Then, we add the final card and everything breaks down. Any guesses why? The reason is that the result of adding the first 2 cards is an integer, 27, and we try to then add $$27 + Card("5", "hearts")$$ and Python doesn't know what to do! As usual, there is a dunder method for that. - -Implement `\\__radd__` and try again. `\\__radd__` will look nearly identical to `\\__add__`, but where you previously added together the result of 2 dictionary lookups, we now just add the plain argument to 1 dictionary lookup. - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 3 - -Okay, rather than force you to stumble through writing a bunch of new code, we are going to provide you with a good amount of code, and have you read through it and digest it. We've provided: - -- A `Card` class with an updated `\\__eq__` method, and updated card values to fit blackjack. Make sure to add your `\\__add__` and `\\__radd__` methods to the provided `Card` class. They are necessary to make everything work. -- An updated `Deck` class with a `draw` method that allows the user to draw a card from the deck and keep track of how many cards are drawn. Make sure to add your `\\__setitem__` and `\\__getitem__` methods to the provided `Deck` class. -- A `Hand` class that represents a "hand" of cards. -- A `Player` class that represents a player. -- A `BlackJack` class that represents a single game of Blackjack. - -=== `Card` - -Make sure to add your `\\__add__` and `\\__radd__` methods to the provided `Card` class. They are necessary to make everything work. - -Methods from the previous project will be added below Saturday morning, or at the latest Monday morning. - -[source,python] ----- -class Card: - _value_dict = {"2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8":8, "9":9, "10": 10, "j": 10, "q": 10, "k": 10, "a": 11} - def __init__(self, number, suit): - if str(number).lower() not in [str(num) for num in range(2, 11)] + list("jqka"): - raise Exception("Number wasn't 2-10 or J, Q, K, or A.") - else: - self.number = str(number).lower() - if suit.lower() not in ["clubs", "hearts", "diamonds", "spades"]: - raise Exception("Suit wasn't one of: clubs, hearts, spades, or diamonds.") - else: - self.suit = suit.lower() - - def __str__(self): - return(f'{self.number} of {self.suit.lower()}') - - def __repr__(self): - return(f'Card(str({self.number}), "{self.suit}")') - - def __eq__(self, other): - if isinstance(other, type(self)): - if self.number == other.number: - return True - else: - return False - else: - if self.number == other: - return True - else: - return False - - def __lt__(self, other): - if self._value_dict[self.number] < self._value_dict[other.number]: - return True - else: - return False - - def __gt__(self, other): - if self._value_dict[self.number] > self._value_dict[other.number]: - return True - else: - return False ----- - -{sp}+ - -=== `Deck` - -Make sure to add your `\\__setitem__` and `\\__getitem__` methods to the provided `Deck` class. - -Methods from the previous project will be added below Saturday morning, or at the latest Monday morning. 
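In the meantime, here is a minimal sketch of the two methods you are being asked to carry over; it assumes, exactly as in the `Hand` class further down, that the cards are stored in a list called `self.cards`.

[source,python]
----
class Deck:
    # ... keep the attributes and methods from the provided class below ...

    def __getitem__(self, key):
        return self.cards[key]

    def __setitem__(self, key, value):
        self.cards[key] = value
----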
- -[source,python] ----- -class Deck: - _suits = ["clubs", "hearts", "diamonds", "spades"] - _numbers = [str(num) for num in range(2, 11)] + list("jqka") - - def __init__(self): - self.cards = [Card(number, suit) for suit in self._suits for number in self._numbers] - self._drawn = 0 - - def __str__(self): - return str(self.cards) - - def __len__(self): - return len(self.cards) - self._drawn - - def draw(self, number_cards = 1): - try: - drawn_cards = self.cards[self._drawn:(self._drawn+number_cards)] - except: - print(f"Can't draw anymore cards, deck empty.") - - self._drawn += number_cards - return drawn_cards ----- - -{sp}+ - -=== `Hand` - -[source,python] ----- -import queue -class Hand: - def __init__(self, *cards): - self.cards = [card for card in cards] - - def __str__(self): - vals = [str(val) for val in self.cards] - return(', '.join(vals)) - - def __repr__(self): - vals = [repr(val) for val in self.cards] - return(', '.join(vals)) - - def __len__(self): - return len(self.cards) - - def __getitem__(self, key): - return self.cards[key] - - def __setitem__(self, key, value): - self.cards[key] = value - - def sum(self): - # remember, when we compare to Ace of Hearts, we are really only comparing the values, - # and ignoring the suit. - number_aces = sum(1 for card in self.cards if card == Card("a", "hearts")) - non_ace_sum = sum(card for card in self.cards if card != Card("a", "hearts")) - - if number_aces == 0: - return non_ace_sum - - else: - # only 2 options 1 ace is 11 the rest 1 or all 1 - high_option = non_ace_sum + number_aces*1 + 10 - low_option = non_ace_sum + number_aces*1 - - if high_option <= 21: - return high_option - else: - return low_option - - def add(self, *cards): - self.cards = self.cards + list(cards) - return self - - def clear(self): - self.cards = [] ----- - -{sp}+ - -=== `Player` - -[source,python] ----- -class Player: - def __init__(self, name, strategy = None, dealer = False): - self.name = name - self.hand = Hand() - self.dealer = dealer - self.wins = 0 - self.draws = 0 - self.losses = 0 - if not self.dealer and not strategy: - print(f"Non-dealer MUST have strategy.") - - self.strategy = strategy - - def __str__(self): - summary = f'''{self.name} ------------- -Wins: {self.wins/(self.wins+self.losses+self.draws):.2%} -Losses: {self.losses/(self.wins+self.losses+self.draws):.2%} -Draws: {self.draws/(self.wins+self.losses+self.draws):.2%}''' - return summary - - def cards(self): - if self.dealer: - return [list(self.hand.cards)[0], "Face down"] - else: - return self.hand ----- - - -=== `BlackJack` - -[source,python] ----- -import sys -class BlackJack: - def __init__(self, *players, dealer = None): - self.players = players - self.deck = Deck() - self.dealt = False - if not dealer: - self.dealer = Player('dealer', dealer=True) - - def deal(self): - # shuffle the deck - shuffle(self.deck) - - # we are ignoring dealing order and dealing to the dealer - # first - for _ in range(2): - self.dealer.hand.add(*self.deck.draw()) - - # deal 2 cards to each player - for player in self.players: - - # first, clear out the players hands in case they've played already - player.hand.clear() - for _ in range(2): - player.hand.add(*self.deck.draw()) - - self.dealt = True - - def play(self): - - # make sure we've dealt - if not self.dealt: - sys.exit("You MUST deal the cards before playing.") - - # if dealer has face up ace or 10, checks to make sure - # doesn't have blackjack. 
- # remember, when we compare to Ace of Hearts, we are really only comparing the values, - # and ignoring the suit. - face_value_ten = (Card("10", "hearts"), Card("j", "hearts"), Card("q", "hearts"), Card("k", "hearts"), Card("a", "hearts")) - if self.dealer.cards()[0] in face_value_ten: - - if self.dealer.hand.sum() == 21: - # winners get a draw, losers - # get a loss - for player in self.players: - if player.hand.sum() == 21: - player.draws += 1 - else: - player.losses += 1 - - return "GAME OVER" - - # if the dealer doesn't win with a blackjack, - # the players now know the dealer doesn't - # have a blackjack - - - # if the dealer doesn't have blackjack - for player in self.players: - # players play using their strategy until they hold - while True: - player_move = player.strategy(self, player) - if player_move == "hit": - player.hand.add(*self.deck.draw()) - else: - break - # dealer draws until >= 17 - while self.dealer.hand.sum() < 17: - self.dealer.hand.add(*self.deck.draw()) - # if the dealer gets 21, players who get 21 draw - # other lose - if self.dealer.hand.sum() == 21: - for player in self.players: - if player.hand.sum() == 21: - player.draws += 1 - else: - player.losses += 1 - # otherwise, dealer has < 21, anyone with more wins, same draws, - # and less loses - elif self.dealer.hand.sum() < 21: - for player in self.players: - if player.hand.sum() > 21: - # player busts - player.losses += 1 - elif player.hand.sum() > self.dealer.hand.sum(): - # player wins - player.wins += 1 - elif player.hand.sum() == self.dealer.hand.sum(): - # player ties - player.draws += 1 - else: - # player loses - player.losses += 1 - # if dealer busts, players who didn't bust, win - # players who busted, lose -- this is the house's edge - else: - for player in self.players: - if player.hand.sum() < 21: - # player won - player.wins += 1 - else: - # player busted - player.losses += 1 - - return "GAME OVER" ----- - -Read and understand the `Hand` class. Create a hand containing the: Ace of Diamonds, King of Hearts, Ace of Spades. Print the sum of the `Hand`. Add the 8 of Hearts to your `Hand`, and print the new sum. Do things appear to be working okay? - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 4 - -If you take a look at the `Player` class and inside of the `BlackJack` class, you may notice something we refer to as a "strategy". We define a strategy as any function that accepts a `BlackJack` object, and a `Player` object, and returns either a `str` "hit" or a `str` "hold". Here are a couple examples of "strategies": - -[source,python] ----- -def always_hit_once(my_blackjack_game, me) -> str: - """ - This is a simple strategy where the player - always hits once. - """ - if len(me.hand) == 3: - return "hold" - else: - return "hit" ----- - -[source,python] ----- -def seventeen_plus(my_blackjack_game, me) -> str: - """ - This is a simple strategy where the player holds if the sum - of cards is 17+, and hits otherwise. - """ - if me.hand.sum() >= 17: - return "hold" - else: - return "hit" ----- - -When you create a `Player` object, you provide a strategy as an argument, and that player uses the strategy inside of a `BlackJack` game. - -Create 2 or more `Player` objects using any of the provided strategies. Create 1000 or more `BlackJack` games with those players, and play the games. Print the results for each player. - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. 
-==== - -=== Question 5 - -Create your own strategy, make new games, and see how your strategy compares to the other provided strategies. Optionally, create a plot that illustrates the differences in the strategy. - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- (Optionally, 0 pts) The plot described. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project14.adoc b/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project14.adoc deleted file mode 100644 index 7ebf77301..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/19000/19000-s2021-project14.adoc +++ /dev/null @@ -1,131 +0,0 @@ -= STAT 19000: Project 14 -- Spring 2021 - -**Motivation:** We covered a _lot_ this year! When dealing with data driven projects, it is useful to explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance, in this project we are going to practice using some of the skills you've learned, and review topics and languages in a generic way. - -**Context:** We are on the last project where we will leave it up to you on how to solve the problems presented. - -**Scope:** python, r, bash, unix, computers - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -- `/class/datamine/data/disney` -- `/class/datamine/data/movies_and_tv/imdb.db` -- `/class/datamine/data/amazon/music.txt` -- `/class/datamine/data/craigslist/vehicles.csv` -- `/class/datamine/data/flights/2008.csv` - -== Questions - -[IMPORTANT] -==== -Answer the questions below using the language of your choice (R, Python, bash, awk, etc.). Don't feel limited by one language, you can use different languages to answer different questions. If you are feeling bold, you can also try answering the questions using all languages! -==== - -=== Question 1 - -What percentage of flights in 2008 had a delay due to the weather? Use the `/class/datamine/data/flights/2008.csv` dataset to answer this question. - -[TIP] -==== -Consider a flight to have a weather delay if `WEATHER_DELAY` is greater than 0. -==== - -.Items to submit -==== -- The code used to solve the question. -- The answer to the question. -==== - - -=== Question 2 - -Which listed manufacturer has the most expensive previously owned car listed in Craiglist? Use the `/class/datamine/data/craigslist/vehicles.csv` dataset to answer this question. Only consider listings that have listed price less than $500,000 _and_ where manufacturer information is available. - -.Items to submit -==== -- The code used to solve the question. -- The answer to the question. -==== - -=== Question 3 - -What is the most common and least common `type` of title in imdb ratings? Use the `/class/datamine/data/movies_and_tv/imdb.db` dataset to answer this question. - -[TIP] -==== -Use the `titles` table. -==== - -[TIP] -==== -Don't know how to use SQL yet? To get this data into an R data.frame, for example: - -[source,r] ----- -library(tidyverse) -con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") -myDF <- tbl(con, "titles") ----- -==== - -.Items to submit -==== -- The code used to solve the question. -- The answer to the question. 
-==== - -=== Question 4 - -What percentage of music reviews contain the words 'hate' or 'hated', and what percentage contain the words 'love' or 'loved'? Use the `/class/datamine/data/amazon/music.txt` dataset to answer this question. - -[TIP] -==== -It _may_ take a minute to run, depending on the tool you use. -==== - -.Items to submit -==== -- The code used to solve the question. -- The answer to the question. -==== - -=== Question 5 - -What is the best time to visit Disney? Use the data provided in `/class/datamine/data/disney` to answer the question. - -First, you will need determine what you will consider "time", and the criteria you will use. See below some examples. Don't feel limited by them! Be sure to explain your criteria, use the data to investigate, and determine the best time to visit! Write 1-2 sentences commenting on your findings. - -- As Splash Mountain is my favorite ride, my criteria is the smallest monthly average wait times for Splash Mountain between the years 2017 and 2019. I'm only considering these years as I expect them to be more representative. My definition of "best time" will be the "best months". -- Consider "best times" the days of the week that have the smallest wait time on average for all rides, or for certain favorite rides. -- Consider "best times" the season of the year where the park is open for longer hours. -- Consider "best times" the weeks of the year with smallest average high temperature in the day. - -.Items to submit -==== -- The code used to solve the question. -- 1-2 sentences detailing the criteria you are going to use, its logic, and your defition for "best time". -- The answer to the question. -- 1-2 sentences commenting on your answer. -==== - -=== Question 6 - -Finally, use RMarkdown (and its formatting) to outline 3 things you learned this semester from The Data Mine. For each thing you learned, give a mini demonstration where you highlight with text and code the thing you learned, and why you think it is useful. If you did not learn anything this semester from The Data Mine, write about 3 things you _want_ to learn. Provide examples that demonstrate _what_ you want to learn and write about _why_ it would be useful. - -[IMPORTANT] -==== -Make sure your answer to this question is formatted well and makes use of RMarkdown. -==== - -.Items to submit -==== -- 3 clearly labeled things you learned. -- 3 mini-demonstrations where you highlight with text and code the thin you learned, and why you think it is useful. -OR -- 3 clearly labeled things you _want_ to learn. -- 3 examples demonstrating _what_ you want to learn, with accompanying text explaining _why_ you think it would be useful. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project01.adoc b/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project01.adoc deleted file mode 100644 index 08806397f..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project01.adoc +++ /dev/null @@ -1,181 +0,0 @@ -= STAT 29000: Project 1 -- Spring 2021 - -**Motivation:** Extensible Markup Language or XML is a very important file format for storing structured data. Even though formats like JSON, and csv tend to be more prevalent, many, many legacy systems still use XML, and it remains an appropriate format for storing complex data. 
In fact, JSON and csv are quickly becoming less relevant as new formats and serialization methods like https://arrow.apache.org/faq/[parquet] and https://developers.google.com/protocol-buffers[protobufs] are becoming more common. - -**Context:** In previous semesters we've explored XML. In this project we will refresh our skills and, rather than exploring XML in R, we will use the `lxml` package in Python. This is the first project in a series of 5 projects focused on web scraping in R and Python. - -**Scope:** python, XML - -.Learning objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Match XML terms to sections of XML demonstrating working knowledge. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/apple/health/watch_dump.xml` - -== Resources - -We realize that for many of you this is a big "jump" right into Python. Don't worry! Python is a very intuitive language with a clean syntax. It is easy to read and write. We will do our very best to keep things as straightforward as possible, especially in the early learning stages of the class. - -We will be actively updating the examples book with videos and more examples throughout the semester. Ask a question in Piazza and perhaps we will add an example straight to the book to help out. - -Some potentially useful resources for the semester include: - -- The STAT 19000 projects. We are easing 19000 students into Python and will post solutions each week. It would be well worth 10 minutes to look over the questions and solutions each week. -- https://towardsdatascience.com/cheat-sheet-for-python-dataframe-r-dataframe-syntax-conversions-450f656b44ca[Here] is a decent cheat sheet that helps you quickly get an idea of how to do something you know how to do in R, in Python. -- The Examples Book -- updating daily with more examples and videos. Be sure to click on the "relevant topics" links as we try to point you to topics with examples that should be particularly useful to solve the problems we assign. - -== Questions - -[IMPORTANT] -==== -It would be well worth your time to read through the XML section of the book, as well as take the time to work through https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html[`pandas` 10 minute intro]. -==== - -=== Question 1 - -A good first step when working with XML is to get an idea how your document is structured. Normally, there should be good documentation that spells this out for you, but it is good to know what to do when you _don't_ have the documentation. Start by finding the "root" node. What is the name of the root node of the provided dataset? - -[TIP] -==== -Make sure to import the `lxml` package first: - -[source,python] ----- -from lxml import etree ----- -==== - -Here are two videos about running Python in RStudio... - -++++ - -++++ - -++++ - -++++ - -...and here is a video about XML scraping in Python: - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 2 - -Remember, XML can be nested. In question (1) we figured out what the root node was called. What are the names of the next "tier" of elements? - -[TIP] -==== -Now that we know the root node, you could use the root node name as a part of your xpath expression. 
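For instance, assuming your parsed tree is stored in a variable named `tree` and the root node turned out to be `HealthData`, a sketch of such an expression is:

[source,python]
----
# the * matches any element exactly one "tier" below the root node
second_tier_elements = tree.xpath("/HealthData/*")
----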
-==== - -[TIP] -==== -As you may have noticed in question (1) the `xpath` method returns a list. Sometimes this list can contain many repeated tag names. Since our goal is to see the names of the second "tier" elements, you could convert the resulting `list` to a `set` to quickly see the unique list as a `set` only contains unique values. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 3 - -Continue to explore each "tier" of data until there isn't any left. Name the "full paths" of all of the "last tier" tags. - -[TIP] -==== -Let's say a "last tier" tag is just a path where there are no more nested elements. For example, `/HealthData/Workout/WorkoutRoute/FileReference` is a "last tier" tag. If you try and get the nested elements for it, they don't exist: - -[source,python] ----- -tree.xpath("/HealthData/Workout/WorkoutRoute/FileReference/*") ----- -==== - -[TIP] -==== -Here are 3 of the 7 "full paths": - -```` -/HealthData/Workout/WorkoutRoute/FileReference -/HealthData/Record/MetadataEntry -/HealthData/ActivitySummary -```` -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 4 - -At this point in time you may be asking yourself "but where is the data"? Depending on the structure of the XML file, the data could either be between tags like: - -[source,HTML] ----- -mydata ----- - -Or, it could be in an attribute: - -[source,HTML] ----- -What is cat spelled backwards? ----- - -Collect the "ActivitySummary" data, and convert the list of dicts to a `pandas` DataFrame. The following is an example of converting a list of dicts to a `pandas` DataFrame called `myDF`: - -[source,python] ----- -import pandas as pd -list_of_dicts = [] -list_of_dicts.append({'columnA': 1, 'columnB': 2}) -list_of_dicts.append({'columnB': 4, 'columnA': 1}) -myDF = pd.DataFrame(list_of_dicts) ----- - -[TIP] -==== -It is important to note that an element's "attrib" attribute looks and feels like a `dict`, but it is actually a `lxml.etree._Attrib`. If you try to convert a list of `lxml.etree._Attrib` to a `pandas` DataFrame, it will not work out as you planned. Make sure to first convert each `lxml.etree._Attrib` to a `dict` before converting to a DataFrame. You can do so like: - -[source,python] ----- -# this will convert a single `lxml.etree._Attrib` to a dict -my_dict = dict(my_lxml_etree_attrib) ----- -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -`pandas` is a Python package that provides the DataFrame and Series classes. A DataFrame is very similar to a data.frame in R and can be used to manipulate the data within very easily. A Series is the class that handles a single column of a DataFrame. Go through the https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html[`pandas` in 10 minutes] page from the official documentation. Sort, find, and print the top 5 rows of data based on the "activeEnergyBurned" column. - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project02.adoc b/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project02.adoc deleted file mode 100644 index 2e2326176..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project02.adoc +++ /dev/null @@ -1,137 +0,0 @@ -= STAT 29000: Project 2 -- Spring 2021 - -**Motivation:** Web scraping is is the process of taking content off of the internet. Typically this goes hand-in-hand with parsing or processing the data. Depending on the task at hand, web scraping can be incredibly simple. With that being said, it can quickly become difficult. Typically, students find web scraping fun and empowering. - -**Context:** In the previous project we gently introduced XML and xpath expressions. In this project, we will learn about web scraping, scrape data from The New York Times, and parse through our newly scraped data using xpath expressions. - -**Scope:** python, web scraping, xml - -.Learning objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Use the requests package to scrape a web page. -- Use the lxml package to filter and parse data from a scraped web page. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -You will be extracting your own data from online in this project. There is no base dataset. - -== Questions - -=== Question 1 - -The New York Times is one of the most popular newspapers in the United States. Open a modern browser (preferably Firefox or Chrome), and navigate to https://nytimes.com. - -By the end of this project you will be able to scrape some data from this website! The first step is to explore the structure of the website. You can either right click and click on "view page source", which will pull up a page full of HTML used to render the page. Alternatively, if you want to focus on a single element, an article title, for example, right click on the article title and click on "inspect element". This will pull up an inspector that allows you to see portions of the HTML. - -Click around the website and explore the HTML however you see fit. Open a few front page articles and notice how most articles start with a bunch of really important information, namely: an article title, summary, picture, picture caption, picture source, author portraits, authors, and article datetime. - -For example: - -https://www.nytimes.com/2021/01/19/us/politics/trump-china-xinjiang.html - -![](./images/nytimes_image.jpg) - -Copy and paste the **h1** element (in its entirety) containing the article title (for the article provided) in an HTML code chunk. Do the same for the same article's summary. - -++++ - -++++ - -.Items to submit -==== -- 2 code chunks containing the HTML requested. -==== - -=== Question 2 - -In question (1) we copied two elements of an article. When scraping data from a website, it is important to continually consider the patterns in the structure. Specifically, it is important to consider whether or not the defining characteristics you use to parse the scraped data will continue to be in the same format for _new_ data. What do I mean by defining characterstic? I mean some combination of tag, attribute, and content from which you can isolate the data of interest. 
- -For example, given a link to a new nytimes article, do you think you could isolate the article title by using the `id="link-4686dc8b"` attribute of the **h1** tag? Maybe, or maybe not, but it sure seems like "link-4686dc8b" might be unique to the article and not able to be used given a new article. - -Write an xpath expression to isolate the article title, and another xpath expression to isolate the article summary. - -[IMPORTANT] -==== -You do _not_ need to test your xpath expression yet, we will be doing that shortly. -==== - -.Items to submit -==== -- Two xpath expressions in an HTML code chunk. -==== - -=== Question 3 - -Use the `requests` package to scrape the webpage containing our article from questions (1) and (2). Use the `lxml.html` package and the `xpath` method to test out your xpath expressions from question (2). Did they work? Print the content of the elements to confirm. - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 4 - -Here are a list of article links from https://nytimes.com: - -https://www.nytimes.com/2021/01/19/us/politics/trump-china-xinjiang.html - -https://www.nytimes.com/2021/01/06/technology/personaltech/tech-2021-augmented-reality-chatbots-wifi.html - -https://www.nytimes.com/2021/01/13/movies/letterboxd-growth.html - -Write a function called `get_article_and_summary` that accepts a string called `link` as an argument, and returns both the article title and summary. Test `get_article_and_summary` out on each of the provided links: - -[source,python] ----- -title, summary = get_article_and_summary('https://www.nytimes.com/2021/01/19/us/politics/trump-china-xinjiang.html') -print(f'Title: {title}, Summary: {summary}') -title, summary = get_article_and_summary('https://www.nytimes.com/2021/01/06/technology/personaltech/tech-2021-augmented-reality-chatbots-wifi.html') -print(f'Title: {title}, Summary: {summary}') -title, summary = get_article_and_summary('https://www.nytimes.com/2021/01/13/movies/letterboxd-growth.html') -print(f'Title: {title}, Summary: {summary}') ----- - -[TIP] -==== -The first line of your function should look like this: - -`def get_article_and_summary(myURL: str) -> (str, str):` -==== - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -In question (1) we mentioned a myriad of other important information given at the top of most New York Times articles. Choose **one** other listed pieces of information and copy, paste, and update your solution to question (4) to scrape and return those chosen pieces of information. - -[IMPORTANT] -==== -If you choose to scrape non-textual data, be sure to return data of an appropriate type. For example, if you choose to scrape one of the images, either print the image or return a PIL object. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project03.adoc b/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project03.adoc deleted file mode 100644 index deef4c711..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project03.adoc +++ /dev/null @@ -1,190 +0,0 @@ -= STAT 29000: Project 3 -- Spring 2021 - -**Motivation:** Web scraping takes practice, and it is important to work through a variety of common tasks in order to know how to handle those tasks when you next run into them. In this project, we will use a variety of scraping tools in order to scrape data from https://trulia.com. - -**Context:** In the previous project, we got our first taste at actually scraping data from a website, and using a parser to extract the information we were interested in. In this project, we will introduce some tasks that will require you to use a tool that let's you interact with a browser, selenium. - -**Scope:** python, web scraping, selenium - -.Learning objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Use the requests package to scrape a web page. -- Use the lxml package to filter and parse data from a scraped web page. -- Use selenium to interact with a browser in order to get a web page to a desired state for scraping. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -Visit https://trulia.com. Many websites have a similar interface, i.e. a bold and centered search bar for a user to interact with. Using `selenium` write Python code that that first finds the `input` element, and then types "West Lafayette, IN" followed by an emulated "Enter/Return". Confirm you code works by printing the url after that process completes. - -[TIP] -==== -You will want to use `time.sleep` to pause a bit after the search so the updated url is returned. -==== - -++++ - -++++ - -That video is already relevant for Question 2 too. - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 2 - -Use your code from question (1) to test out the following queries: - -- West Lafayette, IN (City, State) -- 47906 (Zip) -- 4505 Kahala Ave, Honolulu, HI 96816 (Full address) - -If you look closely you will see that there are patterns in the url. For example, the following link would probably bring up homes in Crawfordsville, IN: https://trulia.com/IN/Crawfordsville. With that being said, if you only had a zip code, like 47933, it wouldn't be easy to guess https://www.trulia.com/IN/Crawfordsville/47933/, hence, one reason why the search bar is useful. - -If you used xpath expressions to complete question (1), instead use a https://selenium-python.readthedocs.io/locating-elements.html#locating-elements[different method] to find the `input` element. If you used a different method, use xpath expressions to complete question (1). - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 3 - -Let's call the page after a city/state or zipcode search a "sales page". For example: - -![](./images/trulia.png) - -Use `requests` to scrape the entire page: https://www.trulia.com/IN/West_Lafayette/47906/. Use `lxml.html` to parse the page and get all of the `img` elements that make up the house pictures on the left side of the website. 
- -[IMPORTANT] -==== -Make sure you are actually scraping what you think you are scraping! Try printing your html to confirm it has the content you think it should have: - -[source,python] ----- -import requests -response = requests.get(...) -print(response.text) ----- -==== - -[TIP] -==== -Are you human? Depends. Sometimes if you add a header to your request, it won't ask you if you are human. Let's pretend we are Firefox: - -[source,python] ----- -import requests -my_headers = {'User-Agent': 'Mozilla/5.0'} -response = requests.get(..., headers=my_headers) ----- -==== - -Okay, after all of that work you may have discovered that only a few images have actually been scraped. If you cycle through all of the `img` elements and try to print the value of the `src` attribute, this will be clear: - -[source,python] ----- -import lxml.html -tree = lxml.html.fromstring(response.text) -elements = tree.xpath("//img") -for element in elements: - print(element.attrib.get("src")) ----- - -This is because the webpage is not immediately, _completely_ loaded. This is a common website behavior to make things appear faster. If you pay close to when you load https://www.trulia.com/IN/Crawfordsville/47933/, and you quickly scroll down, you will see images still needing to finish rendering all of the way, slowly. What we need to do to fix this, is use `selenium` (instead of `lxml.html`) to behave like a human and scroll prior to scraping the page! Try using the following code to slowly scroll down the page before finding the elements: - -[source,python] ----- -# driver setup and get the url -# Needed to get the window size set right and scroll in headless mode -myheight = driver.execute_script('return document.body.scrollHeight') -driver.set_window_size(1080,myheight+100) -def scroll(driver, scroll_point): - driver.execute_script(f'window.scrollTo(0, {scroll_point});') - time.sleep(5) - -scroll(driver, myheight*1/4) -scroll(driver, myheight*2/4) -scroll(driver, myheight*3/4) -scroll(driver, myheight*4/4) -# find_elements_by_* ----- - -[TIP] -==== -At the time of writing there should be about 86 links to images of homes. -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 4 - -Write a function called `avg_house_cost` that accepts a zip code as an argument, and returns the average cost of the first page of homes. Now, to make this a more meaningful statistic, filter for "3+" beds and _then_ find the average. Test `avg_house_cost` out on the zip code `47906` and print the average costs. - -[IMPORTANT] -==== -Use `selenium` to "click" on the "3+ beds" filter. -==== - -[TIP] -==== -If you get an error that tells you `button` is not clickable because it is covered by an `li` element, try clicking on the `li` element instead. -==== - -[TIP] -==== -You will want to wait a solid 10-15 seconds for the sales page to load before trying to select or click on anything. -==== - -[TIP] -==== -Your results may end up including prices for "Homes Near \". This is okay. Even better if you manage to remove those results. If you _do_ choose to remove those results, take a look at the `data-testid` attribute with value `search-result-list-container`. Perhaps only selecting the children of the first element will get the desired outcome. 
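A sketch of that idea, using the attribute mentioned above (treat it as a starting point rather than something verified against the live page):

[source,python]
----
# every element carrying that data-testid; the first one should be the main results list
containers = driver.find_elements_by_xpath('//*[@data-testid="search-result-list-container"]')

# keep only the direct children of the FIRST container, skipping the "Homes Near ..." results
main_results = containers[0].find_elements_by_xpath('./*')
----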
-==== - -[TIP] -==== -You can use the following code to remove the non-numeric text from a string, and then convert to an integer: - -[source,python] ----- -import re -int(re.sub("[^0-9]", "", "removenon45454_numbers$")) ----- -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -Get creative. Either add an interesting feature to your function from (4), or use `matplotlib` to generate some sort of accompanying graphic with your output. Make sure to explain what your additi - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project04.adoc b/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project04.adoc deleted file mode 100644 index 1d351dda9..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project04.adoc +++ /dev/null @@ -1,293 +0,0 @@ -= STAT 29000: Project 4 -- Spring 2021 - -**Motivation:** In this project we will continue to hone your web scraping skills, introduce you to some "gotchas", and give you a little bit of exposure to a powerful tool called cron. - -**Context:** We are in the second to last project focused on web scraping. This project will introduce some supplementary tools that work well with web scraping: cron, sending emails from Python, etc. - -**Scope:** python, web scraping, selenium, cron - -.Learning objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Use the requests package to scrape a web page. -- Use the lxml package to filter and parse data from a scraped web page. -- Use the beautifulsoup4 package to filter and parse data from a scraped web page. -- Use selenium to interact with a browser in order to get a web page to a desired state for scraping. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -Check out the following website: https://project4.tdm.wiki - -Use `selenium` to scrape and print the 6 colors of pants offered. - -++++ - -++++ - -++++ - -++++ - -[TIP] -==== -You _may_ have to interact with the webpage for certain elements to render. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 2 - -Websites are updated frequently. You can imagine a scenario where a change in a website is a sign that there is more data available, or that something of note has happened. This is a fake website designed to help students emulate real changes to a website. Specifically, there is one part of the website that has two possible states (let's say, state `A` and state `B`). Upon refreshing the website, or scraping the website again, there is an $$x%$$ chance that the website will be in state `A` and a $$1-x%$$ chance the website will be in state `B`. - -Describe the two states (the thing (element or set of elements) that changes as you refresh the page), and scrape the website enough to estimate $$x$$. - -++++ - -++++ - -[TIP] -==== -You _will_ need to interact with the website to "see" the change. 
-==== - -[TIP] -==== -Since we are just asking about a state, and not any specific element, you could use the `page_source` attribute of the `selenium` driver to scrape the entire page instead of trying to use xpath expressions to find a specific element. -==== - -[TIP] -==== -Your estimate of $$x$$ does not need to be perfect. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- What state `A` and `B` represent. -- An estimate for `x`. -==== - -=== Question 3 - -Dig into the changing "thing" from question (2). What specifically is changing? Use selenium and xpath expressions to scrape and print the content. What are the two possible values for the content? - -++++ - -++++ - -[TIP] -==== -Due to the changes that occur when a button is clicked, I'd highly advice you to use the `data-color` attribute in your xpath expression instead of `contains(text(), 'blahblah')`. -==== - -[TIP] -==== -`parent::` and `following-sibling::` may be useful https://www.w3schools.com/xml/xpath_axes.asp[xpath axes] to use. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 4 - -The following code allows you to send an email using Python from your Purdue email account. Replace the username and password with your own information and send a test email to yourself to ensure that it works. - -++++ - -++++ - -[IMPORTANT] -==== -Do **NOT** include your password in your homework submission. Any time you need to type your password in you final submission just put something like "SUPERSECRETPASSWORD" or "MYPASSWORD". -==== - -[TIP] -==== -To include an image (or screenshot) in RMarkdown, try `![](./my_image.png)` where `my_image.png` is inside the same folder as your `.Rmd` file. -==== - -[TIP] -==== -The spacing and tabs near the `message` variable are very important. Make sure to copy the code exactly. Otherwise, your subject may not end up in the subject of your email, or the email could end up being blank when sent. -==== - -[TIP] -==== -Questions 4 and 5 were inspired by examples and borrowed from the code found at the https://realpython.com/python-send-email/[Real Python] website. -==== - -[source,python] ----- -def send_purdue_email(my_purdue_email, my_password, to, my_subject, my_message): - import smtplib, ssl - from email.mime.text import MIMEText - from email.mime.multipart import MIMEMultipart - - message = MIMEMultipart("alternative") - message["Subject"] = my_subject - message["From"] = my_purdue_email - message["To"] = to - - # Create the plain-text and HTML version of your message - text = f'''\ -Subject: {my_subject} -To: {to} -From: {my_purdue_email} - -{my_message}''' - html = f'''\ - - - {my_message} - - -''' - # Turn these into plain/html MIMEText objects - part1 = MIMEText(text, "plain") - part2 = MIMEText(html, "html") - - # Add HTML/plain-text parts to MIMEMultipart message - # The email client will try to render the last part first - message.attach(part1) - message.attach(part2) - - context = ssl.create_default_context() - with smtplib.SMTP("smtp.purdue.edu", 587) as server: - server.ehlo() # Can be omitted - server.starttls(context=context) - server.ehlo() # Can be omitted - server.login(my_purdue_email, my_password) - server.sendmail(my_purdue_email, to, message.as_string()) - -# this sends an email from kamstut@purdue.edu to mdw@purdue.edu -# replace supersecretpassword with your own password -# do NOT include your password in your homework submission. 
-send_purdue_email("kamstut@purdue.edu", "supersecretpassword", "mdw@purdue.edu", "put subject here", "put message body here") ----- - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- Screenshot showing your received the email. -==== - -=== Question 5 - -The following is the content of a new Python script called `is_in_stock.py`: - -[source,python] ----- -def send_purdue_email(my_purdue_email, my_password, to, my_subject, my_message): - import smtplib, ssl - from email.mime.text import MIMEText - from email.mime.multipart import MIMEMultipart - - message = MIMEMultipart("alternative") - message["Subject"] = my_subject - message["From"] = my_purdue_email - message["To"] = to - - # Create the plain-text and HTML version of your message - text = f'''\ -Subject: {my_subject} -To: {to} -From: {my_purdue_email} - -{my_message}''' - html = f'''\ - - - {my_message} - - -''' - # Turn these into plain/html MIMEText objects - part1 = MIMEText(text, "plain") - part2 = MIMEText(html, "html") - - # Add HTML/plain-text parts to MIMEMultipart message - # The email client will try to render the last part first - message.attach(part1) - message.attach(part2) - - context = ssl.create_default_context() - with smtplib.SMTP("smtp.purdue.edu", 587) as server: - server.ehlo() # Can be omitted - server.starttls(context=context) - server.ehlo() # Can be omitted - server.login(my_purdue_email, my_password) - server.sendmail(my_purdue_email, to, message.as_string()) - -def main(): - # scrape element from question 3 - - # does the text indicate it is in stock? - - # if yes, send email to yourself telling you it is in stock. - - # otherwise, gracefully end script using the "pass" Python keyword -if __name__ == "__main__": - main() ----- - -First, make a copy of the script in your `$HOME` directory: - -[source,bash] -cp /class/datamine/data/scraping/is_in_stock.py $HOME/is_in_stock.py -``` - -If you now look in the "Files" tab in the lower right hand corner of RStudio, and click the refresh button, you should see the file `is_in_stock.py`. You can open and modify this file directly in RStudio. Before you do so, however, change the permissions of the `$HOME/is_in_stock.py` script so only YOU can read, write, and execute it: - -[source,bash] ----- -chmod 700 $HOME/is_in_stock.py ----- - -The script should now appear in RStudio, in your home directory, with the correct permissions. Open the script (in RStudio) and fill in the `main` function as indicated by the comments. We want the script to scrape to see whether the pants from question 3 are in stock or not. - -A cron job is a task that runs at a certain interval. Create a cron job that runs your script, `/class/datamine/apps/python/f2020-s2021/env/bin/python $HOME/is_in_stock.py` every 5 minutes. Wait 10-15 minutes and verify that it is working properly. The long path, `/class/datamine/apps/python/f2020-s2021/env/bin/python` simply makes sure that our script is run with access to all of the packages in our course environment. `$HOME/is_in_stock.py` is the path to your script (`$HOME` expands or transforms to `/home/`). - -++++ - -++++ - -++++ - -++++ - -[TIP] -==== -If you struggle to use the text editor used with the `crontab -e` command, be sure to continue reading the cron section of the book. We highlight another method that may be easier. -==== - -[TIP] -==== -Don't forget to copy your import statements from question (3) as well. 
-==== - -[IMPORTANT] -==== -Once you are finished with the project, if you no longer wish to receive emails every so often, follow the instructions here to remove the cron job. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- The content of your cron job in a bash code chunk. -- The content of your `is_in_stock.py` script. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project05.adoc b/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project05.adoc deleted file mode 100644 index 428b9b254..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project05.adoc +++ /dev/null @@ -1,162 +0,0 @@ -= STAT 29000: Project 5 -- Spring 2021 - -**Motivation:** One of the best things about learning to scrape data is the many applications of the skill that may pop into your mind. In this project, we want to give you some flexibility to explore your own ideas, but at the same time, add a couple of important tools to your tool set. We hope that you've learned a lot in this series, and can think of creative ways to utilize your new skills. - -**Context:** This is the last project in a series focused on scraping data. We have created a couple of very common scenarios that can be problematic when first learning to scrape data, and we want to show you how to get around them. - -**Scope:** python, web scraping, etc. - -.Learning objectives -**** -- Use the requests package to scrape a web page. -- Use the lxml/selenium package to filter and parse data from a scraped web page. -- Learn how to step around header-based filtering. -- Learn how to handle rate limiting. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -It is not uncommon to be blocked from scraping a website. There are a variety of strategies that they use to do this, and in general they work well. In general, if a company wants you to extract information from their website, they will make an API (application programming interface) available for you to use. One method (that is commonly paired with other methods) is blocking your request based on _headers_. You can read about headers https://developer.mozilla.org/en-US/docs/Glossary/Request_header[here]. In general, you can think of headers as some extra data that gives the server or client context. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers[Here] is a list of headers, and some more explanation. - -Each header has a purpose. One common header is called the https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent[User-Agent header]. A User-Agent looks something like: - ----- -User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:86.0) Gecko/20100101 Firefox/86.0 ----- - -You can see headers if you open the console in Firefox or Chrome and load a website. It will look something like: - -![](./images/headers01.png) - -From the https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent[mozilla link], this header is a string that "lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent." Basically, if you are browsing the internet with a common browser, the server will know what you are using. 
In the provided example, we are using Firefox 86 from Mozilla, on a Mac running Mac OS 10.16 with an Intel processor. - -When we send a request from a package like `requests` in Python, here is what the headers look like: - -[source,python] ----- -import requests -response = requests.get("https://project5-headers.tdm.wiki") -print(response.request.headers) ----- - ----- -{'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'} ----- - -As you can see our User-Agent is `python-requests/2.25.1`. You will find that many websites block requests made from anything such user agents. One such website is: https://project5-headers.tdm.wiki. - -Scrape https://project5-headers.tdm.wiki from Scholar and explain what happens. What is the response code, and what does that response code mean? Can you ascertain what you would be seeing (more or less) in a browser based on the text of the response (the actual HTML)? Read https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers[this section of the documentation for the `headers` package], and attempt to "trick" https://project5-headers.tdm.wiki into presenting you with the desired information. The desired information should look something like: - ----- -Hostname: c1de5faf1daa -IP: 127.0.0.1 -IP: 172.18.0.4 -RemoteAddr: 172.18.0.2:34520 -GET / HTTP/1.1 -Host: project5-headers.tdm.wiki -User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:86.0) Gecko/20100101 Firefox/86.0 -Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8 -Accept-Encoding: gzip -Accept-Language: en-US,en;q=0.5 -Cdn-Loop: cloudflare -Cf-Connecting-Ip: 107.201.65.5 -Cf-Ipcountry: US -Cf-Ray: 62289b90aa55f975-EWR -Cf-Request-Id: 084d3f8e740000f975e0038000000001 -Cf-Visitor: {"scheme":"https"} -Cookie: __cfduid=d9df5daa57fae5a4e425173aaaaacbfc91613136177 -Dnt: 1 -Sec-Gpc: 1 -Upgrade-Insecure-Requests: 1 -X-Forwarded-For: 123.123.123.123 -X-Forwarded-Host: project5-headers.tdm.wiki -X-Forwarded-Port: 443 -X-Forwarded-Proto: https -X-Forwarded-Server: 6afe64faffaf -X-Real-Ip: 123.123.123.123 ----- - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Response code received (a number), and an explanation of what that HTTP response code means. -- What you would (probably) be seeing in a browser if you were blocked. -- Python code used to "trick" the website into being scraped. -- The content of the successfully scraped site. -==== - -=== Question 2 - -Open a browser and navigate to: https://project5-rate-limit.tdm.wiki/. While at first glance, it will seem identical to https://project5-headers.tdm.wiki/, it is not. https://project5-rate-limit.tdm.wiki/ is rate limited based on IP address. Depending on when you are completing this project, this may or may not be obvious. If you refresh your browser fast enough, instead of receiving a bunch of information, you will receive text that says "Too Many Requests". 
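-
-You can reproduce the same behavior from Python. The short sketch below is only an illustration (not part of the graded questions), and it assumes the phrase "Too Many Requests" appears verbatim in the response body once you are rate limited:
-
-[source,python]
-----
-import requests
-
-# request the page a handful of times and watch for the rate limit message
-for _ in range(5):
-    resp = requests.get("https://project5-rate-limit.tdm.wiki")
-    if "Too Many Requests" in resp.text:
-        print("Rate limited!")
-    else:
-        print("Got a normal response.")
-----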
-
-The following function tries to scrape the `Cf-Request-Id` header, which will have a unique value for each request:
-
-[source,python]
-----
-import requests
-import lxml.html
-def scrape_cf_request_id(url):
-    resp = requests.get(url)
-    tree = lxml.html.fromstring(resp.text)
-    content = tree.xpath("//p")[0].text.split('\n')
-    cfid = [l for l in content if 'Cf-Request-Id' in l][0].split()[1]
-    return cfid
-----
-
-You can test it out:
-
-[source,python]
-----
-scrape_cf_request_id("https://project5-rate-limit.tdm.wiki")
-----
-
-Write code to scrape 10 unique `Cf-Request-Id`s (in a loop), and save them to a list called `my_ids`. What happens when you run the code? The error you see is caused by our expected text not being present; instead, the page contains the "Too Many Requests" text. While normally a failure like this would surface as something more descriptive, like an HTTPError or a Timeout exception, it _could_ be anything, depending on your code.
-
-One solution that might come to mind is to "wait" between each loop using `time.sleep()`. While yes, this may work, it is not a robust solution. Other users from your IP address may count towards your rate limit and cause your function to fail, the required wait time may change dynamically, or it may even be manually adjusted to be longer, etc. The best way to handle this is to use something called exponential backoff.
-
-In a nutshell, exponential backoff is a way to increase the wait time (exponentially) until an acceptable rate is found. https://pypi.org/project/backoff/[`backoff`] is an excellent package to do just that. `backoff`, upon being triggered by a specified error or exception, will wait to "try again" until a certain amount of time has passed. Upon receiving the same error or exception again, the time to wait will increase exponentially. Use `backoff` to modify the provided `scrape_cf_request_id` function to use exponential backoff when the error we alluded to occurs. Test out the modified function in a loop and print the resulting 10 `Cf-Request-Id`s.
-
-++++
-
-++++
-
-[NOTE]
-====
-`backoff` utilizes decorators. For those interested in learning about decorators, https://realpython.com/primer-on-python-decorators/[this] is an excellent article.
-====
-
-.Items to submit
-====
-- Python code used to solve the problem.
-- What happens when you run the function 10 times in a row?
-- Fixed code that will work regardless of the rate limiting.
-- 10 unique `Cf-Request-Id`s printed.
-====
-
-=== Question 3
-
-You now have a great set of tools to be able to scrape pretty much anything you want from the internet. Now all that is left to do is practice. Find a course-appropriate website containing data you would like to scrape. Utilize the tools you've learned about to scrape at least 100 "units" of data. A "unit" is just a representation of what you are scraping. For example, a unit could be a tweet from Twitter, a basketball player's statistics from sportsreference, a product from Amazon, a blog post from your favorite blogger, etc.
-
-The hard requirements are:
-
-- Documented code with thorough comments explaining what the code does.
-- At least 100 "units" scraped.
-- The data must be from multiple web pages.
-- Write at least 1 function (with a docstring) to help you scrape (a minimal skeleton is sketched after this list).
-- A clear explanation of what your scraper scrapes, challenges you encountered (if any) and how you overcame them, and a sample of your data printed out (for example a `head` of a pandas dataframe containing the data).
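-
-Your helper function(s) do not need to be fancy. Below is a minimal skeleton of a documented scraping helper; the URL you pass in and the xpath expression are placeholders that you would adapt to whatever site you choose:
-
-[source,python]
-----
-import requests
-import lxml.html
-
-def scrape_headings(url):
-    """
-    Scrape the text of every level-2 heading on the given page.
-
-    Args:
-        url: the full URL of the page to scrape.
-
-    Returns:
-        A list of strings, one per heading found on the page.
-    """
-    resp = requests.get(url)
-    tree = lxml.html.fromstring(resp.text)
-    return [el.text_content().strip() for el in tree.xpath("//h2")]
-----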
-
-.Items to submit
-====
-- Python code that scrapes 100 units of data (with thorough comments explaining what the code does).
-- The data must be from more than a single web page.
-- 1 or more functions (with docstrings) used to help you scrape/parse data.
-- Clear documentation and explanation of what your scraper scrapes, challenges you encountered (if any) and how you overcame them, and a sample of your data printed out (for example using the `head` of a dataframe containing the data).
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project06.adoc b/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project06.adoc
deleted file mode 100644
index 825c59d0e..000000000
--- a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project06.adoc
+++ /dev/null
@@ -1,90 +0,0 @@
-= STAT 29000: Project 6 -- Spring 2021
-
-**Motivation:** Being able to analyze and create good visualizations is a skill that is invaluable in _many_ fields. It can be pretty fun too! In this project, we are going to dive into `matplotlib` with an open project.
-
-**Context:** We've been working hard all semester and learning a lot about web scraping. In this project we are going to ask you to examine some plots, write a little bit, and use your creative energies to create good visualizations about the flight data using the go-to plotting library for many, `matplotlib`. In the next project, we will continue to learn about and become comfortable using `matplotlib`.
-
-**Scope:** python, visualizing data
-
-.Learning objectives
-****
-- Demonstrate the ability to create basic graphs with default settings.
-- Demonstrate the ability to modify axes labels and titles.
-- Demonstrate the ability to customize a plot (color, shape/linetype).
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here].
-
-== Dataset
-
-The following questions will use the dataset found in Scholar:
-
-`/class/datamine/data/flights/*.csv` (all csv files)
-
-== Questions
-
-=== Question 1
-
-http://stat-computing.org/dataexpo/2009/posters/[Here] are the results from the 2009 Data Expo poster competition. The object of the competition was to visualize interesting information from the flights dataset. Examine all 8 posters and write a single sentence for each poster with your first impression(s). An example of an impression that will not get full credit would be: "My first impression is that this poster is bad and doesn't look organized.". An example of an impression that will get full credit would be: "My first impression is that the author had a good visualization-to-text ratio and it seems easy to follow along.".
-
-++++
-
-++++
-
-.Items to submit
-====
-- 8 bullets, each containing a sentence with the first impression of the 8 visualizations. Order should be "first place" to "honourable mention", followed by "other posters" in the given order. Or, label which graphic each sentence is about.
-====
-
-=== Question 2
-
-https://www.amazon.com/dp/0985911123/[Creating More Effective Graphs] by Dr. Naomi Robbins and https://www.amazon.com/Elements-Graphing-Data-William-Cleveland/dp/0963488414/ref=sr_1_1?dchild=1&keywords=elements+of+graphing+data&qid=1614013761&sr=8-1[The Elements of Graphing Data] by Dr. William Cleveland at Purdue University are two excellent books about data visualization.
Read the following excerpts from the books (respectively), and list 2 things you learned, or found interesting from _each_ book. - -- https://thedatamine.github.io/the-examples-book/files/CreatingMoreEffectiveGraphs.pdf[Excerpt 1] -- https://thedatamine.github.io/the-examples-book/files/ElementsOfGraphingData.pdf[Excerpt 2] - -++++ - -++++ - -.Items to submit -==== -- Two bullets for each book with items you learned or found interesting. -==== - -=== Question 3 - -Of the 7 posters with at least 3 plots and/or maps, choose 1 poster that you think you could improve upon or "out plot". Create 4 plots/maps that either: - -. Improve upon a plot from the poster you chose, or -. Show a completely different plot that does a good job of getting an idea or observation across, or -. Ruin a plot. Purposefully break the best practices you've learned about in order to make the visualization misleading. (limited to 1 of the 4 plots) - -For each plot/map where you choose to do (1), include 1-2 sentences explaining what exactly you improved upon and how. Point out some of the best practices from the 2 provided texts that you followed. For each plot/map where you choose to do (2), include 1-2 sentences explaining your graphic and outlining the best practices from the 2 texts that you followed. For each plot/map where you choose to do (3), include 1-2 sentences explaining what you changed, what principle it broke, and how it made the plot misleading or worse. - -While we are not asking you to create a poster, please use RMarkdown to keep your plots, code, and text nicely formatted and organized. The more like a story your project reads, the better. In this project, we are restricting you to use `matplotlib` in Python. While there are many interesting plotting packages like `plotly` and `plotnine`, we really want you to take the time to dig into `matplotlib` and learn as much as you can. - -++++ - -++++ - -.Items to submit -==== -- All associated Python code you used to wrangling the data and create your graphics. -- 4 plots, with at least 4 associated RMarkdown code chunks. -- 1-2 sentences per plot explaining what exactly you improved upon, what best practices from the texts you used, and how. If it is a brand new visualization, describe and explain your graphic, outlining the best practices from the 2 texts that you followed. If it is the ruined plot you chose, explain what you changed, what principle it broke, and how it made the plot misleading or worse. -==== - -=== Question 4 - -Now that you've been exploring data visualization, copy, paste, and update your first impressions from question (1) with your updated impressions. Which impression changed the most, and why? - -++++ - -++++ - -.Items to submit -==== -- 8 bullets with updated impressions (still just a sentence or two) from question (1). -- A sentence explaining which impression changed the most and why. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project07.adoc b/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project07.adoc deleted file mode 100644 index df33201e7..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project07.adoc +++ /dev/null @@ -1,133 +0,0 @@ -= STAT 29000: Project 7 -- Spring 2021 - -**Motivation:** Being able to analyze and create good visualizations is a skill that is invaluable in _many_ fields. It can be pretty fun too! 
As you probably noticed in the previous project, `matplotlib` can be finicky -- certain types of plots are really easy to create, while others are not. For example, you would think changing the color of a boxplot would be easy to do in `matplotlib`, perhaps we just need to add an option to the function call. As it turns out, this isn't so straightforward (as illustrated at the end of [this section](#p-matplotlib-boxplot)). Occasionally this will happen and that is when packages like `seaborn` or `plotnine` (both are packages built using `matplotlib`) can be good. In this project we will explore this a little bit, and learn about some useful `pandas` functions to help shape your data in a format that any given package requires. - -**Context:** In the next project, we will continue to learn about and become comfortable using `matplotlib`, `seaborn`, and `plotnine`. - -**Scope:** python, visualizing data - -.Learning objectives -**** -- Demonstrate the ability to create basic graphs with default settings. -- Demonstrate the ability to modify axes labels and titles. -- Demonstrate the ability to customize a plot (color, shape/linetype). -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/apple/health/watch_dump.xml` - -== Questions - -=== Question 1 - -In an earlier project we explored some XML data in the form of an Apple Watch data dump. Most health-related apps give you some sort of graph or set of graphs as an output. Use any package you want to parse the XML data. `Record` is a dense category in this dataset. Each `Record` has an attribute called `creationDate`. Create a barplot of the number of `Records` per day. Make sure your plot is polished, containing proper labels and good colors. - -[TIP] -==== -You could start by parsing out the required data into a `pandas` dataframe or series. -==== - -[TIP] -==== -The `groupby` method is one of the most useful `pandas` methods. It allows you to quickly perform operations on groups of data. -==== - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code (including the graphic). -==== - -=== Question 2 - -The plot in question 1 should look bimodal. Let's focus only on the first apparent group of readings. Create a new dataframe containing only the readings for the time period from 9/1/2017 to 5/31/2019. How many `Records` are there in that time period? - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 3 - -It is hard to discern weekly patterns (if any) based on the graphics created so far. For the period of time in question 2, create a labeled bar plot for the count of `Record` by day of the week. What (if any) discernable patterns are there? Make sure to include the labels provided below: - -[source,python] ----- -labels = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] ----- - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code (including the graphic). -==== - -=== Question 4 - -Create a `pandas` dataframe containing the following data from `watch_dump.xml`: - -- A column called `bpm` with the `bpm` (beats per minute) of the `InstantaneousBeatsPerMinute`. 
-- A column called `time` with the `time` of each individual `bpm` reading in `InstantaneousBeatsPerMinute`. -- A column called `date` with the date. -- A column called `dayofweek` with the day of the week. - -[TIP] -==== -You may want to use `pd.to_numeric` to convert the `bpm` column to a numeric type. -==== - -[TIP] -==== -This is one way to convert the numbers 0-6 to days of the week: - -[source,python] ----- -myDF['dayofweek'] = myDF['dayofweek'].map({0:"Mon", 1:"Tue", 2:"Wed", 3:"Thu", 4:"Fri", 5: "Sat", 6: "Sun"}) ----- -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -Create a heatmap using `seaborn`, where the y-axis shows the day of the week ("Mon" - "Sun"), the x-axis shows the hour, and the values on the interior of the plot are the average `bpm` by hour by day of the week. - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code (including the graphic). -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project08.adoc b/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project08.adoc deleted file mode 100644 index 013632710..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project08.adoc +++ /dev/null @@ -1,452 +0,0 @@ -= STAT 29000: Project 8 -- Spring 2021 - -**Motivation:** Python is an https://www.geeksforgeeks.org/internal-working-of-python/[interpreted language] (as opposed to a compiled language). In a compiled language, you are (mostly) unable to run and evaluate a single instruction at a time. In Python (and R -- also an interpreted language), we can run and evaluate a line of code easily using a https://en.wikipedia.org/wiki/Read-eval-print_loop[repl]. In fact, this is the way you've been using Python to date -- selecting and running pieces of Python code. Other ways to use Python include creating a package (like numpy, pandas, and pytorch), and creating scripts. You can create powerful CLI's (command line interface) tools using Python. In this project, we will explore this in detail and learn how to create scripts that accept options and input and perform tasks. - -**Context:** This is the first (of two) projects where we will learn about creating and using Python scripts. - -**Scope:** python - -.Learning objectives -**** -- Write a python script that accepts user inputs and returns something useful. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -Often times the deliverable part of a project isn't custom built packages or modules, but a script. A script is a .py file with python code written inside to perform action(s). Python scripts are incredibly easy to run, for example, if you had a python script called `question01.py`, you could run it by opening a terminal and typing: - -[source,bash] ----- -python3 /path/to/question01.py ----- - -The python interpreter then looks for the scripts entrypoint, and starts executing. You should read https://realpython.com/python-main-function/[this] article about the main function and python scripts. In addition, read https://realpython.com/run-python-scripts/#using-the-script-filename[this] section, paying special attention to the shebang. 
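-
-To tie those two readings together, the following is a bare-bones sketch of a script with a shebang and an entrypoint. It is generic scaffolding (not the solution to question 1), but every script in this project will follow this same shape:
-
-[source,python]
-----
-#!/usr/bin/env python3
-
-def main():
-    # the "real work" of the script goes here
-    print("Hello from a Python script!")
-
-if __name__ == "__main__":
-    # runs only when the file is executed as a script, not when it is imported
-    main()
-----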
- -Create a Python script called `question01.py` in your `$HOME` directory. Use the second shebang from the article: `#!/usr/bin/env python3`. When run, `question01.py` should use the `sys` package to print the location of the interpreter being used to run the script. For example, if we started a Python interpreter in RStudio using the following code: - -[source,r] ----- -datamine_py() -reticulate::repl_python() ----- - -Then, we could print the interpreter by running the following Python code one line at a time: - -[source,python] ----- -import sys -print(sys.executable) ----- - -Since we are using our Python environment, you should see this result: `/class/datamine/apps/python/f2020-s2021/env/bin/python3`. This is the fully qualified path of the Python interpreter we've been using for this course. - -Restart your R session by clicking `Session > Restart R`, navigate to the "Terminal" tab in RStudio, and run the following lines in the terminal. What is the output? - -[source,bash] ----- -# this command gives execute permissions to your script -- this only needs to be run once -chmod +x $HOME/question01.py -# execute your script -$HOME/question01.py ----- - -[IMPORTANT] -==== -You can run bash code using a bash code chunk just like you would an R or Python code chunk. Simply replace "python" with "bash" in the code chunk options. -==== - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- The entire `question01.py` script's contents in a Python code chunk with chunk option "eval=F". -- Output from running your code copy and pasted as text. -==== - -=== Question 2 - -Was your output in question (1) expected? Why or why not? - -When we restarted the R session, our `datamine_py`'s effects were reversed, and the default Python interpreter is no longer our default when running `python3`. It is very common to have a multitude of Python environments available to use. But, when we are running a Python script it is _not_ convenient to have to run various commands (in our case, the single `datamine_py` command) in order to get our script to run the way we want it to run. In addition, if our script used a set of packages that were not installed outside of our course environment, the script would fail. - -In this project, since our focus is more on how to write scripts and make them work as expected, we will have some fun and experiment with some pre-trained state of the art machine learning models. - -The following function accepts a string called `sentence` as an input and returns the sentiment of the sentence, "POSITIVE" or "NEGATIVE". - -[source,python] ----- -from transformers import pipeline -def get_sentiment(model, sentence: str) -> str: - result = model(sentence) - - return result[0].get('label') -model = pipeline('sentiment-analysis') -print(get_sentiment(model, 'This is really great!')) -print(get_sentiment(model, 'Oh no! Thats horrible!')) ----- - -Include `get_sentiment` (including the import statement) in a new script, `question02.py` script. Note that you do not have to _use_ `get_sentiment` anywhere, just include it for now. Go to the terminal in RStudio and execute your script. What happens? - -Remember, since our current shebang is `#!/usr/bin/env python3`, if our script uses one or more packages that are not installed in the current environment environment, the script will fail. This is what is happening. The `transformers` package that we use is not installed in the current environment. 
We do, however, have an environment that _does_ have it installed, and it is located on Scholar at: `/class/datamine/apps/python/pytorch2021/env/bin/python`. Update the script's shebang and try to run it again. Does it work now? - -Depending on the state of your current environment, the original shebang, `#!/usr/bin/env python3` will use the same Python interpreter and environment that is currently set to `python3` (run `which python3` to see). If you haven't run `datamine_py`, this will be something like: `/apps/spack/scholar/fall20/apps/anaconda/2020.11-py38-gcc-4.8.5-djkvkvk/bin/python` or `/usr/bin/python`, if you _have_ run `datamine_py`, this will be: `/class/datamine/apps/python/f2020-s2021/env/bin/python`. _Both_ environments lack the `transformers` package. Our other environment whose interpreter lives here: `/class/datamine/apps/python/pytorch2021/env/bin/python` _does_ have this package. The shebang is then critically important for any scripts that want to utilize packages from a specific environment. - -[IMPORTANT] -==== -You can run bash code using a bash code chunk just like you would an R or Python code chunk. Simply replace "python" with "bash" in the code chunk options. -==== - -++++ - -++++ - -.Items to submit -==== -- Sentence explaining why or why not the output from question (1) was expected. -- Sentence explaining what happens when you include `get_sentiment` in your script and try to execute it. -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". -==== - -=== Question 3 - -Okay, great. We now understand that if we want to use packages from a specific environment, we need to modify our shebang accordingly. As it currently stands, our script is pretty useless. Modify the script, in a new script called `question03.py` to accept a single argument. This argument should be a sentence. Your script should then print the sentence, and whether or not the sentence is "POSITIVE" or "NEGATIVE". Use `sys.argv` to accomplish this. Make sure the script functions in the following way: - -[source,bash] ----- -$HOME/question03.py This is a happy sentence, yay! ----- - ----- -Too many arguments. ----- - -[source,bash] ----- -$HOME/question03.py 'This is a happy sentence, yay!' ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE ----- - -[source,bash] ----- -$HOME/question03.py ----- - ----- -./question03.py requires at least 1 argument, "sentence". ----- - -[TIP] -==== -One really useful way to exit the script and print a message is like this: - -[source,python] ----- -import sys -sys.exit(f"{__file__} requires at least 1 argument, 'sentence'") ----- -==== - -[IMPORTANT] -==== -You can run bash code using a bash code chunk just like you would an R or Python code chunk. Simply replace "python" with "bash" in the code chunk options. -==== - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". -- Output from running your script with the given examples. -==== - -=== Question 4 - -If you look at the man pages for a command line tool like `awk` or `grep` (you can get these by running `man awk` or `man grep` in the terminal), you will see that typically CLI's have a variety of options. 
Options usually follow the following format: - -[source,bash] ----- -grep -i 'ok' some_file.txt ----- - -However, often times you have 2 ways you can use an option -- either with the short form (for example `-i`), or long form (for example `-i` is the same as `--ignore-case`). Sometimes options can get values. If options don't have values, you can assume that the presence of the flag means `TRUE` and the lack means `FALSE`. When using short form, the value for the option is separated by a space (for example `grep -f my_file.txt`). When using long form, the value for the option is separated by an equals sign (for example `grep --file=my_file.txt`). - -Modify your script (as a new `question04.py`) to include an option called `score`. When active (`question04.py --score` or `question04.py -s`), the script should return both the sentiment, "POSITIVE" or "NEGATIVE" and the probability of being accurate. Make sure that you modify your checks from question 3 to continue to work whenever we use `--score` or `-s`. Some examples below: - -[source,bash] ----- -$HOME/question04.py 'This is a happy sentence, yay!' ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE ----- - -[source,bash] ----- -$HOME/question04.py --score 'This is a happy sentence, yay!' ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question04.py -s 'This is a happy sentence, yay!' ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question04.py 'This is a happy sentence, yay!' -s ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question04.py 'This is a happy sentence, yay!' --score ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question04.py 'This is a happy sentence, yay!' --value ----- - ----- -Unknown option(s): ['--value'] ----- - -[source,bash] ----- -$HOME/question04.py 'This is a happy sentence, yay!' --value --score ----- - ----- -Too many arguments. ----- - -[source,bash] ----- -$HOME/question04.py ----- - ----- -question04.py requires at least 1 argument, "sentence" ----- - -[source,bash] ----- -$HOME/question04.py --score ----- - ----- -./question04.py requires at least 1 argument, "sentence". No sentence provided. ----- - -[source,bash] ----- -$HOME/question04.py 'This is one sentence' 'This is another' ----- - ----- -./question04.py requires only 1 sentence, but 2 were provided. ----- - -[IMPORTANT] -==== -You can run bash code using a bash code chunk just like you would an R or Python code chunk. Simply replace "python" with "bash" in the code chunk options. -==== - -[TIP] -==== -Experiment with the provided function. You will find the probability of being accurate is already returned by the model. -==== - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". -- Output from running your script with the given examples. -==== - -=== Question 5 - -Wow, that is an extensive amount of logic for for a single option. Luckily, Python has the `argparse` package to help you build CLI's and handle situations like this. 
You can find the documentation for argparse https://docs.python.org/3/library/argparse.html[here] and a nice little tutorial https://docs.python.org/3/howto/argparse.html[here]. Update your script (as a new `question05.py`) using `argparse` instead of custom logic. Specifically, add 1 positional argument called "sentence", and 1 optional argument "--score" or "-s". You should handle the following scenarios: - -[source,bash] ----- -$HOME/question05.py 'This is a happy sentence, yay!' ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE ----- - -[source,bash] ----- -$HOME/question05.py --score 'This is a happy sentence, yay!' ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question05.py -s 'This is a happy sentence, yay!' ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question05.py 'This is a happy sentence, yay!' -s ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question05.py 'This is a happy sentence, yay!' --score ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question05.py 'This is a happy sentence, yay!' --value ----- - ----- -usage: question05.py [-h] [-s] sentence -question05.py: error: unrecognized arguments: --value ----- - -[source,bash] ----- -$HOME/question05.py 'This is a happy sentence, yay!' --value --score ----- - ----- -usage: question05.py [-h] [-s] sentence -question05.py: error: unrecognized arguments: --value ----- - -[source,bash] ----- -$HOME/question05.py ----- - ----- -usage: question05.py [-h] [-s] sentence -positional arguments: - sentence -optional arguments: - -h, --help show this help message and exit - -s, --score display the probability of accuracy ----- - -[source,bash] ----- -$HOME/question05.py --score ----- - ----- -usage: question05.py [-h] [-s] sentence -question05.py: error: too few arguments ----- - -[source,bash] ----- -$HOME/question05.py 'This is one sentence' 'This is another' ----- - ----- -usage: question05.py [-h] [-s] sentence -question05.py: error: unrecognized arguments: This is another ----- - -[TIP] -==== -A good way to print the help information if no arguments are provided is: - -[source,python] ----- -if len(sys.argv) == 1: - parser.print_help() - parser.exit() ----- -==== - -[IMPORTANT] -==== -Include the bash code chunk option `error=T` to enable RMarkdown to knit and output errors. -==== - -[IMPORTANT] -==== -You can run bash code using a bash code chunk just like you would an R or Python code chunk. Simply replace "python" with "bash" in the code chunk options. -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project09.adoc b/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project09.adoc deleted file mode 100644 index 1b6f853c3..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project09.adoc +++ /dev/null @@ -1,318 +0,0 @@ -= STAT 29000: Project 9 -- Spring 2021 - -**Motivation:** In the previous project you worked through some common logic needed to make a good script. By the end of the project `argparse` was (hopefully) a welcome package to be able to use. 
In this project, we are going to continue to learn about `argparse` and create a CLI for the https://data.whin.org[WHIN Data Portal]. In doing so, not only will we get to practice using `argparse`, but you will also get to learn about using an API to retrieve data. An API (application programming interface) is a common way to retrieve structured data from a company or resource. It is common for large companies like Twitter, Facebook, Google, etc. to make certain data available via API's, so it is important to get some exposure. - -**Context:** This is the second (of two) projects where we will learn about creating and using Python scripts. - -**Scope:** python - -.Learning objectives -**** -- Write a python script that accepts user inputs and returns something useful. -- Interact with an API to retrieve data. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will involve retrieving data using an API. Instructions and hints will be provided as we go. - -== Questions - -=== Question 1 - -WHIN (Wabash Heartland Innovation Network) has deployed hundreds of weather stations across the region so farmers can use the data collected to become more efficient, save time, and increase yields. WHIN has kindly granted access to 20+ public-facing weather stations for educational purposes. - -Navigate to https://data.whin.org/data/current-conditions, and click on the "CREATE ACCOUNT" button in the middle of the screen: - -![](./images/p9_01.png) - -Click on "I'm a student or educator": - -![](./images/p9_02.png) - -Enter your information. For "School or Organization" please enter "Purdue University". For "Class or project", please put "The Data Mine Project 9". For the description, please put "We are learning about writing scripts by writing a CLI to fetch data from the WHIN API." Please use your purdue.edu email address. Once complete, click "Next". - -Carefully read the LICENSE TERMS before accepting, and confirm your email address if needed. Upon completion, navigate here: https://data.whin.org/data/current-conditions - -Read about the API under "API Usage". An endpoint is the place (in this case the end of a URL (which can be referred to as the URI)) that you can use to access/delete/update/etc. a given resource depending on the HTTP method used. What are the 3 endpoints of this API? - -Write and run a script called `question01.py` that, when run, tries to print the current listing of the weather stations. Instead of printing what you think it should print, it will print something else. What happened? - -[source,bash] ----- -$HOME/question01.py ----- - -[TIP] -==== -You can use the `requests` library to run the HTTP GET method on the endpoint. For example: -==== - -[source,python] ----- -import requests -response = requests.get("https://datamine.purdue.edu/") -print(response.json()) ----- - -[TIP] -==== -We want to use our regular course environment, therefore, make sure to use the following shebang: `#!/class/datamine/apps/python/f2020-s2021/env/bin/python` -==== - -++++ - -++++ - -.Items to submit -==== -- List the 3 endpoints for this API. -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". -- Output from running your script with the given examples. -==== - -=== Question 2 - -In question 1, we quickly realize that we are missing a critical step -- authentication! 
Recall that authentication is the process of a system understanding _who_ a person is, authorization is the process of telling whether or not somebody (or something) has permissions to access/modify/delete/etc. a resource. When we make GET requests to https://data.whin.org/api/weather/stations, or any other endpoint, the server that returns the data we are trying to access has no clue who we are, which explains the result from question 1. - -While there are many methods of authentication, WHIN is using Bearer tokens. Navigate https://data.whin.org/account[here]. Take a look at your account info. You should see a large chunk of random numbers and text. This is your bearer token that you can use for authentication. The bearer token is to be sent in the "Authorization" header of the request. For example: - -[source,python] ----- -import requests -my_headers = {"Authorization": "Bearer LDFKGHSOIDFRUTRLKJNXDFGT"} -response = requests.get("my_url", headers = my_headers) ----- - -Update your script (as a new script called `question02.py`), and test it out again to see if we get the expected results now. `question02.py` should only print the first 5 results. - -A couple important notes: - -- The bearer token should be taken care of like a password. You do NOT want to share this, ever. -- There is an inherent risk in saving code like the code shown above. What if you accidentally upload it to GitHub? Then anyone with access could potentially read and use your token. - -How can we include the token in our code without typing it in our code? The typical way to handle this is to use environment variables and/or a file containing the information that is specifically NOT shared unless necessary. For example, create a file called `.env` in your home directory, with the following contents: - -[source,txt] ----- -MY_BEARER_TOKEN=aslgdkjn304iunglsejkrht09 -SOME_OTHER_VARIABLE=some_other_value ----- - -In this file, replace the "aslgdkj..." part with you actual token and save the file. Then make sure only YOU can read and write to this file by running the following in a terminal: - -[source,bash] ----- -chmod 600 $HOME/.env ----- - -Now, we can use a package called `dotenv` to load the variables in the `$HOME/.env` file into the environment. We can then use the `os` package to get the environment variables. For example: - -[source,python] ----- -import os -from dotenv import load_dotenv -# This function will load the .env file variables from the same directory as the script into the environment -load_dotenv() -# We can now use os.getenv to get the important information without showing anything. -# Now, all anybody reading the code sees is "os.getenv('MY_BEARER_TOKEN')" even though that is replaced by the actual -# token when the code is run, cool! -my_headers = {"Authorization": f"Bearer {os.getenv('MY_BEARER_TOKEN')}"} ----- - -Update `question02.py` to use `dotenv` and `os.getenv` to get the token from the local `$HOME/.env` file. Test out your script: - -[source,bash] ----- -$HOME/question02.py ----- - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". -- Output from running your script with the given example. -==== - -=== Question 3 - -That's not so bad! We now know how to retrieve data from the API as well as load up variables from our environment rather than insecurely just pasting them in our code, great! 
- -A query parameter is (more or less) some extra information added at the end of the endpoint. For example, the following url has a query parameter called `param` and value called `value`: \https://example.com/some_resource?param=value. You could even add more than one query parameter as follows: \https://example.com/some_resource?param=value&second_param=second_value -- as you can see, now we have another parameter called `second_param` with a value of `second_value`. While the query parameters begin with a `?`, each subsequent parameter is added using `&`. - -Query parameters can be optional or required. API's will sometimes utilize query parameters to filter or fine-tune the returned results. Look at the documentation for the `/api/weather/station-daily` endpoint. Use your newfound knowledge of query parameters to update your script (as a new script called `question03.py`) to retrieve the data for station with id `150` on `2021-01-05`, and print the first 5 results. Test out your script: - -[source,bash] ----- -$HOME/question03.py ----- - -++++ - -++++ - -.Items to submit -==== -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". -- Output from running your script with the given example. -==== - -=== Question 4 - -Excellent, now let's build our CLI. Call the script `whin.py`. Use your knowledge of `requests`, `argparse`, and API's to write a CLI that replicates the behavior shown below. For convenience, only print the first 2 results for all output. - -[TIP] -==== -- In general, there will be 3 commands: `stations`, `daily`, and `cc` (for current condition). -- You will want to create a subparser for each command: `stations_parser`, `current_conditions_parser`, and `daily_parser`. -- The `daily_parser` will have 2 _position_, _required_ arguments: `station_id` and `date`. -- The `current_conditions_parser` will have 2 _optional_ arguments of type `str`: `--center`/`-c` and `--radius`/`-r`. -- If only one of `--center` or `--radius` is present, you should use `sys.exit` to print a message saying "Need both center AND radius, or neither.". -- To create a subparser, just do the following: - -[source,python] ----- -parser = argparse.ArgumentParser() -subparsers = parser.add_subparsers(help="possible commands", dest="command") -my_subparser = subparsers.add_parser("my_command", help="my help message") -my_subparser.add_argument("--my-option", type=str, help="some option") -args = parser.parse_args() ----- - -- Then, you can access which command was run with `args.command` (which in this case would only have 1 possible value of `my_command`), and access any parser or subparsers options with `args`, for example, `args.my_option`. -==== - -[source,bash] ----- -$HOME/whin.py ----- ----- -usage: whin.py [-h] {stations,cc,daily} ... 
-positional arguments: - {stations,cc,daily} possible commands - stations list the stations - cc list the most recent data from each weather station - daily list data from a given day and station -optional arguments: - -h, --help show this help message and exit ----- - -[TIP] -==== -A good way to print the help information if no arguments are provided is: - -[source,python] ----- -if len(sys.argv) == 1: - parser.print_help() - parser.exit() ----- -==== - -[source,bash] ----- -$HOME/whin.py stations -h ----- ----- -usage: whin.py stations [-h] -optional arguments: - -h, --help show this help message and exit ----- - -[source,bash] ----- -$HOME/whin.py cc -h ----- ----- -usage: whin.py cc [-h] [-c CENTER] [-r RADIUS] -optional arguments: - -h, --help show this help message and exit - -c CENTER, --center CENTER - return results near this center coordinate, given as a - latitude,longitude pair - -r RADIUS, --radius RADIUS - search distance, in meters, from the center ----- - -[source,bash] ----- -$HOME/whin.py cc ----- ----- -[{'humidity': 90, 'latitude': 40.93894, 'longitude': -86.47418, 'name': 'WHIN001-PULA001', 'observation_time': '2021-03-16T18:45:00Z', 'pressure': '30.051', 'rain': '0', 'rain_inches_last_hour': '0', 'soil_moist_1': 6, 'soil_moist_2': 11, 'soil_moist_3': 14, 'soil_moist_4': 9, 'soil_temp_1': 42, 'soil_temp_2': 40, 'soil_temp_3': 40, 'soil_temp_4': 41, 'solar_radiation': 203, 'solar_radiation_high': 244, 'station_id': 1, 'temperature': 40, 'temperature_high': 40, 'temperature_low': 40, 'wind_direction_degrees': '337.5', 'wind_gust_direction_degrees': '22.5', 'wind_gust_speed_mph': 6, 'wind_speed_mph': 3}, {'humidity': 88, 'latitude': 40.73083, 'longitude': -86.98467, 'name': 'WHIN003-WHIT001', 'observation_time': '2021-03-16T18:45:00Z', 'pressure': '30.051', 'rain': '0', 'rain_inches_last_hour': '0', 'soil_moist_1': 6, 'soil_moist_2': 5, 'soil_moist_3': 6, 'soil_moist_4': 4, 'soil_temp_1': 40, 'soil_temp_2': 39, 'soil_temp_3': 39, 'soil_temp_4': 40, 'solar_radiation': 156, 'solar_radiation_high': 171, 'station_id': 3, 'temperature': 40, 'temperature_high': 40, 'temperature_low': 39, 'wind_direction_degrees': '337.5', 'wind_gust_direction_degrees': '337.5', 'wind_gust_speed_mph': 8, 'wind_speed_mph': 3}] ----- - -[IMPORTANT] -==== -Your values may be different because they are _current_ conditions. -==== - -[source,bash] ----- -$HOME/whin.py cc --radius=10000 ----- ----- -Need both center AND radius, or neither. ----- - -[source,bash] ----- -$HOME/whin.py cc --center=40.4258686,-86.9080654 ----- ----- -Need both center AND radius, or neither. 
----- - -[source,bash] ----- -$HOME/whin.py cc --center=40.4258686,-86.9080654 --radius=10000 ----- ----- -[{'humidity': 86, 'latitude': 40.42919, 'longitude': -86.84547, 'name': 'WHIN008-TIPP005 Chatham Square', 'observation_time': '2021-03-16T18:45:00Z', 'pressure': '30.012', 'rain': '0', 'rain_inches_last_hour': '0', 'soil_moist_1': 5, 'soil_moist_2': 5, 'soil_moist_3': 5, 'soil_moist_4': 5, 'soil_temp_1': 42, 'soil_temp_2': 41, 'soil_temp_3': 41, 'soil_temp_4': 42, 'solar_radiation': 191, 'solar_radiation_high': 220, 'station_id': 8, 'temperature': 42, 'temperature_high': 42, 'temperature_low': 42, 'wind_direction_degrees': '0', 'wind_gust_direction_degrees': '22.5', 'wind_gust_speed_mph': 9, 'wind_speed_mph': 3}, {'humidity': 86, 'latitude': 40.38494, 'longitude': -86.84577, 'name': 'WHIN027-TIPP003 EXT', 'observation_time': '2021-03-16T18:45:00Z', 'pressure': '29.515', 'rain': '0', 'rain_inches_last_hour': '0', 'soil_moist_1': 5, 'soil_moist_2': 4, 'soil_moist_3': 4, 'soil_moist_4': 5, 'soil_temp_1': 43, 'soil_temp_2': 42, 'soil_temp_3': 42, 'soil_temp_4': 42, 'solar_radiation': 221, 'solar_radiation_high': 244, 'station_id': 27, 'temperature': 43, 'temperature_high': 43, 'temperature_low': 43, 'wind_direction_degrees': '337.5', 'wind_gust_direction_degrees': '337.5', 'wind_gust_speed_mph': 6, 'wind_speed_mph': 3}] ----- - -[source,bash] ----- -$HOME/whin.py daily ----- ----- -usage: whin.py daily [-h] station_id date -whin.py daily: error: too few arguments ----- - -[source,bash] ----- -$HOME/whin.py daily 150 2021-01-05 ----- ----- -[{'humidity': 96, 'latitude': 41.00467, 'longitude': -86.68428, 'name': 'WHIN058-PULA007', 'observation_time': '2021-01-05T05:00:00Z', 'pressure': '29.213', 'rain': '0', 'rain_inches_last_hour': '0', 'soil_moist_1': 5, 'soil_moist_2': 6, 'soil_moist_3': 7, 'soil_moist_4': 5, 'soil_temp_1': 33, 'soil_temp_2': 34, 'soil_temp_3': 35, 'soil_temp_4': 35, 'solar_radiation': 0, 'solar_radiation_high': 0, 'station_id': 150, 'temperature': 31, 'temperature_high': 31, 'temperature_low': 31, 'wind_direction_degrees': '270', 'wind_gust_direction_degrees': '292.5', 'wind_gust_speed_mph': 13, 'wind_speed_mph': 8}, {'humidity': 96, 'latitude': 41.00467, 'longitude': -86.68428, 'name': 'WHIN058-PULA007', 'observation_time': '2021-01-05T05:15:00Z', 'pressure': '29.207', 'rain': '1', 'rain_inches_last_hour': '0', 'soil_moist_1': 5, 'soil_moist_2': 6, 'soil_moist_3': 7, 'soil_moist_4': 5, 'soil_temp_1': 33, 'soil_temp_2': 34, 'soil_temp_3': 35, 'soil_temp_4': 35, 'solar_radiation': 0, 'solar_radiation_high': 0, 'station_id': 150, 'temperature': 31, 'temperature_high': 31, 'temperature_low': 31, 'wind_direction_degrees': '270', 'wind_gust_direction_degrees': '292.5', 'wind_gust_speed_mph': 14, 'wind_speed_mph': 9}] ----- - -++++ - -++++ - -.Items to submit -==== -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". -- Output from running your script with the given examples. -==== - -=== Question 5 - -There are a multitude of improvements and/or features that we could add to `whin.py`. Customize your script (as a new script called `question05.py`), to either do something new, or fix a scenario that wasn't covered in question 4. Be sure to include 1-2 sentences that explains exactly what your modification does. Demonstrate the feature by running it in a bash code chunk. - -.Items to submit -==== -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". 
-- Output from running your script with the given examples. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project10.adoc b/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project10.adoc deleted file mode 100644 index 001d47215..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project10.adoc +++ /dev/null @@ -1,200 +0,0 @@ -= STAT 29000: Project 10 -- Spring 2021 - -**Motivation:** The use of a suite of packages referred to as the `tidyverse` is popular with many R users. It is apparent just by looking at `tidyverse` R code, that it varies greatly in style from typical R code. It is useful to gain some familiarity with this collection of packages, in case you run into a situation where these packages are needed -- you may even find that you enjoy using them! - -**Context:** We've covered a lot of ground so far this semester, and almost completely using Python. In this next series of projects we are going to switch back to R with a strong focus on the `tidyverse` (including `ggplot`) and data wrangling tasks. - -**Scope:** R, tidyverse, ggplot - -.Learning objectives -**** -- Explain the differences between regular data frames and tibbles. -- Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems. -- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows. -- Group data and calculate aggregated statistics using group_by, mutate, and transform functions. -- Demonstrate the ability to create basic graphs with default settings, in `ggplot`. -- Demonstrate the ability to modify axes labels and titles. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -The `tidyverse` consists of a variety of packages, including, but not limited to: `ggplot2`, `dplyr`, `tidyr`, `readr`, `purrr`, `tibble`, `stringr`, and `lubridate`. - -One of the underlying premises of the `tidyverse` is getting the data to be https://r4ds.had.co.nz/tidy-data.html#tidy-data-1[tidy]. You can read a lot more about this in Hadley Wickham's excellent book, https://r4ds.had.co.nz[R for Data Science]. - -There is an excellent graphic https://r4ds.had.co.nz/introduction.html#what-you-will-learn[here] that illustrates a general workflow for data science projects: - -. Import -. Tidy -. Iterate on, to gain understanding: - 1. Transform - 2. Visualize - 3. Model -. Communicate - -This is a good general outline of how a project could be organized, but depending on the project or company, this could vary greatly and change as the goals of a project change. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/okcupid/filtered/*.csv` - -== Questions - -=== Question 1 - -Let's (more or less) follow the guidelines given above. The first step is to https://r4ds.had.co.nz/data-import.html[import] the data. There are two files: `questions.csv`, and `users.csv`. Read https://r4ds.had.co.nz/data-import.html[this section], and use what you learn to read in the two files into `questions` and `users`, respectively. Which functions from the `tidyverse` did you use and why? - -[TIP] -==== -Its easy to load up the `tidyverse` packages: - -[source,r] ----- -library(tidyverse) ----- -==== - -[TIP] -==== -Just because a file has the `.csv` extension does _not_ mean that is it comma separated. 
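For instance, if a file turned out to be semicolon-separated (a common variant), `readr` offers `read_csv2()`, and `read_delim()` lets you state the delimiter explicitly. The file name below is just a placeholder -- peek at the raw file first to see what you are actually dealing with:

[source,r]
----
# look at the first few raw lines to see which delimiter the file really uses
readLines("some_file.csv", n = 3)

# semicolon-separated files
myDF <- read_csv2("some_file.csv")

# any delimiter, stated explicitly
myDF <- read_delim("some_file.csv", delim = ";")
----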
-==== - -[TIP] -==== -Make sure to print all `tibble` after reading them in to ensure that they were read in correctly. If they were not, use a different function (from `tidyverse`) to read in the data. -==== - -[TIP] -==== -`questions` should be 2281 x 10 and `users` should be 68371 x 2284 -==== - -.Items to submit -==== -- R code used to solve the problem. -- `head` of each dataset, `users` and `questions`. -- 1 sentence explaining which functions you used (from `tidyverse`) and why. -==== - -=== Question 2 - -You may recall that the function `read.csv` from base R reads data into a data.frame by default. In the `tidyverse`, `readr` functions read the data into a `tibble` instead. Read https://r4ds.had.co.nz/tibbles.html[this section]. To summarize, some important features that are true for `tibbles` but not necessarily for data.frames are: - -- Non-syntactic variable names (surrounded by backticks \\`` ` `` ) -- Never changes the type of the inputs (for example converting strings to factors) -- More informative output from printing -- No partial matching -- Simple https://r4ds.had.co.nz/tibbles.html#subsetting[subsetting] - -Great, the next step in our outline is to make the data "tidy". Read https://r4ds.had.co.nz/tidy-data.html#tidy-data-1[this section]. Okay, let's say, for instance, that we wanted to create a `tibble` with the following columns: `user`, `question`, `question_text`, `selected_option`, `race`, `gender2`, `gender_orientation`, `n`, and `keywords`. As you can imagine, the "tidy" format, while great for analysis, would _not_ be great for storage as there would be a row for each question for each user, at least. Columns like `gender2` and `race` don't change for a user, so we end up with a lot of repeated values. - -Okay, we don't need to analyze all 68000 users at once, let's instead, take a random sample of 2200 users, and create a "tidy" `tibble` as described above. After all, we want to see why this format is useful! While trying to figure out how to do this may seem daunting at first, it is actually not _so_ bad: - -First, we convert the `users` tibble to long form, so each row represents 1 answer to 1 questions from 1 user: - -[source,r] ----- -# Add an "id" columns to the users data -users$id <- 1:nrow(users) -# To ensure we get the same random sample, run the set.seed line -# before every time you run the following line -set.seed(12345) -columns_to_pivot <- 1:2278 -users_sample_long <- users[sample(nrow(users), 2200),] %>% - mutate_at(columns_to_pivot, as.character) %>% # This converts all of our columns in columns_to_pivot to strings - pivot_longer(cols = columns_to_pivot, names_to="question", values_to = "selected_option") # The old qXXXX columns are now values in the "question" column. ----- - -Next, we want to merge our data from the `questions` tibble with our `users_sample_long` tibble, into a new table we will call `myDF`. How many rows and columns are in `myDF`? - -[source,r] ----- -myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X") ----- - -.Items to submit -==== -- R code used to solve the problem. -- The number of rows and columns in `myDF`. -- The `head` of `myDF`. -==== - -=== Question 3 - -Excellent! Now, we have a nice tidy dataset that we can work with. You may have noticed some odd syntax `%>%` in the code provided in the previous question. `%>%` is the piping operator in R added by the `magittr` package. It works pretty much just like `|` does in bash. 
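For instance, the two toy lines below (not project data) produce the same result:

[source,r]
----
# without the pipe: calls are nested inside one another
round(mean(c(1.2, 2.7, 3.9)), 1)

# with the pipe: each result becomes the first argument of the next call
c(1.2, 2.7, 3.9) %>% mean() %>% round(1)
----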
It "feeds" the output from the previous bit of code to the next bit of code. It is extremely common practice to use this operator in the `tidyverse`. - -Observe the `head` of `myDF`. Notice how our `question` column has the value `d_age`, `text` has the content "Age", and `selected_option` (the column that shows the "answer" the user gave), has the actual age of the user. Wouldn't it be better if our `myDF` had a new column called `age` instead of `age` being an answer to a question? - -Modify the code provided in question 2 so `age` ends up being a column in `myDF` with the value being the actual age of the user. - -[TIP] -==== -Pay close attention to https://tidyr.tidyverse.org/reference/pivot_longer.html[`pivot_longer`]. You will need to understand what this function is doing to fix this. -==== - -[TIP] -==== -You can make a single modification to 1 line to accomplish this. Pay close attention to the `cols` option in `pivot_longer`. If you include a column in `cols` what happens? If you exclude a columns from `cols` what happens? Experiment on the following `tibble`, using different values for `cols`, as well as `names_to`, and `values_to`: - -[source,r] ----- -myDF <- tibble( - x=1:3, - y=1, - question1=c("How", "What", "Why"), - question2=c("Really", "You sure", "When"), - question3=c("Who", "Seriously", "Right now") -) ----- -==== - -.Items to submit -==== -- R code used to solve the problem. -- The number of rows and columns in `myDF`. -- The `head` of `myDF`. -==== - -=== Question 4 - -Wow! That is pretty powerful! Okay, it is clear that there are question questions, where the column starts with "q", and other questions, where the column starts with something else. Modify question (3) so all of the questions that _don't_ start with "q" have their own column in `myDF`. Like before, show the number of rows and columns for the new `myDF`, as well as print the `head`. - -.Items to submit -==== -- R code used to solve the problem. -- The number of rows and columns in `myDF`. -- The `head` of `myDF`. -==== - -=== Question 5 - -It seems like we've spent the majority of the project just wrangling our dataset -- that is normal! You'd be incredibly lucky to work in an environment where you recieve data in a nice, neat, perfect format. Let's do a couple basic operations now, to practice. - -https://dplyr.tidyverse.org/reference/mutate.html[`mutate`] is a powerful function in `dplyr`, that is not easy to mimic in Python's `pandas` package. `mutate` adds new columns to your tibble, while preserving your existing columns. It doesn't sound very powerful, but it is. - -Use mutate to create a new column called `generation`. `generation` should contain "Gen Z" for ages [0, 24], "Millenial" for ages [25-40], "Gen X" for ages [41-56], and "Boomers II" for ages [57-66], and "Older" for all other ages. - -.Items to submit -==== -- R code used to solve the problem. -- The number of rows and columns in `myDF`. -- The `head` of `myDF`. -==== - -=== Question 6 - -Use `ggplot` to create a scatterplot showing `d_age` on the x-axis, and `lf_min_age` on the y-axis. `lf_min_age` is the minimum age a user is okay dating. Color the points based on `gender2`. Add a proper title, and labels for the X and Y axes. Use `alpha=.6`. - -[NOTE] -==== -This may take quite a few minutes to create. Before creating a plot with the entire `myDF`, use `myDF[1:10,]`. If you are in a time crunch, the minimum number of points to plot to get full credit is 100, but if you wait, the plot is a bit more telling. 
-==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- The plot produced. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project11.adoc b/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project11.adoc deleted file mode 100644 index 2e16fef19..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project11.adoc +++ /dev/null @@ -1,193 +0,0 @@ -= STAT 29000: Project 11 -- Spring 2021 - -**Motivation:** Data wrangling is the process of gathering, cleaning, structuring, and transforming data. Data wrangling is a big part in any data driven project, and sometimes can take a great deal of time. `tidyverse` is a great, but opinionated, suite of integrated packages to wrangle, tidy and visualize data. It is useful to gain some familiarity with this collection of packages, in case you run into a situation where these packages are needed -- you may even find that you enjoy using them! - -**Context:** We have covered a few topics on the `tidyverse` packages, but there is a lot more to learn! We will continue our strong focus on the `tidyverse` (including `ggplot`) and data wrangling tasks. - -**Scope:** R, tidyverse, ggplot - -.Learning objectives -**** -- Explain the differences between regular data frames and tibbles. -- Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems. -- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows. -- Group data and calculate aggregated statistics using group_by, mutate, and transform functions. -- Demonstrate the ability to create basic graphs with default settings, in `ggplot`. -- Demonstrate the ability to modify axes labels and titles. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -The `tidyverse` consists of a variety of packages, including, but not limited to: `ggplot2`, `dplyr`, `tidyr`, `readr`, `purrr`, `tibble`, `stringr`, and `lubridate`. - -One of the underlying premises of the `tidyverse` is getting the data to be https://r4ds.had.co.nz/tidy-data.html#tidy-data-1[tidy]. You can read a lot more about this in Hadley Wickham's excellent book, https://r4ds.had.co.nz[R for Data Science]. - -There is an excellent graphic https://r4ds.had.co.nz/introduction.html#what-you-will-learn[here] that illustrates a general workflow for data science projects: - -. Import -. Tidy -. Iterate on, to gain understanding: - 1. Transform - 2. Visualize - 3. Model -. Communicate - -This is a good general outline of how a project could be organized, but depending on the project or company, this could vary greatly and change as the goals of a project change. 
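To make the outline above a little more concrete, here is a minimal sketch of what the import, tidy, transform, and visualize steps can look like in `tidyverse` code. The file name and column names are placeholders, not part of this project's dataset:

[source,r]
----
library(tidyverse)

read_csv("some_data.csv") %>%                         # import
  pivot_longer(-id, names_to = "variable",
               values_to = "value") %>%               # tidy
  group_by(variable) %>%
  summarize(avg = mean(value, na.rm = TRUE)) %>%      # transform
  ggplot(aes(x = variable, y = avg)) +                # visualize
  geom_col() +
  labs(title = "Average value by variable", x = "Variable", y = "Average")
----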
- -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/okcupid/filtered/*.csv` - -== Questions - -[source,r] ----- -datamine_py() -library(tidyverse) -questions <- read_csv2("/class/datamine/data/okcupid/filtered/questions.csv") -users <- read_csv("/class/datamine/data/okcupid/filtered/users.csv") ----- - -[source,r] ----- -users$id <- 1:nrow(users) -set.seed(12345) -columns_to_pivot <- 1:2278 -users_sample_long <- users[sample(nrow(users), 2200),] %>% - mutate_at(columns_to_pivot, as.character) %>% - pivot_longer(cols = columns_to_pivot, names_to="question", values_to = "selected_option") -myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X") ----- - -[source,r] ----- -users$id <- 1:nrow(users) -set.seed(12345) -columns_to_pivot <- 1:2278 -users_sample_long <- users[sample(nrow(users), 2200),] %>% - mutate_at(columns_to_pivot, as.character) %>% - pivot_longer(cols = columns_to_pivot[-1242], names_to="question", values_to = "selected_option") -myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X") ----- - -[source,r] ----- -users$id <- 1:nrow(users) -set.seed(12345) -columns_to_pivot <- 1:2278 -users_sample_long <- users[sample(nrow(users), 2200),] %>% - mutate_at(columns_to_pivot, as.character) %>% - pivot_longer(cols = columns_to_pivot[-(which(substr(names(users), 1, 1) != "q"))], names_to="question", values_to = "selected_option") -myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X") ----- - -[source,r] ----- -myDF <- myDF %>% mutate(generation=case_when(d_age<=24 ~ "Gen Z", - between(d_age, 25, 40) ~ "Millenial", - between(d_age, 41, 56) ~ "Gen X", - between(d_age, 57, 66) ~ "Boomers II", - TRUE ~ "Other")) ----- - -[source,r] ----- -ggplot(myDF[1:100,]) + - geom_point(aes(x=d_age, y = lf_min_age, col=gender2), alpha=.6) + - labs(title="Minimum dating age by gender", x="User age", y="Minimum date age") ----- - -=== Question 1 - -Let's pick up where we left in project 10. For those who struggled with project 10, I will post the solutions above either on Saturday morning, or at the latest Monday. Re-run your code from project 10 so we, once again, have our `tibble`, `myDF`. - -At the end of project 10 we created a scatterplot showing `d_age` on the x-axis, and `lf_min_age` on the y-axis. In addition, we colored the points by `gender2`. In many cases, instead of just coloring the different dots, we may want to do the exact _same_ plot for _different_ groups. This can easily be accomplished using `ggplot`. - -*Without* splitting or filtering your data prior to creating the plots, create a graphic with plots for each `generation` where we show `d_age` on the x-axis and `lf_min_age` on the y-axis, colored by `gender2`. - -[IMPORTANT] -==== -You do not need to modify `myDF` at all. -==== - -[IMPORTANT] -==== -This may take quite a few minutes to create. Before creating a plot with the entire myDF, use myDF[1:50,]. If you are in a time crunch, the minimum number of points to plot to get full credit is 500, but if you wait, the plot is a bit more telling. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- The plot produced. -==== - -=== Question 2 - -By default, `facet_wrap` and `facet_grid` maintain the same scale for the x and y axes across the various plots. This makes it easier to compare visually. In this case, it may make it harder to see the patterns that emerge. 
Modify your code from question (1) to allow each facet to have its own x and y axis limits. - -[TIP] -==== -Look at the argument `scales` in the `facet_wrap`/`facet_grid` functions. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- The plot produced. -==== - -=== Question 3 - -Let's say we have a theory that the older generations tend to smoke more. You decided you want to create a plot that compares the percentage of smokers per `generation`. Before we do this, we need to wrangle the data a bit. - -What are the possible values of `d_smokes`? Create a new column in `myDF` called `is_smoker` that has values `TRUE`, `FALSE`, or `NA` when applicable. You will need to determine how you will assign a user as a smoker or not -- this is up to you! Explain your cutoffs. Make sure you stay in the `tidyverse` to solve this problem. - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- 1-2 sentences explaining your logic and cutoffs for the new `is_smoker` column. -- The `table` of the `is_smoker` column. -==== - -=== Question 4 - -Great! Now that we have our new `is_smoker` column, create a new `tibble` called `smokers_per_gen`. `smokers_per_gen` should be a summary of `myDF` containing the percentage of smokers per `generation`. - -[TIP] -==== -The result, `smokers_per_gen` should have 2 columns: `generation` and `percentage_of_smokers`. It should have the same number of rows as the number of `generations`. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -Create a Cleveland dot plot using `ggplot` to show the percentage of smokers for each different `generation`. Use `ggthemr` to give your plot a new look! You can choose any theme you'd like! - -Is our theory from question (3) correct? Explain why you think so, or not. - -(OPTIONAL I, 0 points) To make the plot have a more aesthetic look, consider reordering the data by percentage of smokers, or even by the age of `generation`. You can do that before passing the data using the `arrange` function, or inside the `geom_point` function, using the `reorder` function. To re-order by `generation`, you can either use brute force, or you can create a new column called `avg_age` while using `summarize`. `avg_age` should be the average age for each group (using the variable `d_age`). You can use this new column, `avg_age` to re-order the data. - -(OPTIONAL II, 0 points) Improve our plot, change the x-axis to be displayed as a percentage. You can use the `scales` package and the function `scale_x_continuous` to accomplish this. - -[TIP] -==== -Use `geom_point` **not** `geom_dotplot` to solve this problem. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- The plot produced. -- 1-2 sentences commenting on the theory, and what are your conclusions based on your plot (if any). -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project12.adoc b/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project12.adoc deleted file mode 100644 index bf4012a53..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project12.adoc +++ /dev/null @@ -1,119 +0,0 @@ -= STAT 29000: Project 12 -- Spring 2021 - -**Motivation:** As we mentioned before, data wrangling is a big part in any data driven project. 
https://www.amazon.com/Exploratory-Data-Mining-Cleaning/dp/0471268518["Data Scientists spend up to 80% of the time on data cleaning and 20 percent of their time on actual data analysis."] Therefore, it is worth to spend some time mastering how to best tidy up our data. - -**Context:** We are continuing to practice using various `tidyverse` packages, in order to wrangle data. - -**Scope:** python - -.Learning objectives -**** -- Explain the differences between regular data frames and tibbles. -- Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems. -- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows. -- Group data and calculate aggregated statistics using group_by, mutate, and transform functions. -- Demonstrate the ability to create basic graphs with default settings, in `ggplot`. -- Demonstrate the ability to modify axes labels and titles. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -The first step in any data science project is to define our problem statement. In this project, our goal is to gain insights into customers' behaviours with regards to online orders and restaurant ratings. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/restaurant/*.csv` - -== Questions - -=== Question 1 - -Load the `tidyverse` suite a packages, and read the data from files `orders.csv`, `train_customers.csv`, and `vendors.csv` into `tibble`s named `orders`, `customers`, and `vendors` respectively. - -Take a look the `tibbles` and describe in a few sentences the type of information contained in each dataset. Although the name can be self-explanatory, it is important to get an idea of what exactly we are looking at. For each combination of 2 datasets, which column would you use to join them? - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- 1-2 sentences explaining each dataset (`orders`, `customers`, and `vendors`). -- 1-2 sentences for each combination of 2 datasets describing if we could combine the datasets or not, and which column you would you use to join them. -==== - -=== Question 2 - -Let's tidy up our datasets a bit prior to joining them. For each dataset, complete the tasks below. - -- `orders`: remove columns from and between `preparationtime` to `delivered_time` (inclusive). -- `customers`: take a look at the column `dob`. Based on its values, what do you believe it was supposed to contain? Can we rely on the numbers selected? Why or why not? Based on your answer, keep the columns `akeed_customer_id`, `gender`, and `dob`, OR just `akeed_customer_id` and `gender`. -- `vendors`: take a look at columns `country_id` and `city_id`. Would they be useful to compare the vendors in our dataset? Why or why not? If not, remove the columns from the dataset. - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- 1-2 sentences describing what columns you kept for `vendors` and `customers` and why. -==== - -=== Question 3 - -Use your solutions from questions (1) and (2), and the join functions from tidyverse (`inner_join`, `left_join`, `right_join`, and `full_join`) to create a single `tibble` called `myDF` containing information only where all 3 `tibbles` intersect. 
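As a generic illustration of the chaining pattern (the key column names below are hypothetical placeholders -- use the columns you identified in question (1)):

[source,r]
----
# keep only rows present in all three tibbles by chaining joins;
# "customer_id" and "vendor_id" are placeholder key names
myDF <- orders %>%
  inner_join(customers, by = "customer_id", suffix = c("_orders", "_customers")) %>%
  inner_join(vendors, by = "vendor_id", suffix = c("", "_vendors"))
----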
- -For example, we do not want `myDF` to contain orders from customers that are not in `customers` tibble. Which function(s) from the tidyverse did you use to merge the datasets and why? - -[TIP] -==== -`myDF` should have 132,226 rows. -==== - -[TIP] -==== -When combining two datasets, you may want to change the argument `suffix` in the join function to specify from which dataset it came from. For example, when joining `customers` and `orders`: `*_join(customers, orders, suffix = c('_customers', '_orders'))`. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- 1-2 sentences describing which function you used, and why. -==== - -=== Question 4 - -Great, now we have a single, tidy dataset to work with. There are 2 vendor categories in myDF, `Restaurants` and `Sweets & Bakes`. We would expect there to be some differences. Let's compare them using the following variables: `deliverydistance`, `item_count`, `grand_total`, and `vendor_discount_amount`. Our end goal (by the end of question 5) is to create a histogram colored by the vendor's category (`vendor_category_en`), for each variable. - -To accomplish this easily using `ggplot`, we will take advantage of `pivot_longer`. Pivot columns `deliverydistance`, `item_count`, `grand_total`, and `vendor_discount_amount` in `myDF`. The end result should be a `tibble` with columns `variable` and `values`, which contain the name of the pivoted column (`variable`), and values of those columns (`values`) Call this modified dataset `myDF_long`. - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -Now that we have the data in the ideal format for our plot, create a histogram for each variable. Make sure to color them by vendor category (`vendor_category_en`). How do the two types of vendors compare in these 4 variables? - -[TIP] -==== -Use the argument `fill` instead of `color` in `geom_histogram`. -==== - -[TIP] -==== -You may want to add some transparency to your plot. Add it using `alpha` argument in `geom_histogram`. -==== - -[TIP] -==== -You may want to change the argument `scales` in `facet_*`. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- 2-3 sentences comparing `Restaurants` and `Sweets & Bakes` for `deliverydistance`, `item_count`, `grand_total` and `vendor_discount_amount`. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project13.adoc b/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project13.adoc deleted file mode 100644 index fb93b6045..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project13.adoc +++ /dev/null @@ -1,126 +0,0 @@ -= STAT 29000: Project 13 -- Spring 2021 - -**Motivation:** Data wrangling tasks can vary between projects. Examples include joining multiple data sources, removing data that is irrelevant to the project, handling outliers, etc. Although we've practiced some of these skills, it is always worth it to spend some extra time to master tidying up our data. - -**Context:** We will continue to gain familiarity with the `tidyverse` suite of packages (including `ggplot`), and data wrangling tasks. - -**Scope:** r, tidyverse - -.Learning objectives -**** -- Explain the differences between regular data frames and tibbles. -- Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems. 
-- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows. -- Group data and calculate aggregated statistics using group_by, mutate, and transmute functions. -- Demonstrate the ability to create basic graphs with default settings, in ggplot. -- Demonstrate the ability to modify axes labels and titles. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/anvil/projects/tdm/data/consumer_complaints/complaints.csv` - -== Questions - -=== Question 1 - -Read the dataset into a `tibble` named `complaintsDF`. This dataset contains consumer complaints for over 5,000 companies. Our goal is to create a `tibble` called `companyDF` containing the following summary information for each company: - -- `Company`: The company name (`Company`) -- `State`: The state (`State`) -- `percent_timely_response`: Percentage of timely complaints (`Timely response?`) -- `percent_consumer_disputed`: Percentage of complaints that were disputed by the consumer (`Consumer disputed?`) -- `percent_submitted_online`: Percentage of complaints that were submitted online (use column `Submitted via`, and consider a submission to be an online submission if it was submitted via `Web` or `Email`) -- `total_n_complaints`: Total number of complaints - -There are various ways to create `companyDF`. Let's practice using the pipes (`%>%`) to get `companyDF`. The idea is that our code at the end of question 2 will look something like this: - -[source,r] ----- -companyDF <- complaintsDF %>% - insert_here_code_to_change_variables %>% # (question 1) - insert_here_code_to_group_and_get_summaries_per_group # (question 2) ----- - -First, create logical columns (columns containing `TRUE` or `FALSE`) for `Timely response?`, `Consumer disputed?` and `Submitted via` named `timely_response_log`, `consumer_disputed_log` and `submitted_online`, respectively. - -`timely_response_log` and `consumer_disputed_log` will have value `TRUE` if `Timely response?` and `Consumer disputed?` have values `Yes` respectively, and `FALSE` if the value for the original column is `No`. `submitted_online` will have value `TRUE` if the the complaint was submitted via `Web` or `Email`. - -You can double check your results for each column by getting a table with the original and modified column, as shown below. In this case, we would want all `TRUE` values to be in row `Yes`, and all `FALSE` to be in row `No`. - -[source,r] ----- -table(companyDF$`Timely response?`, companyDF$timely_response_log) ----- - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -==== - -=== Question 2 - -Continue the pipeline we started in question (1). Get the summary information for each company. Note that you will need to include more pipes in the pseudo-code from question (1) as we want the summary for _each_ company in _each_ state. If a company is present in 4 states, `companyDF` should have 4 rows for that company -- one for each state. For the rest of the project, we will refer to a company as its unique combination of `Company` and `State`. - -[TIP] -==== -The function `n()` from `dplyr` counts the number of observations in the current group. It can only by used within `mutate`/`transmute`, `filter`, and the `summarize` functions. -==== - -.Items to submit -==== -- R code used to solve the problem. 
-- Output from running your code. -==== - -=== Question 3 - -Using `ggplot2`, create a scatterplot showing the relationship between `percent_timely_response` and `percent_consumer_disputed` for companies with at least 500 complaints. Based on your results, do you believe there is an association between how timely the company's response is, and whether the consumer disputes? Why or why not? - -[TIP] -==== -Remember, here we consider each row of `companyDF` a unique company. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -==== - -=== Question 4 - -Which company, with at least 250 complaints, has the highest percent of consumer dispute? - -[IMPORTANT] -==== -We are learning `tidyverse`, so use `tidyverse` functions to solve this problem. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -(OPTIONAL, 0 pts) Create a graph using `ggplot2` that compares `States` based on any columns from `companyDF` or `complaintsDF`. You may need to summarize the data, filter, or even create new variables depending on what your metric of comparison is. Below are some examples of graphs that can be created. Do not feel limited by them. Make sure to change the labels for each axis, add a title, and change the theme. - -- Cleveland's dotplot for the top 10 states with the highest ratio between percent of disputed complaints and timely response. -- Bar graph showing the total number of complaints in each state. -- Scatterplot comparing the percentage of timely responses in the state and average number of complaints per state. -- Line plot, where each line is a state, showing the total number of complaints per year. - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- The plot produced. -- 1-2 sentences commenting on your plot. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project14.adoc b/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project14.adoc deleted file mode 100644 index 6a62af674..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/29000/29000-s2021-project14.adoc +++ /dev/null @@ -1,131 +0,0 @@ -= STAT 29000: Project 14 -- Spring 2021 - -**Motivation:** We covered a _lot_ this year! When dealing with data driven projects, it is useful to explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance, in this project we are going to practice using some of the skills you've learned, and review topics and languages in a generic way. - -**Context:** We are on the last project where we will leave it up to you on how to solve the problems presented. - -**Scope:** python, r, bash, unix, computers - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -- `/class/datamine/data/disney` -- `/class/datamine/data/movies_and_tv/imdb.db` -- `/class/datamine/data/amazon/music.txt` -- `/class/datamine/data/craigslist/vehicles.csv` -- `/class/datamine/data/flights/2008.csv` - -== Questions - -[IMPORTANT] -==== -Answer the questions below using the language of your choice (R, Python, bash, awk, etc.). 
Don't feel limited by one language, you can use different languages to answer different questions. If you are feeling bold, you can also try answering the questions using all languages! -==== - -=== Question 1 - -What percentage of flights in 2008 had a delay due to the weather? Use the `/class/datamine/data/flights/2008.csv` dataset to answer this question. - -[TIP] -==== -Consider a flight to have a weather delay if `WEATHER_DELAY` is greater than 0. -==== - -.Items to submit -==== -- The code used to solve the question. -- The answer to the question. -==== - - -=== Question 2 - -Which listed manufacturer has the most expensive previously owned car listed in Craiglist? Use the `/class/datamine/data/craigslist/vehicles.csv` dataset to answer this question. Only consider listings that have listed price less than $500,000 _and_ where manufacturer information is available. - -.Items to submit -==== -- The code used to solve the question. -- The answer to the question. -==== - -=== Question 3 - -What is the most common and least common `type` of title in imdb ratings? Use the `/class/datamine/data/movies_and_tv/imdb.db` dataset to answer this question. - -[TIP] -==== -Use the `titles` table. -==== - -[TIP] -==== -Don't know how to use SQL yet? To get this data into an R data.frame , for example: - -[source,r] ----- -library(tidyverse) -con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") -myDF <- tbl(con, "titles") ----- -==== - -.Items to submit -==== -- The code used to solve the question. -- The answer to the question. -==== - -=== Question 4 - -What percentage of music reviews contain the words `hate` or `hated`, and what percentage contain the words `love` or `loved`? Use the `/class/datamine/data/amazon/music.txt` dataset to answer this question. - -[TIP] -==== -It _may_ take a minute to run, depending on the tool you use. -==== - -.Items to submit -==== -- The code used to solve the question. -- The answer to the question. -==== - -=== Question 5 - -What is the best time to visit Disney? Use the data provided in `/class/datamine/data/disney` to answer the question. - -First, you will need determine what you will consider "time", and the criteria you will use. See below some examples. Don't feel limited by them! Be sure to explain your criteria, use the data to investigate, and determine the best time to visit! Write 1-2 sentences commenting on your findings. - -- As Splash Mountain is my favorite ride, my criteria is the smallest monthly average wait times for Splash Mountain between the years 2017 and 2019. I'm only considering these years as I expect them to be more representative. My definition of "best time" will be the "best months". -- Consider "best times" the days of the week that have the smallest wait time on average for all rides, or for certain favorite rides. -- Consider "best times" the season of the year where the park is open for longer hours. -- Consider "best times" the weeks of the year with smallest average high temperature in the day. - -.Items to submit -==== -- The code used to solve the question. -- 1-2 sentences detailing the criteria you are going to use, its logic, and your defition for "best time". -- The answer to the question. -- 1-2 sentences commenting on your answer. -==== - -=== Question 6 - -Finally, use RMarkdown (and its formatting) to outline 3 things you learned this semester from The Data Mine. For each thing you learned, give a mini demonstration where you highlight with text and code the thing you learned, and why you think it is useful. 
If you did not learn anything this semester from The Data Mine, write about 3 things you _want_ to learn. Provide examples that demonstrate _what_ you want to learn and write about _why_ it would be useful. - -[IMPORTANT] -==== -Make sure your answer to this question is formatted well and makes use of RMarkdown. -==== - -.Items to submit -==== -- 3 clearly labeled things you learned. -- 3 mini-demonstrations where you highlight with text and code the thin you learned, and why you think it is useful. -OR -- 3 clearly labeled things you _want_ to learn. -- 3 examples demonstrating _what_ you want to learn, with accompanying text explaining _why_ you think it would be useful. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project01.adoc b/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project01.adoc deleted file mode 100644 index 10d9ebec8..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project01.adoc +++ /dev/null @@ -1,179 +0,0 @@ -= STAT 39000: Project 1 -- Spring 2021 - -**Motivation:** Extensible Markup Language or XML is a very important file format for storing structured data. Even though formats like JSON, and csv tend to be more prevalent, many, many legacy systems still use XML, and it remains an appropriate format for storing complex data. In fact, JSON and csv are quickly becoming less relevant as new formats and serialization methods like https://arrow.apache.org/faq/[parquet] and https://developers.google.com/protocol-buffers[protobufs] are becoming more common. - -**Context:** In previous semesters we've explored XML. In this project we will refresh our skills and, rather than exploring XML in R, we will use the `lxml` package in Python. This is the first project in a series of 5 projects focused on web scraping in R and Python. - -**Scope:** python, XML - -.Learning objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Match XML terms to sections of XML demonstrating working knowledge. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/apple/health/watch_dump.xml` - -== Resources - -We realize that it may be a while since you've used Python. That's okay! We are going to be taking things at a much more reasonable pace than Spring 2020. - -Some potentially useful resources for the semester include: - -- The STAT 19000 projects. We are easing 19000 students into Python and will post solutions each week. It would be well worth 10 minutes to look over the questions and solutions each week. -- https://towardsdatascience.com/cheat-sheet-for-python-dataframe-r-dataframe-syntax-conversions-450f656b44ca[Here] is a decent cheat sheet that helps you quickly get an idea of how to do something you know how to do in R, in Python. -- https://thedatamine.github.io/the-examples-book/[The Examples Book] -- updating daily with more examples and videos. Be sure to click on the "relevant topics" links as we try to point you to topics with examples that should be particularly useful to solve the problems we assign. 
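If it has been a while since you last touched `lxml`, the general pattern for loading an XML file and starting to poke at its structure looks roughly like the sketch below (shown with this project's file path; discovering the actual tag names is what the questions walk you through):

[source,python]
----
from lxml import etree

# parse the XML document into an element tree
tree = etree.parse("/class/datamine/data/apple/health/watch_dump.xml")

# the root element's tag, and the unique tags of its immediate children
root = tree.getroot()
print(root.tag)
print({child.tag for child in root})
----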
- -== Questions - -[IMPORTANT] -==== -It would be well worth your time to read through the XML section of the book, as well as take the time to work through https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html[pandas 10 minute intro]. -==== - -=== Question 1 - -A good first step when working with XML is to get an idea how your document is structured. Normally, there should be good documentation that spells this out for you, but it is good to know what to do when you _don't_ have the documentation. Start by finding the "root" node. What is the name of the root node of the provided dataset? - -[TIP] -==== -Make sure to import the `lxml` package first: - -[source,python] ----- -from lxml import etree ----- -==== - -Here are two videos about running Python in RStudio: - -++++ - -++++ - -++++ - -++++ - -And here is a video about XML scraping in Python: - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 2 - -Remember, XML can be nested. In question (1) we figured out what the root node was called. What are the names of the next "tier" of elements? - -[TIP] -==== -Now that we know the root node, you could use the root node name as a part of your xpath expression. -==== - -[TIP] -==== -As you may have noticed in question (1) the `xpath` method returns a list. Sometimes this list can contain many repeated tag names. Since our goal is to see the names of the second "tier" elements, you could convert the resulting `list` to a `set` to quickly see the unique list as `set`'s only contain unique values. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 3 - -Continue to explore each "tier" of data until there isn't any left. Name the "full paths" of all of the "last tier" tags. - -[TIP] -==== -Let's say a "last tier" tag is just a path where there are no more nested elements. For example, `/HealthData/Workout/WorkoutRoute/FileReference` is a "last tier" tag. If you try and get the nested elements for it, they don't exist: - -[source,python] ----- -tree.xpath("/HealthData/Workout/WorkoutRoute/FileReference/*") ----- -==== - -[TIP] -==== -Here are 3 of the 7 "full paths": - ----- -/HealthData/Workout/WorkoutRoute/FileReference -/HealthData/Record/MetadataEntry -/HealthData/ActivitySummary ----- -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 4 - -At this point in time you may be asking yourself "but where is the data"? Depending on the structure of the XML file, the data could either be between tags like: - -[source,HTML] ----- -mydata ----- - -Or, it could be in an attribute: - -[source,HTML] ----- -What is cat spelled backwards? ----- - -Collect the "ActivitySummary" data, and convert the list of dicts to a `pandas` DataFrame. The following is an example of converting a list of dicts to a `pandas` DataFrame called `myDF`: - -[source,python] ----- -import pandas as pd -list_of_dicts = [] -list_of_dicts.append({'columnA': 1, 'columnB': 2}) -list_of_dicts.append({'columnB': 4, 'columnA': 1}) -myDF = pd.DataFrame(list_of_dicts) ----- - -[TIP] -==== -It is important to note that an element's "attrib" attribute looks and feels like a `dict`, but it is actually a `lxml.etree._Attrib`. If you try to convert a list of `lxml.etree._Attrib` to a `pandas` DataFrame, it will not work out as you planned. 
Make sure to first convert each `lxml.etree._Attrib` to a `dict` before converting to a DataFrame. You can do so like: - -[source,python] ----- -# this will convert a single `lxml.etree._Attrib` to a dict -my_dict = dict(my_lxml_etree_attrib) ----- -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -`pandas` is a Python package that provides the DataFrame and Series classes. A DataFrame is very similar to a data.frame in R and can be used to manipulate the data within very easily. A Series is the class that handles a single column of a DataFrame. Go through the https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html[pandas in 10 minutes] page from the official documentation. Sort, find, and print the top 5 rows of data based on the "activeEnergyBurned" column. - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project02.adoc b/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project02.adoc deleted file mode 100644 index 2d7780434..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project02.adoc +++ /dev/null @@ -1,136 +0,0 @@ -= STAT 39000: Project 2 -- Spring 2021 - -**Motivation:** Web scraping is is the process of taking content off of the internet. Typically this goes hand-in-hand with parsing or processing the data. Depending on the task at hand, web scraping can be incredibly simple. With that being said, it can quickly become difficult. Typically, students find web scraping fun and empowering. - -**Context:** In the previous project we gently introduced XML and xpath expressions. In this project, we will learn about web scraping, scrape data from The New York Times, and parse through our newly scraped data using xpath expressions. - -**Scope:** python, web scraping, xml, xref:starter-guides:data-science:html.adoc[`html`] - -.Learning objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Use the requests package to scrape a web page. -- Use the lxml package to filter and parse data from a scraped web page. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -You will be extracting your own data from online in this project. There is no base dataset. - -== Questions - -=== Question 1 - -The New York Times is one of the most popular newspapers in the United States. Open a modern browser (preferably Firefox or Chrome), and navigate to https://nytimes.com. - -By the end of this project you will be able to scrape some data from this website! The first step is to explore the structure of the website. You can either right click and click on "view page source", which will pull up a page full of HTML used to render the page. Alternatively, if you want to focus on a single element, an article title, for example, right click on the article title and click on "inspect element". This will pull up an inspector that allows you to see portions of the HTML. - -Click around the website and explore the HTML however you see fit. 
Open a few front page articles and notice how most articles start with a bunch of really important information, namely: an article title, summary, picture, picture caption, picture source, author portraits, authors, and article datetime. - -For example: - -https://www.nytimes.com/2021/01/19/us/politics/trump-china-xinjiang.html - -![](./images/nytimes_image.jpg) - -Copy and paste the **h1** element (in its entirety) containing the article title (for the article provided) in an HTML code chunk. Do the same for the same article's summary. - -++++ - -++++ - -.Items to submit -==== -- 2 code chunks containing the HTML requested. -==== - -=== Question 2 - -In question (1) we copied two elements of an article. When scraping data from a website, it is important to continually consider the patterns in the structure. Specifically, it is important to consider whether or not the defining characteristics you use to parse the scraped data will continue to be in the same format for _new_ data. What do I mean by defining characterstic? I mean some combination of tag, attribute, and content from which you can isolate the data of interest. - -For example, given a link to a new nytimes article, do you think you could isolate the article title by using the `id="link-4686dc8b"` attribute of the *h1* tag? Maybe, or maybe not, but it sure seems like "link-4686dc8b" might be unique to the article and not able to be used given a new article. - -Write an xpath expression to isolate the article title, and another xpath expression to isolate the article summary. - -[IMPORTANT] -==== -You do _not_ need to test your xpath expression yet, we will be doing that shortly. -==== - -.Items to submit -==== -- Two xpath expressions in an HTML code chunk. -==== - -=== Question 3 - -Use the `requests` package to scrape the webpage containing our article from questions (1) and (2). Use the `lxml.html` package and the `xpath` method to test out your xpath expressions from question (2). Did they work? Print the content of the elements to confirm. - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 4 - -Here are a list of article links from https://nytimes.com: - -https://www.nytimes.com/2021/01/19/us/politics/trump-china-xinjiang.html - -https://www.nytimes.com/2021/01/06/technology/personaltech/tech-2021-augmented-reality-chatbots-wifi.html - -https://www.nytimes.com/2021/01/13/movies/letterboxd-growth.html - -Write a function called `get_article_and_summary` that accepts a string called `link` as an argument, and returns both the article title and summary. Test `get_article_and_summary` out on each of the provided links: - -```{python, eval=F} -title, summary = get_article_and_summary('https://www.nytimes.com/2021/01/19/us/politics/trump-china-xinjiang.html') -print(f'Title: {title}, Summary: {summary}') -title, summary = get_article_and_summary('https://www.nytimes.com/2021/01/06/technology/personaltech/tech-2021-augmented-reality-chatbots-wifi.html') -print(f'Title: {title}, Summary: {summary}') -title, summary = get_article_and_summary('https://www.nytimes.com/2021/01/13/movies/letterboxd-growth.html') -print(f'Title: {title}, Summary: {summary}') -``` - -[TIP] -==== -The first line of your function should look like this: - -`def get_article_and_summary(myURL: str) -> (str, str):` -==== - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. 
-==== - -=== Question 5 - -In question (1) we mentioned a myriad of other important information given at the top of most New York Times articles. Choose *two* other listed pieces of information and copy, paste, and update your solution to question (4) to scrape and return those chosen pieces of information. - -[IMPORTANT] -==== -If you choose to scrape non-textual data, be sure to return data of an appropriate type. For example, if you choose to scrape one of the images, either print the image or return a PIL object. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project03.adoc b/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project03.adoc deleted file mode 100644 index 64bfd8865..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project03.adoc +++ /dev/null @@ -1,190 +0,0 @@ -= STAT 39000: Project 3 -- Spring 2021 - -**Motivation:** Web scraping takes practice, and it is important to work through a variety of common tasks in order to know how to handle those tasks when you next run into them. In this project, we will use a variety of scraping tools in order to scrape data from https://trulia.com. - -**Context:** In the previous project, we got our first taste at actually scraping data from a website, and using a parser to extract the information we were interested in. In this project, we will introduce some tasks that will require you to use a tool that let's you interact with a browser, selenium. - -**Scope:** python, web scraping, selenium - -.Learning objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Use the requests package to scrape a web page. -- Use the lxml package to filter and parse data from a scraped web page. -- Use selenium to interact with a browser in order to get a web page to a desired state for scraping. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -Visit https://trulia.com. Many websites have a similar interface, i.e. a bold and centered search bar for a user to interact with. Using `selenium` write Python code that that first finds the `input` element, and then types "West Lafayette, IN" followed by an emulated "Enter/Return". Confirm you code works by printing the url after that process completes. - -[TIP] -==== -You will want to use `time.sleep` to pause a bit after the search so the updated url is returned. -==== - -++++ - -++++ - -That video is already relevant for Question 2 too. - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 2 - -Use your code from question (1) to test out the following queries: - -- West Lafayette, IN (City, State) -- 47906 (Zip) -- 4505 Kahala Ave, Honolulu, HI 96816 (Full address) - -If you look closely you will see that there are patterns in the url. For example, the following link would probably bring up homes in Crawfordsville, IN: https://trulia.com/IN/Crawfordsville. With that being said, if you only had a zip code, like 47933, it wouldn't be easy to guess https://www.trulia.com/IN/Crawfordsville/47933/, hence, one reason why the search bar is useful. 
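For reference, the overall shape of the interaction from question (1) is roughly the sketch below. The locator string is a placeholder -- you will need to inspect the page to find the real `input` element, and your driver setup may differ:

[source,python]
----
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()  # or however you configured your driver in question (1)
driver.get("https://trulia.com")
time.sleep(5)

# placeholder xpath -- inspect the page for the real attributes of the search input
search = driver.find_element_by_xpath("//input[@id='searchbox']")
search.send_keys("West Lafayette, IN")
search.send_keys(Keys.RETURN)

time.sleep(5)
print(driver.current_url)
----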
- -If you used xpath expressions to complete question (1), instead use a https://selenium-python.readthedocs.io/locating-elements.html#locating-elements[different method] to find the `input` element. If you used a different method, use xpath expressions to complete question (1). - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 3 - -Let's call the page after a city/state or zipcode search a "sales page". For example: - -![](./images/trulia.png) - -Use `requests` to scrape the entire page: https://www.trulia.com/IN/West_Lafayette/47906/. Use `lxml.html` to parse the page and get all of the `img` elements that make up the house pictures on the left side of the website. - -[IMPORTANT] -==== -Make sure you are actually scraping what you think you are scraping! Try printing your html to confirm it has the content you think it should have: - -[source,python] ----- -import requests -response = requests.get(...) -print(response.text) ----- -==== - -[TIP] -==== -Are you human? Depends. Sometimes if you add a header to your request, it won't ask you if you are human. Let's pretend we are Firefox: - -[source,python] ----- -import requests -my_headers = {'User-Agent': 'Mozilla/5.0'} -response = requests.get(..., headers=my_headers) ----- -==== - -Okay, after all of that work you may have discovered that only a few images have actually been scraped. If you cycle through all of the `img` elements and try to print the value of the `src` attribute, this will be clear: - -[source,python] ----- -import lxml.html -tree = lxml.html.fromstring(response.text) -elements = tree.xpath("//img") -for element in elements: - print(element.attrib.get("src")) ----- - -This is because the webpage is not immediately, _completely_ loaded. This is a common website behavior to make things appear faster. If you pay close to when you load https://www.trulia.com/IN/Crawfordsville/47933/, and you quickly scroll down, you will see images still needing to finish rendering all of the way, slowly. What we need to do to fix this, is use `selenium` (instead of `lxml.html`) to behave like a human and scroll prior to scraping the page! Try using the following code to slowly scroll down the page before finding the elements: - -[source,python] ----- -# driver setup and get the url -# Needed to get the window size set right and scroll in headless mode -myheight = driver.execute_script('return document.body.scrollHeight') -driver.set_window_size(1080,myheight+100) -def scroll(driver, scroll_point): - driver.execute_script(f'window.scrollTo(0, {scroll_point});') - time.sleep(5) - -scroll(driver, myheight*1/4) -scroll(driver, myheight*2/4) -scroll(driver, myheight*3/4) -scroll(driver, myheight*4/4) -# find_elements_by_* ----- - -[TIP] -==== -At the time of writing there should be about 86 links to images of homes. -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 4 - -Write a function called `avg_house_cost` that accepts a zip code as an argument, and returns the average cost of the first page of homes. Now, to make this a more meaningful statistic, filter for "3+" beds and _then_ find the average. Test `avg_house_cost` out on the zip code `47906` and print the average costs. - -[IMPORTANT] -==== -Use `selenium` to "click" on the "3+ beds" filter. 
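The click itself is just locate-then-click; the xpath below is a placeholder, and `driver` is assumed to be the instance you already created:

[source,python]
----
import time

# placeholder locator -- inspect the filter widget for the element you actually need
beds_filter = driver.find_element_by_xpath("//li[contains(., '3+')]")
beds_filter.click()
time.sleep(10)  # give the filtered results time to load
----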
-==== - -[TIP] -==== -If you get an error that tells you `button` is not clickable because it is covered by an `li` element, try clicking on the `li` element instead. -==== - -[TIP] -==== -You will want to wait a solid 10-15 seconds for the sales page to load before trying to select or click on anything. -==== - -[TIP] -==== -Your results may end up including prices for "Homes Near \". This is okay. Even better if you manage to remove those results. If you _do_ choose to remove those results, take a look at the `data-testid` attribute with value `search-result-list-container`. Perhaps only selecting the children of the first element will get the desired outcome. -==== - -[TIP] -==== -You can use the following code to remove the non-numeric text from a string, and then convert to an integer: - -[source,python] ----- -import re -int(re.sub("[^0-9]", "", "removenon45454_numbers$")) ----- -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -Get creative. Either add an interesting feature to your function from (4), or use `matplotlib` to generate some sort of accompanying graphic with your output. Make sure to explain what your additions do. - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project04.adoc b/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project04.adoc deleted file mode 100644 index fcce6421b..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project04.adoc +++ /dev/null @@ -1,304 +0,0 @@ -= STAT 39000: Project 4 -- Spring 2021 - -**Motivation:** In this project we will continue to hone your web scraping skills, introduce you to some "gotchas", and give you a little bit of exposure to a powerful tool called cron. - -**Context:** We are in the second to last project focused on web scraping. This project will introduce some supplementary tools that work well with web scraping: cron, sending emails from Python, etc. - -**Scope:** python, web scraping, selenium, cron - -.Learning objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Use the requests package to scrape a web page. -- Use the lxml package to filter and parse data from a scraped web page. -- Use the beautifulsoup4 package to filter and parse data from a scraped web page. -- Use selenium to interact with a browser in order to get a web page to a desired state for scraping. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -Check out the following website: https://project4.tdm.wiki - -Use `selenium` to scrape and print the 6 colors of pants offered. - -++++ - -++++ - -++++ - -++++ - -[TIP] -==== -You _may_ have to interact with the webpage for certain elements to render. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 2 - -Websites are updated frequently. You can imagine a scenario where a change in a website is a sign that there is more data available, or that something of note has happened. This is a fake website designed to help students emulate real changes to a website. 
Specifically, there is one part of the website that has two possible states (let's say, state `A` and state `B`). Upon refreshing the website, or scraping the website again, there is an $$x%$$ chance that the website will be in state `A` and a $$1-x%$$ chance the website will be in state `B`. - -Describe the two states (the thing (element or set of elements) that changes as you refresh the page), and scrape the website enough to estimate $$x$$. - -++++ - -++++ - -[TIP] -==== -You _will_ need to interact with the website to "see" the change. -==== - -[TIP] -==== -Since we are just asking about a state, and not any specific element, you could use the `page_source` attribute of the `selenium` driver to scrape the entire page instead of trying to use xpath expressions to find a specific element. -==== - -[TIP] -==== -Your estimate of $$x$$ does not need to be perfect. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- What state `A` and `B` represent. -- An estimate for `x`. -==== - -=== Question 3 - -Dig into the changing "thing" from question (2). What specifically is changing? Use selenium and xpath expressions to scrape and print the content. What are the two possible values for the content? - -++++ - -++++ - -[TIP] -==== -Due to the changes that occur when a button is clicked, I'd highly advice you to use the `data-color` attribute in your xpath expression instead of `contains(text(), 'blahblah')`. -==== - -[TIP] -==== -`parent::` and `following-sibling::` may be useful https://www.w3schools.com/xml/xpath_axes.asp[xpath axes] to use. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 4 - -The following code allows you to send an email using Python from your Purdue email account. Replace the username and password with your own information and send a test email to yourself to ensure that it works. - -++++ - -++++ - -[IMPORTANT] -==== -Do **NOT** include your password in your homework submission. Any time you need to type your password in you final submission just put something like "SUPERSECRETPASSWORD" or "MYPASSWORD". -==== - -[TIP] -==== -To include an image (or screenshot) in RMarkdown, try `![](./my_image.png)` where `my_image.png` is inside the same folder as your `.Rmd` file. -==== - -[TIP] -==== -The spacing and tabs near the `message` variable are very important. Make sure to copy the code exactly. Otherwise, your subject may not end up in the subject of your email, or the email could end up being blank when sent. -==== - -[TIP] -==== -Questions 4 and 5 were inspired by examples and borrowed from the code found at the https://realpython.com/python-send-email/[Real Python] website. 
-==== - -[source,python] ----- -def send_purdue_email(my_purdue_email, my_password, to, my_subject, my_message): - import smtplib, ssl - from email.mime.text import MIMEText - from email.mime.multipart import MIMEMultipart - - message = MIMEMultipart("alternative") - message["Subject"] = my_subject - message["From"] = my_purdue_email - message["To"] = to - - # Create the plain-text and HTML version of your message - text = f'''\ -Subject: {my_subject} -To: {to} -From: {my_purdue_email} - -{my_message}''' - html = f'''\ - - - {my_message} - - -''' - # Turn these into plain/html MIMEText objects - part1 = MIMEText(text, "plain") - part2 = MIMEText(html, "html") - - # Add HTML/plain-text parts to MIMEMultipart message - # The email client will try to render the last part first - message.attach(part1) - message.attach(part2) - - context = ssl.create_default_context() - with smtplib.SMTP("smtp.purdue.edu", 587) as server: - server.ehlo() # Can be omitted - server.starttls(context=context) - server.ehlo() # Can be omitted - server.login(my_purdue_email, my_password) - server.sendmail(my_purdue_email, to, message.as_string()) - -# this sends an email from kamstut@purdue.edu to mdw@purdue.edu -# replace supersecretpassword with your own password -# do NOT include your password in your homework submission. -send_purdue_email("kamstut@purdue.edu", "supersecretpassword", "mdw@purdue.edu", "put subject here", "put message body here") ----- - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- Screenshot showing your received the email. -==== - -=== Question 5 - -The following is the content of a new Python script called `is_in_stock.py`: - -[source,python] ----- -def send_purdue_email(my_purdue_email, my_password, to, my_subject, my_message): - import smtplib, ssl - from email.mime.text import MIMEText - from email.mime.multipart import MIMEMultipart - - message = MIMEMultipart("alternative") - message["Subject"] = my_subject - message["From"] = my_purdue_email - message["To"] = to - - # Create the plain-text and HTML version of your message - text = f'''\ -Subject: {my_subject} -To: {to} -From: {my_purdue_email} - -{my_message}''' - html = f'''\ - - - {my_message} - - -''' - # Turn these into plain/html MIMEText objects - part1 = MIMEText(text, "plain") - part2 = MIMEText(html, "html") - - # Add HTML/plain-text parts to MIMEMultipart message - # The email client will try to render the last part first - message.attach(part1) - message.attach(part2) - - context = ssl.create_default_context() - with smtplib.SMTP("smtp.purdue.edu", 587) as server: - server.ehlo() # Can be omitted - server.starttls(context=context) - server.ehlo() # Can be omitted - server.login(my_purdue_email, my_password) - server.sendmail(my_purdue_email, to, message.as_string()) - -def main(): - # scrape element from question 3 - - # does the text indicate it is in stock? - - # if yes, send email to yourself telling you it is in stock. - - # otherwise, gracefully end script using the "pass" Python keyword -if __name__ == "__main__": - main() ----- - -First, make a copy of the script in your `$HOME` directory: - -[source,bash] -cp /class/datamine/data/scraping/is_in_stock.py $HOME/is_in_stock.py -``` - -If you now look in the "Files" tab in the lower right hand corner of RStudio, and click the refresh button, you should see the file `is_in_stock.py`. You can open and modify this file directly in RStudio. 
Before you do so, however, change the permissions of the `$HOME/is_in_stock.py` script so only YOU can read, write, and execute it: - -[source,bash] ----- -chmod 700 $HOME/is_in_stock.py ----- - -The script should now appear in RStudio, in your home directory, with the correct permissions. Open the script (in RStudio) and fill in the `main` function as indicated by the comments. We want the script to scrape to see whether the pants from question 3 are in stock or not. - -A cron job is a task that runs at a certain interval. Create a cron job that runs your script, `/class/datamine/apps/python/f2020-s2021/env/bin/python $HOME/is_in_stock.py` every 5 minutes. Wait 10-15 minutes and verify that it is working properly. The long path, `/class/datamine/apps/python/f2020-s2021/env/bin/python` simply makes sure that our script is run with access to all of the packages in our course environment. `$HOME/is_in_stock.py` is the path to your script (`$HOME` expands or transforms to `/home/`). - -++++ - -++++ - -++++ - -++++ - -[TIP] -==== -If you struggle to use the text editor used with the `crontab -e` command, be sure to continue reading the cron section of the book. We highlight another method that may be easier. -==== - -[TIP] -==== -Don't forget to copy your import statements from question (3) as well. -==== - -[IMPORTANT] -==== -Once you are finished with the project, if you no longer wish to receive emails every so often, follow the instructions here to remove the cron job. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- The content of your cron job in a bash code chunk. -- The content of your `is_in_stock.py` script. -==== - -=== Question 6 - -Take a look at the byline of each pair of pants (the sentences starting with "Perfect for..."). Inspect the HTML. Try and scrape the text using xpath expressions like you normally would. What happens? Are you able to scrape it? Google around and come up with your best explanation of what is happening. - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -- An explanation of what is happening. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project05.adoc b/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project05.adoc deleted file mode 100644 index b390f1914..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project05.adoc +++ /dev/null @@ -1,162 +0,0 @@ -= STAT 39000: Project 5 -- Spring 2021 - -**Motivation:** One of the best things about learning to scrape data is the many applications of the skill that may pop into your mind. In this project, we want to give you some flexibility to explore your own ideas, but at the same time, add a couple of important tools to your tool set. We hope that you've learned a lot in this series, and can think of creative ways to utilize your new skills. - -**Context:** This is the last project in a series focused on scraping data. We have created a couple of very common scenarios that can be problematic when first learning to scrape data, and we want to show you how to get around them. - -**Scope:** python, web scraping, etc. - -.Learning objectives -**** -- Use the requests package to scrape a web page. -- Use the lxml/selenium package to filter and parse data from a scraped web page. -- Learn how to step around header-based filtering. -- Learn how to handle rate limiting. 
-**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -It is not uncommon to be blocked from scraping a website. There are a variety of strategies that they use to do this, and in general they work well. In general, if a company wants you to extract information from their website, they will make an API (application programming interface) available for you to use. One method (that is commonly paired with other methods) is blocking your request based on _headers_. You can read about headers https://developer.mozilla.org/en-US/docs/Glossary/Request_header[here]. In general, you can think of headers as some extra data that gives the server or client context. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers[Here] is a list of headers, and some more explanation. - -Each header has a purpose. One common header is called the https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent[User-Agent header]. A User-Agent looks something like: - ----- -User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:86.0) Gecko/20100101 Firefox/86.0 ----- - -You can see headers if you open the console in Firefox or Chrome and load a website. It will look something like: - -![](./images/headers01.png) - -From the https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent[mozilla link], this header is a string that "lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent." Basically, if you are browsing the internet with a common browser, the server will know what you are using. In the provided example, we are using Firefox 86 from Mozilla, on a Mac running Mac OS 10.16 with an Intel processor. - -When we send a request from a package like `requests` in Python, here is what the headers look like: - -[source,python] ----- -import requests -response = requests.get("https://project5-headers.tdm.wiki") -print(response.request.headers) ----- - ----- -{'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'} ----- - -As you can see our User-Agent is `python-requests/2.25.1`. You will find that many websites block requests made from anything such user agents. One such website is: https://project5-headers.tdm.wiki. - -Scrape https://project5-headers.tdm.wiki from Scholar and explain what happens. What is the response code, and what does that response code mean? Can you ascertain what you would be seeing (more or less) in a browser based on the text of the response (the actual HTML)? Read https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers[this section of the documentation for the `headers` package], and attempt to "trick" https://project5-headers.tdm.wiki into presenting you with the desired information. 
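One hedged sketch of that "trick" -- borrowing the browser-like `User-Agent` string shown above, which may or may not be the only thing the filter checks -- looks like this:

[source,python]
----
import requests

# pretend to be Firefox on a Mac rather than python-requests
my_headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:86.0) Gecko/20100101 Firefox/86.0"}
response = requests.get("https://project5-headers.tdm.wiki", headers=my_headers)

print(response.status_code)  # compare this to the code you got without the header
print(response.text)
----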
The desired information should look something like: - ----- -Hostname: c1de5faf1daa -IP: 127.0.0.1 -IP: 172.18.0.4 -RemoteAddr: 172.18.0.2:34520 -GET / HTTP/1.1 -Host: project5-headers.tdm.wiki -User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:86.0) Gecko/20100101 Firefox/86.0 -Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8 -Accept-Encoding: gzip -Accept-Language: en-US,en;q=0.5 -Cdn-Loop: cloudflare -Cf-Connecting-Ip: 107.201.65.5 -Cf-Ipcountry: US -Cf-Ray: 62289b90aa55f975-EWR -Cf-Request-Id: 084d3f8e740000f975e0038000000001 -Cf-Visitor: {"scheme":"https"} -Cookie: __cfduid=d9df5daa57fae5a4e425173aaaaacbfc91613136177 -Dnt: 1 -Sec-Gpc: 1 -Upgrade-Insecure-Requests: 1 -X-Forwarded-For: 123.123.123.123 -X-Forwarded-Host: project5-headers.tdm.wiki -X-Forwarded-Port: 443 -X-Forwarded-Proto: https -X-Forwarded-Server: 6afe64faffaf -X-Real-Ip: 123.123.123.123 ----- - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Response code received (a number), and an explanation of what that HTTP response code means. -- What you would (probably) be seeing in a browser if you were blocked. -- Python code used to "trick" the website into being scraped. -- The content of the successfully scraped site. -==== - -=== Question 2 - -Open a browser and navigate to: https://project5-rate-limit.tdm.wiki/. While at first glance, it will seem identical to https://project5-headers.tdm.wiki/, it is not. https://project5-rate-limit.tdm.wiki/ is rate limited based on IP address. Depending on when you are completing this project, this may or may not be obvious. If you refresh your browser fast enough, instead of receiving a bunch of information, you will receive text that says "Too Many Requests". - -The following function tries to scrape the `Cf-Request-Id` header which will have a unique value each request: - -[source,python] ----- -import requests -import lxml.html -def scrape_cf_request_id(url): - resp = requests.get(url) - tree = lxml.html.fromstring(resp.text) - content = tree.xpath("//p")[0].text.split('\n') - cfid = [l for l in content if 'Cf-Request-Id' in l][0].split()[1] - return cfid ----- - -You can test it out: - -[source,python] ----- -scrape_cf_request_id("https://project5-rate-limit.tdm.wiki") ----- - -Write code to scrape 10 unique `Cf-Request-Id`s (in a loop), and save them to a list called `my_ids`. What happens when you run the code? This is caused by our expected text not being present. Instead text with "Too Many Requests" is. While normally this error would be something that makes more sense, like an HTTPError or a Timeout Exception, it _could_ be anything, depending on your code. - -One solution that might come to mind is to "wait" between each loop using `time.sleep()`. While yes, this may work, it is not a robust solution. Other users from your IP address may count towards your rate limit and cause your function to fail, the amount of sleep time may change dynamically, or even be manually adjusted to be longer, etc. The best way to handle this is to used something called exponential backoff. - -In a nutshell, exponential backoff is a way to increase the wait time (exponentially) until an acceptable rate is found. https://pypi.org/project/backoff/[`backoff`] is an excellent package to do just that. `backoff`, upon being triggered from a specified error or exception, will wait to "try again" until a certain amount of time has passed. 
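As a generic illustration of the pattern (not the full solution -- the exception you retry on depends on how your own code fails; with the provided function, the empty list comprehension surfaces as an `IndexError`), a decorated wrapper might look like:

[source,python]
----
import backoff

# assumes scrape_cf_request_id from above is already defined;
# retry with exponentially increasing waits, up to 10 attempts,
# whenever the decorated function raises an IndexError
@backoff.on_exception(backoff.expo, IndexError, max_tries=10)
def scrape_with_backoff(url):
    return scrape_cf_request_id(url)
----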
Upon receving the same error or exception, the time to wait will increase exponentially. Use `backoff` to modify the provided `scrape_cf_request_id` function to use exponential backoff when the we alluded to occurs. Test out the modified function in a loop and print the resulting 10 `Cf-Request-Id`s. - -++++ - -++++ - -[NOTE] -==== -`backoff` utilizes decorators. For those interested in learning about decorators, https://realpython.com/primer-on-python-decorators/[this] is an excellent article. -==== - -.Items to submit -==== -- Python code used to solve the problem. -- What happens when you run the function 10 times in a row? -- Fixed code that will work regardless of the rate limiting. -- 10 unique `Cf-Request-Id`s printed. -==== - -=== Question 3 - -You now have a great set of tools to be able to scrape pretty much anything you want from the internet. Now all that is left to do is practice. Find a course appropriate website containing data you would like to scrape. Utilize the tools you've learned about to scrape at least 100 "units" of data. A "unit" is just a representation of what you are scraping. For example, a unit could be a tweet from Twitter, a basketball player's statistics from sportsreference, a product from Amazon, a blog post from your favorite blogger, etc. - -The hard requirements are: - -- Documented code with thorough comments explaining what the code does. -- At least 100 "units" scraped. -- The data must be from multiple web pages. -- Write at least 1 function (with a docstring) to help you scrape. -- A clear explanation of what your scraper scrapes, challenges you encountered (if any) and how you overcame them, and a sample of your data printed out (for example a `head` of a pandas dataframe containing the data). - -.Items to submit -==== -- Python code that scrapes 100 unites of data (with thorough comments explaining what the code does). -- The data must be from more than a single web page. -- 1 or more functions (with docstrings) used to help you scrape/parse data. -- Clear documentation and explanation of what your scraper scrapes, challenges you encountered (if any) and how you overcame them, and a sample of your data printed out (for example using the `head` of a dataframe containing the data). -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project06.adoc b/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project06.adoc deleted file mode 100644 index 5fe14b2b8..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project06.adoc +++ /dev/null @@ -1,43 +0,0 @@ -= STAT 39000: Project 6 -- Spring 2021 - -**Motivation:** Being able to analyze and create good visualizations is a skill that is invaluable in _many_ fields. It can be pretty fun too! In this project, you can pick and choose if a couple of different plotting projects. - -**Context:** We've been working hard all semester, learning a lot about web scraping. In this project, you are given the choice between a project designed to go through some `matplotlib` basics, and a project that has you replicate plots from a book using `plotly` (an interactive plotting package) inside a Jupyter Notebook (which you would submit instead of an RMarkdown file). - -**Scope:** python, visualizing data - -.Learning objectives -**** -- Demostrate the ability to create basic graphs with default settings. -- Demonstrate the ability to modify axes labels and titles. 
-- Demonstrate the ability to customize a plot (color, shape/linetype). -**** - -++++ - -++++ - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Option 1 - -xref:29000-s2021-project06.adoc[Our 29000 project (that may be too familiar to you by now)] - -== Option 2 - -https://github.com/oscarperpinan/bookvis/tree/master/figs[Here] are a variety of interesting graphics from the popular book https://oscarperpinan.github.io/spacetime-vis/[Displaying time series, spatial and space-time data with R] by Oscar Perpinan Lamigueiro. You can replicate the graphics using data found https://github.com/oscarperpinan/bookvis/tree/master/data[here]. - -Choose 3 graphics from the book to replicate using `plotly`. The replications do not need to be perfect -- a strong effort to get as close as possible is fine. Feel free to change colors as you please. If you have the desire to improve the graphic, please feel free to do so and explain how it is an improvement. - -Use https://notebook.scholar.rcac.purdue.edu and the f2020-s2021 kernel to complete this project. The _only_ thing you need to submit for this project is the downloaded .ipynb file. Make sure that the grader will be able to click "run all" (using the same kernel, f2020-s2021), and have everything run properly. - -[IMPORTANT] -==== -The object of this project is to challenge yourself (as much as you want), learn about and mess around with `plotly`, and be creative. If you have an idea for a cool plot, graphic, or modification, please include it! -==== - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project07.adoc b/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project07.adoc deleted file mode 100644 index f53451b55..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project07.adoc +++ /dev/null @@ -1,134 +0,0 @@ -= STAT 39000: Project 7 -- Spring 2021 - -**Motivation:** Being able to analyze and create good visualizations is a skill that is invaluable in _many_ fields. It can be pretty fun too! As you probably noticed in the previous project, `matplotlib` can be finicky -- certain types of plots are really easy to create, while others are not. For example, you would think changing the color of a boxplot would be easy to do in `matplotlib`, perhaps we just need to add an option to the function call. As it turns out, this isn't so straightforward (as illustrated at the end of xref:programming-languages:python:matplotlib.adoc[`matplotlib` section]). Occasionally this will happen and that is when packages like `seaborn` or `plotnine` (both are packages built using `matplotlib`) can be good. In this project we will explore this a little bit, and learn about some useful `pandas` functions to help shape your data in a format that any given package requires. - -**Context:** In the next project, we will continue to learn about and become comfortable using `matplotlib`, `seaborn`, and `plotnine`. - -**Scope:** python, visualizing data - -.Learning objectives -**** -- Demonstrate the ability to create basic graphs with default settings. -- Demonstrate the ability to modify axes labels and titles. -- Demonstrate the ability to customize a plot (color, shape/linetype). 
-**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/apple/health/watch_dump.xml` - -== Questions - -=== Question 1 - -In an earlier project we explored some XML data in the form of an Apple Watch data dump. Most health-related apps give you some sort of graph or set of graphs as an output. Use any package you want to parse the XML data. There are a lot of `Records` in this dataset. Each `Record` has an attribute called `creationDate`. Create a barplot of the number of `Records` per day. Make sure your plot is polished, containing proper labels and good colors. - -[TIP] -==== -You could start by parsing out the required data into a `pandas` dataframe or series. -==== - -[TIP] -==== -The `groupby` method is one of the most useful `pandas` methods. It allows you to quickly perform operations on groups of data. -==== - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code (including the graphic). -==== - -=== Question 2 - -The plot in question 1 should look bimodal. Let's focus only on the first apparent group of readings. Create a new dataframe containing only the readings for the time period from 9/1/2017 to 5/31/2019. How many `Records` are there in that time period? - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - - -=== Question 3 - -It is hard to discern weekly patterns (if any) based on the graphics created so far. For the period of time in question 2, create a labeled bar plot for the count of `Record`s by day of the week. What (if any) discernable patterns are there? Make sure to include the labels provided below: - -[source,python] ----- -labels = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] ----- - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code (including the graphic). -==== - -=== Question 4 - -Create a `pandas` dataframe containing the following data from `watch_dump.xml`: - -- A column called `bpm` with the `bpm` (beats per minute) of the `InstantaneousBeatsPerMinute`. -- A column called `time` with the `time` of each individual `bpm` reading in `InstantaneousBeatsPerMinute`. -- A column called `date` with the date. -- A column called `dayofweek` with the day of the week. - -[TIP] -==== -You may want to use `pd.to_numeric` to convert the `bpm` column to a numeric type. -==== - -[TIP] -==== -This is one way to convert the numbers 0-6 to days of the week: - -[source,python] ----- -myDF['dayofweek'] = myDF['dayofweek'].map({0:"Mon", 1:"Tue", 2:"Wed", 3:"Thu", 4:"Fri", 5: "Sat", 6: "Sun"}) ----- -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -Create a heatmap using `seaborn`, where the y-axis shows the day of the week ("Mon" - "Sun"), the x-axis shows the hour, and the values on the interior of the plot are the average `bpm` by hour by day of the week. - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code (including the graphic). 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project08.adoc b/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project08.adoc deleted file mode 100644 index dbbbd0a50..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project08.adoc +++ /dev/null @@ -1,452 +0,0 @@ -= STAT 39000: Project 8 -- Spring 2021 - -**Motivation:** Python is an https://www.geeksforgeeks.org/internal-working-of-python/[interpreted language] (as opposed to a compiled language). In a compiled language, you are (mostly) unable to run and evaluate a single instruction at a time. In Python (and R -- also an interpreted language), we can run and evaluate a line of code easily using a https://en.wikipedia.org/wiki/Read-eval-print_loop[repl]. In fact, this is the way you've been using Python to date -- selecting and running pieces of Python code. Other ways to use Python include creating a package (like numpy, pandas, and pytorch), and creating scripts. You can create powerful CLI's (command line interface) tools using Python. In this project, we will explore this in detail and learn how to create scripts that accept options and input and perform tasks. - -**Context:** This is the first (of two) projects where we will learn about creating and using Python scripts. - -**Scope:** python - -.Learning objectives -**** -- Write a python script that accepts user inputs and returns something useful. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -Often times the deliverable part of a project isn't custom built packages or modules, but a script. A script is a .py file with python code written inside to perform action(s). Python scripts are incredibly easy to run, for example, if you had a python script called `question01.py`, you could run it by opening a terminal and typing: - -[source,bash] ----- -python3 /path/to/question01.py ----- - -The python interpreter then looks for the scripts entrypoint, and starts executing. You should read https://realpython.com/python-main-function/[this] article about the main function and python scripts. In addition, read https://realpython.com/run-python-scripts/#using-the-script-filename[this] section, paying special attention to the shebang. - -Create a Python script called `question01.py` in your `$HOME` directory. Use the second shebang from the article: `#!/usr/bin/env python3`. When run, `question01.py` should use the `sys` package to print the location of the interpreter being used to run the script. For example, if we started a Python interpreter in RStudio using the following code: - -[source,r] ----- -datamine_py() -reticulate::repl_python() ----- - -Then, we could print the interpreter by running the following Python code one line at a time: - -[source,python] ----- -import sys -print(sys.executable) ----- - -Since we are using our Python environment, you should see this result: `/class/datamine/apps/python/f2020-s2021/env/bin/python3`. This is the fully qualified path of the Python interpreter we've been using for this course. - -Restart your R session by clicking `Session > Restart R`, navigate to the "Terminal" tab in RStudio, and run the following lines in the terminal. What is the output? 
- -[source,bash] ----- -# this command gives execute permissions to your script -- this only needs to be run once -chmod +x $HOME/question01.py -# execute your script -$HOME/question01.py ----- - -[IMPORTANT] -==== -You can run bash code using a bash code chunk just like you would an R or Python code chunk. Simply replace "python" with "bash" in the code chunk options. -==== - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- The entire `question01.py` script's contents in a Python code chunk with chunk option "eval=F". -- Output from running your code copy and pasted as text. -==== - -=== Question 2 - -Was your output in question (1) expected? Why or why not? - -When we restarted the R session, our `datamine_py`'s effects were reversed, and the default Python interpreter is no longer our default when running `python3`. It is very common to have a multitude of Python environments available to use. But, when we are running a Python script it is _not_ convenient to have to run various commands (in our case, the single `datamine_py` command) in order to get our script to run the way we want it to run. In addition, if our script used a set of packages that were not installed outside of our course environment, the script would fail. - -In this project, since our focus is more on how to write scripts and make them work as expected, we will have some fun and experiment with some pre-trained state of the art machine learning models. - -The following function accepts a string called `sentence` as an input and returns the sentiment of the sentence, "POSITIVE" or "NEGATIVE". - -[source,python] ----- -from transformers import pipeline -def get_sentiment(model, sentence: str) -> str: - result = model(sentence) - - return result[0].get('label') -model = pipeline('sentiment-analysis') -print(get_sentiment(model, 'This is really great!')) -print(get_sentiment(model, 'Oh no! Thats horrible!')) ----- - -Include `get_sentiment` (including the import statement) in a new script, `question02.py` script. Note that you do not have to _use_ `get_sentiment` anywhere, just include it for now. Go to the terminal in RStudio and execute your script. What happens? - -Remember, since our current shebang is `#!/usr/bin/env python3`, if our script uses one or more packages that are not installed in the current environment environment, the script will fail. This is what is happening. The `transformers` package that we use is not installed in the current environment. We do, however, have an environment that _does_ have it installed, and it is located on Scholar at: `/class/datamine/apps/python/pytorch2021/env/bin/python`. Update the script's shebang and try to run it again. Does it work now? - -Depending on the state of your current environment, the original shebang, `#!/usr/bin/env python3` will use the same Python interpreter and environment that is currently set to `python3` (run `which python3` to see). If you haven't run `datamine_py`, this will be something like: `/apps/spack/scholar/fall20/apps/anaconda/2020.11-py38-gcc-4.8.5-djkvkvk/bin/python` or `/usr/bin/python`, if you _have_ run `datamine_py`, this will be: `/class/datamine/apps/python/f2020-s2021/env/bin/python`. _Both_ environments lack the `transformers` package. Our other environment whose interpreter lives here: `/class/datamine/apps/python/pytorch2021/env/bin/python` _does_ have this package. The shebang is then critically important for any scripts that want to utilize packages from a specific environment. 
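To make that concrete, here is one possible shape for `question02.py` at this point -- a sketch only, using the interpreter path given above as the shebang and the provided `get_sentiment` function:

[source,python]
----
#!/class/datamine/apps/python/pytorch2021/env/bin/python
from transformers import pipeline


def get_sentiment(model, sentence: str) -> str:
    result = model(sentence)
    return result[0].get('label')


def main():
    # load the pre-trained sentiment model once, then reuse it
    model = pipeline('sentiment-analysis')
    print(get_sentiment(model, 'This is really great!'))
    print(get_sentiment(model, 'Oh no! Thats horrible!'))


if __name__ == "__main__":
    main()
----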
- -[IMPORTANT] -==== -You can run bash code using a bash code chunk just like you would an R or Python code chunk. Simply replace "python" with "bash" in the code chunk options. -==== - -++++ - -++++ - -.Items to submit -==== -- Sentence explaining why or why not the output from question (1) was expected. -- Sentence explaining what happens when you include `get_sentiment` in your script and try to execute it. -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". -==== - -=== Question 3 - -Okay, great. We now understand that if we want to use packages from a specific environment, we need to modify our shebang accordingly. As it currently stands, our script is pretty useless. Modify the script, in a new script called `question03.py` to accept a single argument. This argument should be a sentence. Your script should then print the sentence, and whether or not the sentence is "POSITIVE" or "NEGATIVE". Use `sys.argv` to accomplish this. Make sure the script functions in the following way: - -[source,bash] ----- -$HOME/question03.py This is a happy sentence, yay! ----- - ----- -Too many arguments. ----- - -[source,bash] ----- -$HOME/question03.py 'This is a happy sentence, yay!' ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE ----- - -[source,bash] ----- -$HOME/question03.py ----- - ----- -./question03.py requires at least 1 argument, "sentence". ----- - -[TIP] -==== -One really useful way to exit the script and print a message is like this: - -[source,python] ----- -import sys -sys.exit(f"{__file__} requires at least 1 argument, 'sentence'") ----- -==== - -[IMPORTANT] -==== -You can run bash code using a bash code chunk just like you would an R or Python code chunk. Simply replace "python" with "bash" in the code chunk options. -==== - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". -- Output from running your script with the given examples. -==== - -=== Question 4 - -If you look at the man pages for a command line tool like `awk` or `grep` (you can get these by running `man awk` or `man grep` in the terminal), you will see that typically CLI's have a variety of options. Options usually follow the following format: - -[source,bash] ----- -grep -i 'ok' some_file.txt ----- - -However, often times you have 2 ways you can use an option -- either with the short form (for example `-i`), or long form (for example `-i` is the same as `--ignore-case`). Sometimes options can get values. If options don't have values, you can assume that the presence of the flag means `TRUE` and the lack means `FALSE`. When using short form, the value for the option is separated by a space (for example `grep -f my_file.txt`). When using long form, the value for the option is separated by an equals sign (for example `grep --file=my_file.txt`). - -Modify your script (as a new `question04.py`) to include an option called `score`. When active (`question04.py --score` or `question04.py -s`), the script should return both the sentiment, "POSITIVE" or "NEGATIVE" and the probability of being accurate. Make sure that you modify your checks from question 3 to continue to work whenever we use `--score` or `-s`. Some examples below: - -[source,bash] ----- -$HOME/question04.py 'This is a happy sentence, yay!' ----- - ----- -Our sentence is: This is a happy sentence, yay! 
-POSITIVE ----- - -[source,bash] ----- -$HOME/question04.py --score 'This is a happy sentence, yay!' ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question04.py -s 'This is a happy sentence, yay!' ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question04.py 'This is a happy sentence, yay!' -s ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question04.py 'This is a happy sentence, yay!' --score ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question04.py 'This is a happy sentence, yay!' --value ----- - ----- -Unknown option(s): ['--value'] ----- - -[source,bash] ----- -$HOME/question04.py 'This is a happy sentence, yay!' --value --score ----- - ----- -Too many arguments. ----- - -[source,bash] ----- -$HOME/question04.py ----- - ----- -question04.py requires at least 1 argument, "sentence" ----- - -[source,bash] ----- -$HOME/question04.py --score ----- - ----- -./question04.py requires at least 1 argument, "sentence". No sentence provided. ----- - -[source,bash] ----- -$HOME/question04.py 'This is one sentence' 'This is another' ----- - ----- -./question04.py requires only 1 sentence, but 2 were provided. ----- - -[IMPORTANT] -==== -You can run bash code using a bash code chunk just like you would an R or Python code chunk. Simply replace "python" with "bash" in the code chunk options. -==== - -[TIP] -==== -Experiment with the provided function. You will find the probability of being accurate is already returned by the model. -==== - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". -- Output from running your script with the given examples. -==== - -=== Question 5 - -Wow, that is an extensive amount of logic for for a single option. Luckily, Python has the `argparse` package to help you build CLI's and handle situations like this. You can find the documentation for argparse https://docs.python.org/3/library/argparse.html[here] and a nice little tutorial https://docs.python.org/3/howto/argparse.html[here]. Update your script (as a new `question05.py`) using `argparse` instead of custom logic. Specifically, add 1 positional argument called "sentence", and 1 optional argument "--score" or "-s". You should handle the following scenarios: - -[source,bash] ----- -$HOME/question05.py 'This is a happy sentence, yay!' ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE ----- - -[source,bash] ----- -$HOME/question05.py --score 'This is a happy sentence, yay!' ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question05.py -s 'This is a happy sentence, yay!' ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question05.py 'This is a happy sentence, yay!' -s ----- - ----- -Our sentence is: This is a happy sentence, yay! -POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question05.py 'This is a happy sentence, yay!' --score ----- - ----- -Our sentence is: This is a happy sentence, yay! 
-POSITIVE: 0.999848484992981 ----- - -[source,bash] ----- -$HOME/question05.py 'This is a happy sentence, yay!' --value ----- - ----- -usage: question05.py [-h] [-s] sentence -question05.py: error: unrecognized arguments: --value ----- - -[source,bash] ----- -$HOME/question05.py 'This is a happy sentence, yay!' --value --score ----- - ----- -usage: question05.py [-h] [-s] sentence -question05.py: error: unrecognized arguments: --value ----- - -[source,bash] ----- -$HOME/question05.py ----- - ----- -usage: question05.py [-h] [-s] sentence -positional arguments: - sentence -optional arguments: - -h, --help show this help message and exit - -s, --score display the probability of accuracy ----- - -[source,bash] ----- -$HOME/question05.py --score ----- - ----- -usage: question05.py [-h] [-s] sentence -question05.py: error: too few arguments ----- - -[source,bash] ----- -$HOME/question05.py 'This is one sentence' 'This is another' ----- - ----- -usage: question05.py [-h] [-s] sentence -question05.py: error: unrecognized arguments: This is another ----- - -[TIP] -==== -A good way to print the help information if no arguments are provided is: - -[source,python] ----- -if len(sys.argv) == 1: - parser.print_help() - parser.exit() ----- -==== - -[IMPORTANT] -==== -Include the bash code chunk option `error=T` to enable RMarkdown to knit and output errors. -==== - -[IMPORTANT] -==== -You can run bash code using a bash code chunk just like you would an R or Python code chunk. Simply replace "python" with "bash" in the code chunk options. -==== - -++++ - -++++ - -.Items to submit -==== -- Python code used to solve the problem. -- Output from running your code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project09.adoc b/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project09.adoc deleted file mode 100644 index 973a6b0a8..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project09.adoc +++ /dev/null @@ -1,318 +0,0 @@ -= STAT 39000: Project 9 -- Spring 2021 - -**Motivation:** In the previous project you worked through some common logic needed to make a good script. By the end of the project `argparse` was (hopefully) a welcome package to be able to use. In this project, we are going to continue to learn about `argparse` and create a CLI for the https://data.whin.org[WHIN Data Portal]. In doing so, not only will we get to practice using `argparse`, but you will also get to learn about using an API to retrieve data. An API (application programming interface) is a common way to retrieve structured data from a company or resource. It is common for large companies like Twitter, Facebook, Google, etc. to make certain data available via API's, so it is important to get some exposure. - -**Context:** This is the second (of two) projects where we will learn about creating and using Python scripts. - -**Scope:** python - -.Learning objectives -**** -- Write a python script that accepts user inputs and returns something useful. -- Interact with an API to retrieve data. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will involve retrieving data using an API. Instructions and hints will be provided as we go. 
- -== Questions - -=== Question 1 - -WHIN (Wabash Heartland Innovation Network) has deployed hundreds of weather stations across the region so farmers can use the data collected to become more efficient, save time, and increase yields. WHIN has kindly granted access to 20+ public-facing weather stations for educational purposes. - -Navigate to https://data.whin.org/data/current-conditions, and click on the "CREATE ACCOUNT" button in the middle of the screen: - -![](./images/p9_01.png) - -Click on "I'm a student or educator": - -![](./images/p9_02.png) - -Enter your information. For "School or Organization" please enter "Purdue University". For "Class or project", please put "The Data Mine Project 9". For the description, please put "We are learning about writing scripts by writing a CLI to fetch data from the WHIN API." Please use your purdue.edu email address. Once complete, click "Next". - -Carefully read the LICENSE TERMS before accepting, and confirm your email address if needed. Upon completion, navigate here: https://data.whin.org/data/current-conditions - -Read about the API under "API Usage". An endpoint is the place (in this case the end of a URL (which can be referred to as the URI)) that you can use to access/delete/update/etc. a given resource depending on the HTTP method used. What are the 3 endpoints of this API? - -Write and run a script called `question01.py` that, when run, tries to print the current listing of the weather stations. Instead of printing what you think it should print, it will print something else. What happened? - -[source,bash] ----- -$HOME/question01.py ----- - -[TIP] -==== -You can use the `requests` library to run the HTTP GET method on the endpoint. For example: -==== - -[source,python] ----- -import requests -response = requests.get("https://datamine.purdue.edu/") -print(response.json()) ----- - -[TIP] -==== -We want to use our regular course environment, therefore, make sure to use the following shebang: `#!/class/datamine/apps/python/f2020-s2021/env/bin/python` -==== - -++++ - -++++ - -.Items to submit -==== -- List the 3 endpoints for this API. -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". -- Output from running your script with the given examples. -==== - -=== Question 2 - -In question 1, we quickly realize that we are missing a critical step -- authentication! Recall that authentication is the process of a system understanding _who_ a person is, authorization is the process of telling whether or not somebody (or something) has permissions to access/modify/delete/etc. a resource. When we make GET requests to https://data.whin.org/api/weather/stations, or any other endpoint, the server that returns the data we are trying to access has no clue who we are, which explains the result from question 1. - -While there are many methods of authentication, WHIN is using Bearer tokens. Navigate https://data.whin.org/account[here]. Take a look at your account info. You should see a large chunk of random numbers and text. This is your bearer token that you can use for authentication. The bearer token is to be sent in the "Authorization" header of the request. For example: - -[source,python] ----- -import requests -my_headers = {"Authorization": "Bearer LDFKGHSOIDFRUTRLKJNXDFGT"} -response = requests.get("my_url", headers = my_headers) ----- - -Update your script (as a new script called `question02.py`), and test it out again to see if we get the expected results now. 
`question02.py` should only print the first 5 results. - -A couple important notes: - -- The bearer token should be taken care of like a password. You do NOT want to share this, ever. -- There is an inherent risk in saving code like the code shown above. What if you accidentally upload it to GitHub? Then anyone with access could potentially read and use your token. - -How can we include the token in our code without typing it in our code? The typical way to handle this is to use environment variables and/or a file containing the information that is specifically NOT shared unless necessary. For example, create a file called `.env` in your home directory, with the following contents: - -[source,txt] ----- -MY_BEARER_TOKEN=aslgdkjn304iunglsejkrht09 -SOME_OTHER_VARIABLE=some_other_value ----- - -In this file, replace the "aslgdkj..." part with you actual token and save the file. Then make sure only YOU can read and write to this file by running the following in a terminal: - -[source,bash] ----- -chmod 600 $HOME/.env ----- - -Now, we can use a package called `dotenv` to load the variables in the `$HOME/.env` file into the environment. We can then use the `os` package to get the environment variables. For example: - -[source,python] ----- -import os -from dotenv import load_dotenv -# This function will load the .env file variables from the same directory as the script into the environment -load_dotenv() -# We can now use os.getenv to get the important information without showing anything. -# Now, all anybody reading the code sees is "os.getenv('MY_BEARER_TOKEN')" even though that is replaced by the actual -# token when the code is run, cool! -my_headers = {"Authorization": f"Bearer {os.getenv('MY_BEARER_TOKEN')}"} ----- - -Update `question02.py` to use `dotenv` and `os.getenv` to get the token from the local `$HOME/.env` file. Test out your script: - -[source,bash] ----- -$HOME/question02.py ----- - -++++ - -++++ - -++++ - -++++ - -.Items to submit -==== -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". -- Output from running your script with the given example. -==== - -=== Question 3 - -That's not so bad! We now know how to retrieve data from the API as well as load up variables from our environment rather than insecurely just pasting them in our code, great! - -A query parameter is (more or less) some extra information added at the end of the endpoint. For example, the following url has a query parameter called `param` and value called `value`: \https://example.com/some_resource?param=value. You could even add more than one query parameter as follows: \https://example.com/some_resource?param=value&second_param=second_value -- as you can see, now we have another parameter called `second_param` with a value of `second_value`. While the query parameters begin with a `?`, each subsequent parameter is added using `&`. - -Query parameters can be optional or required. API's will sometimes utilize query parameters to filter or fine-tune the returned results. Look at the documentation for the `/api/weather/station-daily` endpoint. Use your newfound knowledge of query parameters to update your script (as a new script called `question03.py`) to retrieve the data for station with id `150` on `2021-01-05`, and print the first 5 results. 
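For reference, the `params` argument of `requests.get` builds the query string for you (appending `?...&...` to the URL by hand works just as well). A sketch -- assuming the endpoint lives under the same https://data.whin.org base as the stations endpoint, and guessing `station_id` and `date` as the parameter names (check the API documentation for the real ones):

[source,python]
----
import os

import requests
from dotenv import load_dotenv

load_dotenv()
my_headers = {"Authorization": f"Bearer {os.getenv('MY_BEARER_TOKEN')}"}

response = requests.get(
    "https://data.whin.org/api/weather/station-daily",
    headers=my_headers,
    params={"station_id": 150, "date": "2021-01-05"},  # guessed parameter names
)

results = response.json()
print(results[:5])  # assumes the endpoint returns a JSON list
----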
Test out your script: - -[source,bash] ----- -$HOME/question03.py ----- - -++++ - -++++ - -.Items to submit -==== -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". -- Output from running your script with the given example. -==== - -=== Question 4 - -Excellent, now let's build our CLI. Call the script `whin.py`. Use your knowledge of `requests`, `argparse`, and API's to write a CLI that replicates the behavior shown below. For convenience, only print the first 2 results for all output. - -[TIP] -==== -- In general, there will be 3 commands: `stations`, `daily`, and `cc` (for current condition). -- You will want to create a subparser for each command: `stations_parser`, `current_conditions_parser`, and `daily_parser`. -- The `daily_parser` will have 2 _position_, _required_ arguments: `station_id` and `date`. -- The `current_conditions_parser` will have 2 _optional_ arguments of type `str`: `--center`/`-c` and `--radius`/`-r`. -- If only one of `--center` or `--radius` is present, you should use `sys.exit` to print a message saying "Need both center AND radius, or neither.". -- To create a subparser, just do the following: - -[source,python] ----- -parser = argparse.ArgumentParser() -subparsers = parser.add_subparsers(help="possible commands", dest="command") -my_subparser = subparsers.add_parser("my_command", help="my help message") -my_subparser.add_argument("--my-option", type=str, help="some option") -args = parser.parse_args() ----- - -- Then, you can access which command was run with `args.command` (which in this case would only have 1 possible value of `my_command`), and access any parser or subparsers options with `args`, for example, `args.my_option`. -==== - -[source,bash] ----- -$HOME/whin.py ----- ----- -usage: whin.py [-h] {stations,cc,daily} ... 
-positional arguments: - {stations,cc,daily} possible commands - stations list the stations - cc list the most recent data from each weather station - daily list data from a given day and station -optional arguments: - -h, --help show this help message and exit ----- - -[TIP] -==== -A good way to print the help information if no arguments are provided is: - -[source,python] ----- -if len(sys.argv) == 1: - parser.print_help() - parser.exit() ----- -==== - -[source,bash] ----- -$HOME/whin.py stations -h ----- ----- -usage: whin.py stations [-h] -optional arguments: - -h, --help show this help message and exit ----- - -[source,bash] ----- -$HOME/whin.py cc -h ----- ----- -usage: whin.py cc [-h] [-c CENTER] [-r RADIUS] -optional arguments: - -h, --help show this help message and exit - -c CENTER, --center CENTER - return results near this center coordinate, given as a - latitude,longitude pair - -r RADIUS, --radius RADIUS - search distance, in meters, from the center ----- - -[source,bash] ----- -$HOME/whin.py cc ----- ----- -[{'humidity': 90, 'latitude': 40.93894, 'longitude': -86.47418, 'name': 'WHIN001-PULA001', 'observation_time': '2021-03-16T18:45:00Z', 'pressure': '30.051', 'rain': '0', 'rain_inches_last_hour': '0', 'soil_moist_1': 6, 'soil_moist_2': 11, 'soil_moist_3': 14, 'soil_moist_4': 9, 'soil_temp_1': 42, 'soil_temp_2': 40, 'soil_temp_3': 40, 'soil_temp_4': 41, 'solar_radiation': 203, 'solar_radiation_high': 244, 'station_id': 1, 'temperature': 40, 'temperature_high': 40, 'temperature_low': 40, 'wind_direction_degrees': '337.5', 'wind_gust_direction_degrees': '22.5', 'wind_gust_speed_mph': 6, 'wind_speed_mph': 3}, {'humidity': 88, 'latitude': 40.73083, 'longitude': -86.98467, 'name': 'WHIN003-WHIT001', 'observation_time': '2021-03-16T18:45:00Z', 'pressure': '30.051', 'rain': '0', 'rain_inches_last_hour': '0', 'soil_moist_1': 6, 'soil_moist_2': 5, 'soil_moist_3': 6, 'soil_moist_4': 4, 'soil_temp_1': 40, 'soil_temp_2': 39, 'soil_temp_3': 39, 'soil_temp_4': 40, 'solar_radiation': 156, 'solar_radiation_high': 171, 'station_id': 3, 'temperature': 40, 'temperature_high': 40, 'temperature_low': 39, 'wind_direction_degrees': '337.5', 'wind_gust_direction_degrees': '337.5', 'wind_gust_speed_mph': 8, 'wind_speed_mph': 3}] ----- - -[IMPORTANT] -==== -Your values may be different because they are _current_ conditions. -==== - -[source,bash] ----- -$HOME/whin.py cc --radius=10000 ----- ----- -Need both center AND radius, or neither. ----- - -[source,bash] ----- -$HOME/whin.py cc --center=40.4258686,-86.9080654 ----- ----- -Need both center AND radius, or neither. 
----- - -[source,bash] ----- -$HOME/whin.py cc --center=40.4258686,-86.9080654 --radius=10000 ----- ----- -[{'humidity': 86, 'latitude': 40.42919, 'longitude': -86.84547, 'name': 'WHIN008-TIPP005 Chatham Square', 'observation_time': '2021-03-16T18:45:00Z', 'pressure': '30.012', 'rain': '0', 'rain_inches_last_hour': '0', 'soil_moist_1': 5, 'soil_moist_2': 5, 'soil_moist_3': 5, 'soil_moist_4': 5, 'soil_temp_1': 42, 'soil_temp_2': 41, 'soil_temp_3': 41, 'soil_temp_4': 42, 'solar_radiation': 191, 'solar_radiation_high': 220, 'station_id': 8, 'temperature': 42, 'temperature_high': 42, 'temperature_low': 42, 'wind_direction_degrees': '0', 'wind_gust_direction_degrees': '22.5', 'wind_gust_speed_mph': 9, 'wind_speed_mph': 3}, {'humidity': 86, 'latitude': 40.38494, 'longitude': -86.84577, 'name': 'WHIN027-TIPP003 EXT', 'observation_time': '2021-03-16T18:45:00Z', 'pressure': '29.515', 'rain': '0', 'rain_inches_last_hour': '0', 'soil_moist_1': 5, 'soil_moist_2': 4, 'soil_moist_3': 4, 'soil_moist_4': 5, 'soil_temp_1': 43, 'soil_temp_2': 42, 'soil_temp_3': 42, 'soil_temp_4': 42, 'solar_radiation': 221, 'solar_radiation_high': 244, 'station_id': 27, 'temperature': 43, 'temperature_high': 43, 'temperature_low': 43, 'wind_direction_degrees': '337.5', 'wind_gust_direction_degrees': '337.5', 'wind_gust_speed_mph': 6, 'wind_speed_mph': 3}] ----- - -[source,bash] ----- -$HOME/whin.py daily ----- ----- -usage: whin.py daily [-h] station_id date -whin.py daily: error: too few arguments ----- - -[source,bash] ----- -$HOME/whin.py daily 150 2021-01-05 ----- ----- -[{'humidity': 96, 'latitude': 41.00467, 'longitude': -86.68428, 'name': 'WHIN058-PULA007', 'observation_time': '2021-01-05T05:00:00Z', 'pressure': '29.213', 'rain': '0', 'rain_inches_last_hour': '0', 'soil_moist_1': 5, 'soil_moist_2': 6, 'soil_moist_3': 7, 'soil_moist_4': 5, 'soil_temp_1': 33, 'soil_temp_2': 34, 'soil_temp_3': 35, 'soil_temp_4': 35, 'solar_radiation': 0, 'solar_radiation_high': 0, 'station_id': 150, 'temperature': 31, 'temperature_high': 31, 'temperature_low': 31, 'wind_direction_degrees': '270', 'wind_gust_direction_degrees': '292.5', 'wind_gust_speed_mph': 13, 'wind_speed_mph': 8}, {'humidity': 96, 'latitude': 41.00467, 'longitude': -86.68428, 'name': 'WHIN058-PULA007', 'observation_time': '2021-01-05T05:15:00Z', 'pressure': '29.207', 'rain': '1', 'rain_inches_last_hour': '0', 'soil_moist_1': 5, 'soil_moist_2': 6, 'soil_moist_3': 7, 'soil_moist_4': 5, 'soil_temp_1': 33, 'soil_temp_2': 34, 'soil_temp_3': 35, 'soil_temp_4': 35, 'solar_radiation': 0, 'solar_radiation_high': 0, 'station_id': 150, 'temperature': 31, 'temperature_high': 31, 'temperature_low': 31, 'wind_direction_degrees': '270', 'wind_gust_direction_degrees': '292.5', 'wind_gust_speed_mph': 14, 'wind_speed_mph': 9}] ----- - -++++ - -++++ - -.Items to submit -==== -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". -- Output from running your script with the given examples. -==== - -=== Question 5 - -There are a multitude of improvements and/or features that we could add to `whin.py`. Customize your script (as a new script called `question05.py`), to either do something new, or fix a scenario that wasn't covered in question 4. Be sure to include 1-2 sentences that explains exactly what your modification does. Demonstrate the feature by running it in a bash code chunk. - -.Items to submit -==== -- The entirety of the updated (working) script's content in a Python code chunk with chunk option "eval=F". 
-- Output from running your script with the given examples. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project10.adoc b/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project10.adoc deleted file mode 100644 index 6c290036b..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project10.adoc +++ /dev/null @@ -1,200 +0,0 @@ -= STAT 39000: Project 10 -- Spring 2021 - -**Motivation:** The use of a suite of packages referred to as the `tidyverse` is popular with many R users. It is apparent just by looking at `tidyverse` R code, that it varies greatly in style from typical R code. It is useful to gain some familiarity with this collection of packages, in case you run into a situation where these packages are needed -- you may even find that you enjoy using them! - -**Context:** We've covered a lot of ground so far this semester, and almost completely using Python. In this next series of projects we are going to switch back to R with a strong focus on the `tidyverse` (including `ggplot`) and data wrangling tasks. - -**Scope:** R, tidyverse, ggplot - -.Learning objectives -**** -- Explain the differences between regular data frames and tibbles. -- Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems. -- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows. -- Group data and calculate aggregated statistics using group_by, mutate, and transform functions. -- Demonstrate the ability to create basic graphs with default settings, in `ggplot`. -- Demonstrate the ability to modify axes labels and titles. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -The `tidyverse` consists of a variety of packages, including, but not limited to: `ggplot2`, `dplyr`, `tidyr`, `readr`, `purrr`, `tibble`, `stringr`, and `lubridate`. - -One of the underlying premises of the `tidyverse` is getting the data to be https://r4ds.had.co.nz/tidy-data.html#tidy-data-1[tidy]. You can read a lot more about this in Hadley Wickham's excellent book, https://r4ds.had.co.nz[R for Data Science]. - -There is an excellent graphic https://r4ds.had.co.nz/introduction.html#what-you-will-learn[here] that illustrates a general workflow for data science projects: - -. Import -. Tidy -. Iterate on, to gain understanding: - 1. Transform - 2. Visualize - 3. Model -. Communicate - -This is a good general outline of how a project could be organized, but depending on the project or company, this could vary greatly and change as the goals of a project change. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/okcupid/filtered/*.csv` - -== Questions - -=== Question 1 - -Let's (more or less) follow the guidelines given above. The first step is to https://r4ds.had.co.nz/data-import.html[import] the data. There are two files: `questions.csv`, and `users.csv`. Read https://r4ds.had.co.nz/data-import.html[this section], and use what you learn to read in the two files into `questions` and `users`, respectively. Which functions from the `tidyverse` did you use and why? - -[TIP] -==== -Its easy to load up the `tidyverse` packages: - -[source,r] ----- -library(tidyverse) ----- -==== - -[TIP] -==== -Just because a file has the `.csv` extension does _not_ mean that is it comma separated. 
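If a file turns out to use a different delimiter, `readr` has alternatives. The snippet below is only a sketch, and `some_file.csv` is a placeholder name (the semicolon is an example delimiter, not necessarily what this dataset uses):

[source,r]
----
library(readr)

# read_csv2() expects semicolon-separated values
myDF <- read_csv2("some_file.csv")

# read_delim() lets you name the delimiter explicitly
myDF <- read_delim("some_file.csv", delim = ";")
----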
-==== - -[TIP] -==== -Make sure to print all `tibble` after reading them in to ensure that they were read in correctly. If they were not, use a different function (from `tidyverse`) to read in the data. -==== - -[TIP] -==== -`questions` should be 2281 x 10 and `users` should be 68371 x 2284 -==== - -.Items to submit -==== -- R code used to solve the problem. -- `head` of each dataset, `users` and `questions`. -- 1 sentence explaining which functions you used (from `tidyverse`) and why. -==== - -=== Question 2 - -You may recall that the function `read.csv` from base R reads data into a data.frame by default. In the `tidyverse`, `readr` functions read the data into a `tibble` instead. Read https://r4ds.had.co.nz/tibbles.html[this section]. To summarize, some important features that are true for `tibbles` but not necessarily for data.frames are: - -- Non-syntactic variable names (surrounded by backticks \\`` ` `` ) -- Never changes the type of the inputs (for example converting strings to factors) -- More informative output from printing -- No partial matching -- Simple https://r4ds.had.co.nz/tibbles.html#subsetting[subsetting] - -Great, the next step in our outline is to make the data "tidy". Read https://r4ds.had.co.nz/tidy-data.html#tidy-data-1[this section]. Okay, let's say, for instance, that we wanted to create a `tibble` with the following columns: `user`, `question`, `question_text`, `selected_option`, `race`, `gender2`, `gender_orientation`, `n`, and `keywords`. As you can imagine, the "tidy" format, while great for analysis, would _not_ be great for storage as there would be a row for each question for each user, at least. Columns like `gender2` and `race` don't change for a user, so we end up with a lot of repeated values. - -Okay, we don't need to analyze all 68000 users at once, let's instead, take a random sample of 2200 users, and create a "tidy" `tibble` as described above. After all, we want to see why this format is useful! While trying to figure out how to do this may seem daunting at first, it is actually not _so_ bad: - -First, we convert the `users` tibble to long form, so each row represents 1 answer to 1 questions from 1 user: - -[source,r] ----- -# Add an "id" columns to the users data -users$id <- 1:nrow(users) -# To ensure we get the same random sample, run the set.seed line -# before every time you run the following line -set.seed(12345) -columns_to_pivot <- 1:2278 -users_sample_long <- users[sample(nrow(users), 2200),] %>% - mutate_at(columns_to_pivot, as.character) %>% # This converts all of our columns in columns_to_pivot to strings - pivot_longer(cols = columns_to_pivot, names_to="question", values_to = "selected_option") # The old qXXXX columns are now values in the "question" column. ----- - -Next, we want to merge our data from the `questions` tibble with our `users_sample_long` tibble, into a new table we will call `myDF`. How many rows and columns are in `myDF`? - -[source,r] ----- -myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X") ----- - -.Items to submit -==== -- R code used to solve the problem. -- The number of rows and columns in `myDF`. -- The `head` of `myDF`. -==== - -=== Question 3 - -Excellent! Now, we have a nice tidy dataset that we can work with. You may have noticed some odd syntax `%>%` in the code provided in the previous question. `%>%` is the piping operator in R added by the `magittr` package. It works pretty much just like `|` does in bash. 
It "feeds" the output from the previous bit of code to the next bit of code. It is extremely common practice to use this operator in the `tidyverse`. - -Observe the `head` of `myDF`. Notice how our `question` column has the value `d_age`, `text` has the content "Age", and `selected_option` (the column that shows the "answer" the user gave), has the actual age of the user. Wouldn't it be better if our `myDF` had a new column called `age` instead of `age` being an answer to a question? - -Modify the code provided in question 2 so `age` ends up being a column in `myDF` with the value being the actual age of the user. - -[TIP] -==== -Pay close attention to https://tidyr.tidyverse.org/reference/pivot_longer.html[`pivot_longer`]. You will need to understand what this function is doing to fix this. -==== - -[TIP] -==== -You can make a single modification to 1 line to accomplish this. Pay close attention to the `cols` option in `pivot_longer`. If you include a column in `cols` what happens? If you exclude a columns from `cols` what happens? Experiment on the following `tibble`, using different values for `cols`, as well as `names_to`, and `values_to`: - -[source,r] ----- -myDF <- tibble( - x=1:3, - y=1, - question1=c("How", "What", "Why"), - question2=c("Really", "You sure", "When"), - question3=c("Who", "Seriously", "Right now") -) ----- -==== - -.Items to submit -==== -- R code used to solve the problem. -- The number of rows and columns in `myDF`. -- The `head` of `myDF`. -==== - -=== Question 4 - -Wow! That is pretty powerful! Okay, it is clear that there are question questions, where the column starts with "q", and other questions, where the column starts with something else. Modify question (3) so all of the questions that _don't_ start with "q" have their own column in `myDF`. Like before, show the number of rows and columns for the new `myDF`, as well as print the `head`. - -.Items to submit -==== -- R code used to solve the problem. -- The number of rows and columns in `myDF`. -- The `head` of `myDF`. -==== - -=== Question 5 - -It seems like we've spent the majority of the project just wrangling our dataset -- that is normal! You'd be incredibly lucky to work in an environment where you recieve data in a nice, neat, perfect format. Let's do a couple basic operations now, to practice. - -https://dplyr.tidyverse.org/reference/mutate.html[`mutate`] is a powerful function in `dplyr`, that is not easy to mimic in Python's `pandas` package. `mutate` adds new columns to your tibble, while preserving your existing columns. It doesn't sound very powerful, but it is. - -Use mutate to create a new column called `generation`. `generation` should contain "Gen Z" for ages [0, 24], "Millenial" for ages [25-40], "Gen X" for ages [41-56], and "Boomers II" for ages [57-66], and "Older" for all other ages. - -.Items to submit -==== -- R code used to solve the problem. -- The number of rows and columns in `myDF`. -- The `head` of `myDF`. -==== - -=== Question 6 - -Use `ggplot` to create a scatterplot showing `d_age` on the x-axis, and `lf_min_age` on the y-axis. `lf_min_age` is the minimum age a user is okay dating. Color the points based on `gender2`. Add a proper title, and labels for the X and Y axes. Use `alpha=.6`. - -[NOTE] -==== -This may take quite a few minutes to create. Before creating a plot with the entire `myDF`, use `myDF[1:10,]`. If you are in a time crunch, the minimum number of points to plot to get full credit is 100, but if you wait, the plot is a bit more telling. 
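For instance, a rough way to draft the figure on a small slice first (a sketch only; add the title, axis labels, and any coloring the question asks for):

[source,r]
----
library(ggplot2)

# build the plot on the first 10 rows, then swap in the full myDF once it looks right
ggplot(myDF[1:10,]) +
  geom_point(aes(x = d_age, y = lf_min_age), alpha = .6)
----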
-==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- The plot produced. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project11.adoc b/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project11.adoc deleted file mode 100644 index df95a9f56..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project11.adoc +++ /dev/null @@ -1,206 +0,0 @@ -= STAT 39000: Project 11 -- Spring 2021 - -**Motivation:** Data wrangling is the process of gathering, cleaning, structuring, and transforming data. Data wrangling is a big part in any data driven project, and sometimes can take a great deal of time. `tidyverse` is a great, but opinionated, suite of integrated packages to wrangle, tidy and visualize data. It is useful to gain some familiarity with this collection of packages, in case you run into a situation where these packages are needed -- you may even find that you enjoy using them! - -**Context:** We have covered a few topics on the `tidyverse` packages, but there is a lot more to learn! We will continue our strong focus on the `tidyverse` (including `ggplot`) and data wrangling tasks. - -**Scope:** R, tidyverse, ggplot - -.Learning objectives -**** -- Explain the differences between regular data frames and tibbles. -- Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems. -- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows. -- Group data and calculate aggregated statistics using group_by, mutate, and transform functions. -- Demonstrate the ability to create basic graphs with default settings, in `ggplot`. -- Demonstrate the ability to modify axes labels and titles. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -The `tidyverse` consists of a variety of packages, including, but not limited to: `ggplot2`, `dplyr`, `tidyr`, `readr`, `purrr`, `tibble`, `stringr`, and `lubridate`. - -One of the underlying premises of the `tidyverse` is getting the data to be https://r4ds.had.co.nz/tidy-data.html#tidy-data-1[tidy]. You can read a lot more about this in Hadley Wickham's excellent book, https://r4ds.had.co.nz[R for Data Science]. - -There is an excellent graphic https://r4ds.had.co.nz/introduction.html#what-you-will-learn[here] that illustrates a general workflow for data science projects: - -. Import -. Tidy -. Iterate on, to gain understanding: - 1. Transform - 2. Visualize - 3. Model -. Communicate - -This is a good general outline of how a project could be organized, but depending on the project or company, this could vary greatly and change as the goals of a project change. 
- -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/okcupid/filtered/*.csv` - -== Questions - -[source,r] ----- -datamine_py() -library(tidyverse) -questions <- read_csv2("/class/datamine/data/okcupid/filtered/questions.csv") -users <- read_csv("/class/datamine/data/okcupid/filtered/users.csv") ----- - -[source,r] ----- -users$id <- 1:nrow(users) -set.seed(12345) -columns_to_pivot <- 1:2278 -users_sample_long <- users[sample(nrow(users), 2200),] %>% - mutate_at(columns_to_pivot, as.character) %>% - pivot_longer(cols = columns_to_pivot, names_to="question", values_to = "selected_option") -myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X") ----- - -[source,r] ----- -users$id <- 1:nrow(users) -set.seed(12345) -columns_to_pivot <- 1:2278 -users_sample_long <- users[sample(nrow(users), 2200),] %>% - mutate_at(columns_to_pivot, as.character) %>% - pivot_longer(cols = columns_to_pivot[-1242], names_to="question", values_to = "selected_option") -myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X") ----- - -[source,r] ----- -users$id <- 1:nrow(users) -set.seed(12345) -columns_to_pivot <- 1:2278 -users_sample_long <- users[sample(nrow(users), 2200),] %>% - mutate_at(columns_to_pivot, as.character) %>% - pivot_longer(cols = columns_to_pivot[-(which(substr(names(users), 1, 1) != "q"))], names_to="question", values_to = "selected_option") -myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X") ----- - -[source,r] ----- -myDF <- myDF %>% mutate(generation=case_when(d_age<=24 ~ "Gen Z", - between(d_age, 25, 40) ~ "Millenial", - between(d_age, 41, 56) ~ "Gen X", - between(d_age, 57, 66) ~ "Boomers II", - TRUE ~ "Other")) ----- - -[source,r] ----- -ggplot(myDF[1:100,]) + - geom_point(aes(x=d_age, y = lf_min_age, col=gender2), alpha=.6) + - labs(title="Minimum dating age by gender", x="User age", y="Minimum date age") ----- - -=== Question 1 - -Let's pick up where we left in project 10. For those who struggled with project 10, I will post the solutions above either on Saturday morning, or at the latest Monday. Re-run your code from project 10 so we, once again, have our `tibble`, `myDF`. - -At the end of project 10 we created a scatterplot showing `d_age` on the x-axis, and `lf_min_age` on the y-axis. In addition, we colored the points by `gender2`. In many cases, instead of just coloring the different dots, we may want to do the exact _same_ plot for _different_ groups. This can easily be accomplished using `ggplot`. - -*Without* splitting or filtering your data prior to creating the plots, create a graphic with plots for each `generation` where we show `d_age` on the x-axis and `lf_min_age` on the y-axis, colored by `gender2`. - -[IMPORTANT] -==== -You do not need to modify `myDF` at all. -==== - -[IMPORTANT] -==== -This may take quite a few minutes to create. Before creating a plot with the entire myDF, use myDF[1:50,]. If you are in a time crunch, the minimum number of points to plot to get full credit is 500, but if you wait, the plot is a bit more telling. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- The plot produced. -==== - -=== Question 2 - -By default, `facet_wrap` and `facet_grid` maintain the same scale for the x and y axes across the various plots. This makes it easier to compare visually. In this case, it may make it harder to see the patterns that emerge. 
Modify your code from question (1) to allow each facet to have its own x and y axis limits. - -[TIP] -==== -Look at the argument `scales` in the `facet_wrap`/`facet_grid` functions. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- The plot produced. -==== - -=== Question 3 - -Let's say we have a theory that the older generations tend to smoke more. You decided you want to create a plot that compares the percentage of smokers per `generation`. Before we do this, we need to wrangle the data a bit. - -What are the possible values of `d_smokes`? Create a new column in `myDF` called `is_smoker` that has values `TRUE`, `FALSE`, or `NA` when applicable. You will need to determine how you will assign a user as a smoker or not -- this is up to you! Explain your cutoffs. Make sure you stay in the `tidyverse` to solve this problem. - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- 1-2 sentences explaining your logic and cutoffs for the new `is_smoker` column. -- The `table` of the `is_smoker` column. -==== - -=== Question 4 - -Great! Now that we have our new `is_smoker` column, create a new `tibble` called `smokers_per_gen`. `smokers_per_gen` should be a summary of `myDF` containing the percentage of smokers per `generation`. - -[TIP] -==== -The result, `smokers_per_gen` should have 2 columns: `generation` and `percentage_of_smokers`. It should have the same number of rows as the number of `generations`. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -Create a Cleveland dot plot using `ggplot` to show the percentage of smokers for each different `generation`. Use `ggthemr` to give your plot a new look! You can choose any theme you'd like! - -Is our theory from question (3) correct? Explain why you think so, or not. - -(OPTIONAL I, 0 points) To make the plot have a more aesthetic look, consider reordering the data by percentage of smokers, or even by the age of `generation`. You can do that before passing the data using the `arrange` function, or inside the `geom_point` function, using the `reorder` function. To re-order by `generation`, you can either use brute force, or you can create a new column called `avg_age` while using `summarize`. `avg_age` should be the average age for each group (using the variable `d_age`). You can use this new column, `avg_age` to re-order the data. - -(OPTIONAL II, 0 points) Improve our plot, change the x-axis to be displayed as a percentage. You can use the `scales` package and the function `scale_x_continuous` to accomplish this. - -[TIP] -==== -Use `geom_point` **not** `geom_dotplot` to solve this problem. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- The plot produced. -- 1-2 sentences commenting on the theory, and what are your conclusions based on your plot (if any). -==== - -=== Question 6 - -Create an interesting visualization to answer a question you have regarding this dataset. Have fun playing with the different aesthetics. Make sure to modify your x-axis title, y-axis title, and title of your plot. - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- The plot produced. -- The question you are interested in answering. -- 1-2 sentences describing your plot, and the answer to your question. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project12.adoc b/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project12.adoc deleted file mode 100644 index 32908bcd5..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project12.adoc +++ /dev/null @@ -1,119 +0,0 @@ -= STAT 39000: Project 12 -- Spring 2021 - -**Motivation:** As we mentioned before, data wrangling is a big part in any data driven project. https://www.amazon.com/Exploratory-Data-Mining-Cleaning/dp/0471268518["Data Scientists spend up to 80% of the time on data cleaning and 20 percent of their time on actual data analysis."] Therefore, it is worth to spend some time mastering how to best tidy up our data. - -**Context:** We are continuing to practice using various `tidyverse` packages, in order to wrangle data. - -**Scope:** python - -.Learning objectives -**** -- Explain the differences between regular data frames and tibbles. -- Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems. -- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows. -- Group data and calculate aggregated statistics using group_by, mutate, and transform functions. -- Demonstrate the ability to create basic graphs with default settings, in `ggplot`. -- Demonstrate the ability to modify axes labels and titles. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -The first step in any data science project is to define our problem statement. In this project, our goal is to gain insights into customers' behaviours with regards to online orders and restaurant ratings. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/class/datamine/data/restaurant_recommendation/*.csv` - -== Questions - -=== Question 1 - -Load the `tidyverse` suite a packages, and read the data from files `orders.csv`, `train_customers.csv`, and `vendors.csv` into `tibble`s named `orders`, `customers`, and `vendors` respectively. - -Take a look the `tibbles` and describe in a few sentences the type of information contained in each dataset. Although the name can be self-explanatory, it is important to get an idea of what exactly we are looking at. For each combination of 2 datasets, which column would you use to join them? - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- 1-2 sentences explaining each dataset (`orders`, `customers`, and `vendors`). -- 1-2 sentences for each combination of 2 datasets describing if we could combine the datasets or not, and which column you would you use to join them. -==== - -=== Question 2 - -Let's tidy up our datasets a bit prior to joining them. For each dataset, complete the tasks below. - -- `orders`: remove columns from and between `preparationtime` to `delivered_time` (inclusive). -- `customers`: take a look at the column `dob`. Based on its values, what do you believe it was supposed to contain? Can we rely on the numbers selected? Why or why not? Based on your answer, keep the columns `akeed_customer_id`, `gender`, and `dob`, OR just `akeed_customer_id` and `gender`. -- `vendors`: take a look at columns `country_id` and `city_id`. Would they be useful to compare the vendors in our dataset? Why or why not? If not, remove the columns from the dataset. 
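One possible starting point for the column clean-up described above is `dplyr::select` with negated names or ranges; this is only a sketch, so verify the column names against your own tibbles before relying on it:

[source,r]
----
library(dplyr)

# drop every column from preparationtime through delivered_time (inclusive)
orders <- orders %>% select(-(preparationtime:delivered_time))

# drop individual columns by name, if you decide they are not useful
vendors <- vendors %>% select(-country_id, -city_id)
----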
- -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- 1-2 sentences describing what columns you kept for `vendors` and `customers` and why. -==== - -=== Question 3 - -Use your solutions from questions (1) and (2), and the join functions from tidyverse (`inner_join`, `left_join`, `right_join`, and `full_join`) to create a single `tibble` called `myDF` containing information only where all 3 `tibbles` intersect. - -For example, we do not want `myDF` to contain orders from customers that are not in `customers` tibble. Which function(s) from the tidyverse did you use to merge the datasets and why? - -[TIP] -==== -`myDF` should have 132,226 rows. -==== - -[TIP] -==== -When combining two datasets, you may want to change the argument `suffix` in the join function to specify from which dataset it came from. For example, when joining `customers` and `orders`: `*_join(customers, orders, suffix = c('_customers', '_orders'))`. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- 1-2 sentences describing which function you used, and why. -==== - -=== Question 4 - -Great, now we have a single, tidy dataset to work with. There are 2 vendor categories in myDF, `Restaurants` and `Sweets & Bakes`. We would expect there to be some differences. Let's compare them using the following variables: `deliverydistance`, `item_count`, `grand_total`, and `vendor_discount_amount`. Our end goal (by the end of question 5) is to create a histogram colored by the vendor's category (`vendor_category_en`), for each variable. - -To accomplish this easily using `ggplot`, we will take advantage of `pivot_longer`. Pivot columns `deliverydistance`, `item_count`, `grand_total`, and `vendor_discount_amount` in `myDF`. The end result should be a `tibble` with columns `variable` and `values`, which contain the name of the pivoted column (`variable`), and values of those columns (`values`) Call this modified dataset `myDF_long`. - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -Now that we have the data in the ideal format for our plot, create a histogram for each variable. Make sure to color them by vendor category (`vendor_category_en`). How do the two types of vendors compare in these 4 variables? - -[TIP] -==== -Use the argument `fill` instead of `color` in `geom_histogram`. -==== - -[TIP] -==== -You may want to add some transparency to your plot. Add it using `alpha` argument in `geom_histogram`. -==== - -[TIP] -==== -You may want to change the argument `scales` in `facet_*`. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- 2-3 sentences comparing `Restaurants` and `Sweets & Bakes` for `deliverydistance`, `item_count`, `grand_total` and `vendor_discount_amount`. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project13.adoc b/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project13.adoc deleted file mode 100644 index d61e2061a..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project13.adoc +++ /dev/null @@ -1,126 +0,0 @@ -= STAT 39000: Project 13 -- Spring 2021 - -**Motivation:** Data wrangling tasks can vary between projects. Examples include joining multiple data sources, removing data that is irrelevant to the project, handling outliers, etc. 
Although we've practiced some of these skills, it is always worth it to spend some extra time to master tidying up our data. - -**Context:** We will continue to gain familiarity with the `tidyverse` suite of packages (including `ggplot`), and data wrangling tasks. - -**Scope:** r, tidyverse - -.Learning objectives -**** -- Explain the differences between regular data frames and tibbles. -- Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems. -- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows. -- Group data and calculate aggregated statistics using group_by, mutate, and transmute functions. -- Demonstrate the ability to create basic graphs with default settings, in ggplot. -- Demonstrate the ability to modify axes labels and titles. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -`/anvil/projects/tdm/data/consumer_complaints/complaints.csv` - -== Questions - -=== Question 1 - -Read the dataset into a `tibble` named `complaintsDF`. This dataset contains consumer complaints for over 5,000 companies. Our goal is to create a `tibble` called `companyDF` containing the following summary information for each company: - -- `Company`: The company name (`Company`) -- `State`: The state (`State`) -- `percent_timely_response`: Percentage of timely complaints (`Timely response?`) -- `percent_consumer_disputed`: Percentage of complaints that were disputed by the consumer (`Consumer disputed?`) -- `percent_submitted_online`: Percentage of complaints that were submitted online (use column `Submitted via`, and consider a submission to be an online submission if it was submitted via `Web` or `Email`) -- `total_n_complaints`: Total number of complaints - -There are various ways to create `companyDF`. Let's practice using the pipes (`%>%`) to get `companyDF`. The idea is that our code at the end of question 2 will look something like this: - -[source,r] ----- -companyDF <- complaintsDF %>% - insert_here_code_to_change_variables %>% # (question 1) - insert_here_code_to_group_and_get_summaries_per_group # (question 2) ----- - -First, create logical columns (columns containing `TRUE` or `FALSE`) for `Timely response?`, `Consumer disputed?` and `Submitted via` named `timely_response_log`, `consumer_disputed_log` and `submitted_online`, respectively. - -`timely_response_log` and `consumer_disputed_log` will have value `TRUE` if `Timely response?` and `Consumer disputed?` have values `Yes` respectively, and `FALSE` if the value for the original column is `No`. `submitted_online` will have value `TRUE` if the the complaint was submitted via `Web` or `Email`. - -You can double check your results for each column by getting a table with the original and modified column, as shown below. In this case, we would want all `TRUE` values to be in row `Yes`, and all `FALSE` to be in row `No`. - -[source,r] ----- -table(companyDF$`Timely response?`, companyDF$timely_response_log) ----- - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -==== - -=== Question 2 - -Continue the pipeline we started in question (1). Get the summary information for each company. 
Note that you will need to include more pipes in the pseudo-code from question (1) as we want the summary for _each_ company in _each_ state. If a company is present in 4 states, `companyDF` should have 4 rows for that company -- one for each state. For the rest of the project, we will refer to a company as its unique combination of `Company` and `State`. - -[TIP] -==== -The function `n()` from `dplyr` counts the number of observations in the current group. It can only by used within `mutate`/`transmute`, `filter`, and the `summarize` functions. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -==== - -=== Question 3 - -Using `ggplot2`, create a scatterplot showing the relationship between `percent_timely_response` and `percent_consumer_disputed` for companies with at least 500 complaints. Based on your results, do you believe there is an association between how timely the company's response is, and whether the consumer disputes? Why or why not? - -[TIP] -==== -Remember, here we consider each row of `companyDF` a unique company. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -==== - -=== Question 4 - -Which company, with at least 250 complaints, has the highest percent of consumer dispute? - -[IMPORTANT] -==== -We are learning `tidyverse`, so use `tidyverse` functions to solve this problem. -==== - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -==== - -=== Question 5 - -Create a graph using `ggplot2` that compares `States` based on any columns from `companyDF` or `complaintsDF`. You may need to summarize the data, filter, or even create new variables depending on what your metric of comparison is. Below are some examples of graphs that can be created. Do not feel limited by them. Make sure to change the labels for each axis, add a title, and change the theme. - -- Cleveland's dotplot for the top 10 states with the highest ratio between percent of disputed complaints and timely response. -- Bar graph showing the total number of complaints in each state. -- Scatterplot comparing the percentage of timely responses in the state and average number of complaints per state. -- Line plot, where each line is a state, showing the total number of complaints per year. - -.Items to submit -==== -- R code used to solve the problem. -- Output from running your code. -- The plot produced. -- 1-2 sentences commenting on your plot. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project14.adoc b/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project14.adoc deleted file mode 100644 index 11a09e906..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2021/39000/39000-s2021-project14.adoc +++ /dev/null @@ -1,131 +0,0 @@ -= STAT 39000: Project 14 -- Spring 2021 - -**Motivation:** We covered a _lot_ this year! When dealing with data driven projects, it is useful to explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance, in this project we are going to practice using some of the skills you've learned, and review topics and languages in a generic way. - -**Context:** We are on the last project where we will leave it up to you on how to solve the problems presented. 
- -**Scope:** python, r, bash, unix, computers - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset - -The following questions will use the dataset found in Scholar: - -- `/class/datamine/data/disney` -- `/class/datamine/data/movies_and_tv/imdb.db` -- `/class/datamine/data/amazon/music.txt` -- `/class/datamine/data/craigslist/vehicles.csv` -- `/class/datamine/data/flights/2008.csv` - -== Questions - -[IMPORTANT] -==== -Answer the questions below using the language of your choice (R, Python, bash, awk, etc.). Don't feel limited by one language, you can use different languages to answer different questions. If you are feeling bold, you can also try answering the questions using all languages! -==== - -=== Question 1 - -What percentage of flights in 2008 had a delay due to the weather? Use the `/class/datamine/data/flights/2008.csv` dataset to answer this question. - -[TIP] -==== -Consider a flight to have a weather delay if `WEATHER_DELAY` is greater than 0. -==== - -.Items to submit -==== -- The code used to solve the question. -- The answer to the question. -==== - - -=== Question 2 - -Which listed manufacturer has the most expensive previously owned car listed in Craiglist? Use the `/class/datamine/data/craigslist/vehicles.csv` dataset to answer this question. Only consider listings that have listed price less than $500,000 _and_ where manufacturer information is available. - -.Items to submit -==== -- The code used to solve the question. -- The answer to the question. -==== - -=== Question 3 - -What is the most common and least common `type` of title in imdb ratings? Use the `/class/datamine/data/movies_and_tv/imdb.db` dataset to answer this question. - -[TIP] -==== -Use the `titles` table. -==== - -[TIP] -==== -Don't know how to use SQL yet? To get this data into an R data.frame , for example: - -[source,r] ----- -library(tidyverse) -con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") -myDF <- tbl(con, "titles") ----- -==== - -.Items to submit -==== -- The code used to solve the question. -- The answer to the question. -==== - -=== Question 4 - -What percentage of music reviews contain the words `hate` or `hated`, and what percentage contain the words `love` or `loved`? Use the `/class/datamine/data/amazon/music.txt` dataset to answer this question. - -[TIP] -==== -It _may_ take a minute to run, depending on the tool you use. -==== - -.Items to submit -==== -- The code used to solve the question. -- The answer to the question. -==== - -=== Question 5 - -What is the best time to visit Disney? Use the data provided in `/class/datamine/data/disney` to answer the question. - -First, you will need determine what you will consider "time", and the criteria you will use. See below some examples. Don't feel limited by them! Be sure to explain your criteria, use the data to investigate, and determine the best time to visit! Write 1-2 sentences commenting on your findings. - -- As Splash Mountain is my favorite ride, my criteria is the smallest monthly average wait times for Splash Mountain between the years 2017 and 2019. I'm only considering these years as I expect them to be more representative. My definition of "best time" will be the "best months". -- Consider "best times" the days of the week that have the smallest wait time on average for all rides, or for certain favorite rides. 
-- Consider "best times" the season of the year where the park is open for longer hours. -- Consider "best times" the weeks of the year with smallest average high temperature in the day. - -.Items to submit -==== -- The code used to solve the question. -- 1-2 sentences detailing the criteria you are going to use, its logic, and your defition for "best time". -- The answer to the question. -- 1-2 sentences commenting on your answer. -==== - -=== Question 6 - -Finally, use RMarkdown (and its formatting) to outline 3 things you learned this semester from The Data Mine. For each thing you learned, give a mini demonstration where you highlight with text and code the thing you learned, and why you think it is useful. If you did not learn anything this semester from The Data Mine, write about 3 things you _want_ to learn. Provide examples that demonstrate _what_ you want to learn and write about _why_ it would be useful. - -[IMPORTANT] -==== -Make sure your answer to this question is formatted well and makes use of RMarkdown. -==== - -.Items to submit -==== -- 3 clearly labeled things you learned. -- 3 mini-demonstrations where you highlight with text and code the thin you learned, and why you think it is useful. -OR -- 3 clearly labeled things you _want_ to learn. -- 3 examples demonstrating _what_ you want to learn, with accompanying text explaining _why_ you think it would be useful. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project01.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project01.adoc deleted file mode 100644 index cd87b185a..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project01.adoc +++ /dev/null @@ -1,431 +0,0 @@ -= STAT 19000: Project 1 -- Spring 2022 - -**Motivation:** Last semester each project was completed in Jupyter Lab from https://ondemand.brown.rcac.purdue.edu. Although our focus this semester is on the use of Python to solve data-driven problems, we still get to stay in the same environment. In fact, Jupyter Lab is Python-first. Now instead of using the `f2021-s2022-r` kernel, instead select the `f2021-s2022` kernel. - -**Context:** In this project we will re-familiarize ourselves with Jupyter Lab and its capabilities. We will also introduce Python and begin learning some of the syntax. - -**Scope:** Python, Jupyter Lab - -.Learning Objectives -**** -- Use Jupyter Lab to run Python code and create Markdown text. -- Gain exposure to some Python syntax. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/noaa/*.csv` - -== Questions - -=== Question 1 - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -[WARNING] -==== -Please review our updated xref:book:projects:submissions.adoc[submission guidelines] before submitting your project. -==== - -Let's start the semester by doing some basic Python work, and compare and contrast with R. - -First, let's learn how to run R code in our regular, non-R kernel, `f2021-s2022`. In a cell, run the following code in order to load the extension that allows us to run R code. - -[source,ipython] ----- -%load_ext rpy2.ipython ----- - -[NOTE] -==== -You can see these intructions xref:templates.adoc[here]. 
-==== - -Next, in order to actually _run_ R code, we need to place the following in the first line of _every_ cell where we want to run R code: `%%R`. You can think of this as declaring the code cell as an _R_ code cell. For example, the following will successfully run in our `f2021-s2022` kernel. - -[source,ipython] ----- -%%R - -my_vector <- c(1,2,3,4,5) -my_vector ----- - -Great! Run the cell in your notebook to see the output. - -Now, let's perform the equivalent operations in Python! - -[source,python] ----- -my_vector = (1,2,3,4,5) -my_vector ----- - -[IMPORTANT] -==== -The `f2021-s2022` kernel is a _normal_ kernel. Essentially, it will assume that code in a cell is Python code, unless "told" otherwise. -==== - -As you can see -- the output is essentially the same. However, in Python, there are actually a few "primary" ways you could do this. - -[source,python] ----- -my_tuple = (1,2,3,4,5,) -my_tuple ----- - -[source,python] ----- -my_list = [1,2,3,4,5,] -my_list ----- - -[source,python] ----- -import numpy as np - -my_array = np.array([1,2,3,4,5]) -my_array ----- - -The first two options are part of the Python standard library. The first option is a tuple, which is a list of values. A tuple is _immutable_, meaning that once you create it, you cannot change it. - -[source,python] ----- -my_tuple = (1,2,3,4,5) -my_tuple[0] = 10 # error! ----- - -The second option is a list, which is a list of values. A list is _mutable_, meaning that once you create it, you can change it. - -[source,python] ----- -my_list = [1,2,3,4,5] -my_list[0] = 10 -my_list # [10,2,3,4,5] ----- - -The third option is a numpy array, which is a list of values. A numpy array is _mutable_, meaning that once you create it, you can change it. In order to use numpy arrays, you must _import_ the numpy package first. `numpy` is a numerical computation library that is optimized for a lot of the work we will be doing this semester. With that being said, its best to get to learn about the basics in Python first, as a _lot_ can be accomplished without using `numpy`. - -For this question, read as much as you can about tuples and lists, and run the examples we provided above. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -In general, tuples are used when you have a set of known values that you want to store and access efficiently. Lists are used when you want to do the same, but you have the need to manipulate the data within. Most often, lists will be your go-to. - -In Python, lists are an _object_. Objects have _methods_. Methods are most simply defined as functions that are associated with and operate on the data (usually) within the object itself. - -https://docs.python.org/3/tutorial/datastructures.html#more-on-lists[Here] you can find a list of the list methods. For example, the _append_ method adds an item to the end of a list. - -Methods are _called_ using dot notation. The following is an example of using the _append_ method and dot notation to add the number 99 to the end of our list, `my_list`. - -[source,python] ----- -my_list = [1,2,3,4,5] -my_list.append(99) -my_list # [1,2,3,4,5,99] ----- - -Create a list called `my_list` with the values 1,2,3,4,5. Then, use the list methods to change `my_list` to contain the following values, in order: 7,5,4,3,2,1,6. Do _not_ manually set values using indexing -- _just_ use the list methods. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -.Solution -==== -[source, python] ----- -my_list = [1,2,3,4,5] -my_list.append(7) -my_list.reverse() -my_list.append(6) -my_list ----- - ----- -[7, 5, 4, 3, 2, 1, 6] ----- -==== - -=== Question 3 - -++++ - -++++ - -Great! You may have noticed (or already know) that to get the first value in a list (or tuple) we would do `my_list[0]`. Recall that in R, we would do `my_list[1]`. This is because Python has 0-based indexing instead of 1-based indexing. While at first this may be confusing, many people find it much easier to use 0-based indexing than 1 based indexing. - -Use indexing to print the values 7,4,2,6 from the modified `my_list` in the previous question. - -Use indexing to print the values in reverse order _without_ using the `reverse` method. - -Use indexing to print the second through 4th values in `my_list` (5,4,3). - -[TIP] -==== -The "jump" feature of Python indexing will be useful here! -==== - -**Relevant topics:** xref:book:python:lists.adoc#indexing[indexing] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -.Solution -==== -[source, python] ----- -my_list[::2] -my_list[::-1] -my_list[1:4] ----- - ----- -[7, 4, 2, 6] -[6, 1, 2, 3, 4, 5, 7] -[5, 4, 3] ----- -==== - -=== Question 4 - -++++ - -++++ - -Great! If you have 1 takeaway from the previous 3 questions it should be that when you see `[]` think _lists_. When you see `()` think _tuples_ (or generators, but ignore this for now). - -Its not a Data Mine project without _data_. After we get through some basics of Python, we will be primarily working with data using the `pandas` and `numpy` libraries.With that being said, there is no reason not to do some work manually in the meantime! - -[NOTE] -==== -Python does not have the data frame concept in its standard library like R does. This will most likely make things that would be simple to do in R much more complicated in Python. The `pandas` library introduces the data frame, so be patient and don't be too frustrated when we (at first) forgo the `pandas` library -==== - -Okay! Let's get started with our noaa weather data. The following is a very small sample of the `/depot/datamine/data/noaa/2020.csv` dataset. - -.sample ----- -AE000041196,20200101,TMIN,168,,,S, -AE000041196,20200101,PRCP,0,D,,S, -AE000041196,20200101,TAVG,211,H,,S, -AEM00041194,20200101,PRCP,0,,,S, -AEM00041194,20200101,TAVG,217,H,,S, -AEM00041217,20200101,TAVG,205,H,,S, -AEM00041218,20200101,TMIN,148,,,S, -AEM00041218,20200101,TAVG,199,H,,S, -AFM00040938,20200101,PRCP,23,,,S, -AFM00040938,20200101,TAVG,54,H,,S, ----- - -You can read https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt[here] about what the data means. - -. 11 character station ID -. 8 character date in YYYYMMDD format -. 4 character element code (you can see the element codes https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt[here] in section III) -. value of the data (varies based on the element code) -. 1 character M-flag (10 possible values, see section III https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt[here]) -. 1 character Q-flag (14 possible values, see section III https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt[here]) -. 1 character S-flag (30 possible values, see section III https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt[here]) -. 4 character observation time (HHMM) (0700 = 7:00 AM) -- may be blank - -Since we aren't using the `pandas` library, we need to use _something_ in order to bring the data into Python. 
In this case, we will use the `csv` library -- a library used for reading and writing dsv (data separated value) data. - -[NOTE] -==== -The official documentation for this library is https://docs.python.org/3/library/csv.html[here]. -==== - -If you read the first example in the `csv.reader` section https://docs.python.org/3/library/csv.html#csv.reader[here], you will find the following quick and succinct example. - -[source,python] ----- -import csv <1> - -with open('eggs.csv', newline='') as csvfile: <2> - spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|') <3> - for row in spamreader: <4> - print(', '.join(row)) <5> ----- - -.Output ----- -Spam, Spam, Spam, Spam, Spam, Baked Beans -Spam, Lovely Spam, Wonderful Spam ----- - -You do _not_ need to understand everything that is happening in this example (yet). With that being said, the following is an explanation for each part. - -<1> We are importing the `csv` library. If we didn't have this line, the program would crash when we try and call `csv.reader(...)` in the fourth line. -<2> We are opening the `eggs.csv` file. This is the file we will be reading. Here, `eggs.csv` is assumed to be in the same directory where we are running the code. It could just as easily be in a folder called "my_data" in the data depot, in which case we would replace `eggs.csv` with the absolute path to our file of interest: `/depot/datamine/data/my_data/eggs.csv`. In addition, we call our opened file `csvfile`. -<3> Here, we create a `csv.reader` object called `spamreader`. This object is a generator that will yield one row at a time. We can loop through this "generator" to get a single row of data at a time. -<4> Here, we are looping through each row of data from the `spamreader` object. For each loop, we save the data into a variable called `row`. Specifically, `row` is a list, where the first value is the first space-separated value in the row, the second is the second space separated value in the row, etc. We then use a _string_ method called join on the ", " string, which takes each value in the row and puts a ", " between them. This results in "Spam, Spam, Spam, ..., Baked Beans" that we see in the output. - -[NOTE] -==== -This code could have been written like this: - -[source,python] ----- -import csv - -csvfile = open('eggs.csv', newline='') -spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|') -for row in spamreader: - print(', '.join(row)) - -csvfile.close() ----- - -But we have to _close_ the file -- otherwise, it could cause issues down the road. The _with_ statement, among other things, handles this automatically for you. -==== - -One important part of learning a new language is jumping right in and trying things out! Modify the provided code to read in the `2020.csv` file and print the 4th column only. - -[CAUTION] -==== -We do not want you to print out _every_ row of data -- that would be a lot and cause your notebook to crash! Instead, in the line following the `print` statement write `break`. We will learn about this later, but the `break` statement will stop the loop as soon as it is run. This will cause the program to just print the first line of data. - -In general, we _never_ want more than 10 or so lines -- maybe 100 at the maximum. When in doubt, just print 10 lines. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -.Solution -==== -[source, python] ----- -import csv - -with open('/depot/datamine/data/noaa/2020.csv') as csvfile: - reader = csv.reader(csvfile, delimiter=',') - for row in reader: - print(row[3]) - break ----- - ----- -168 ----- -==== - -=== Question 5 - -++++ - -++++ - -Below we've provided you with code that we would like you to fill in. Print the first 10 rows of the data. - -[source,python] ----- -import csv - -with open('/depot/datamine/data/noaa/2020.csv') as my_file: - reader = csv.reader(my_file) - - # TODO: create variable to store how many rows we've printed so far - - for row in reader: - print(row) - - # TODO: increment the variable storing our count, since we've printed a row - - # TODO: if we've printed 10 rows, run the break statement - break ----- - -[TIP] -==== -You will need to indent the `break` statement to run it "within" the if statement you will create. Yes, we haven't taught if statements yet, but you can do this! -==== - -[TIP] -==== -If you want to try and solve this another way, Google "enumerate Python" and see if you can figure out how to do this _without_ using the counting variable you create. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -.Solution -==== -[source, python] ----- -import csv - -with open('/depot/datamine/data/noaa/2020.csv') as my_file: - reader = csv.reader(my_file) - - for ct, row in enumerate(reader): - print(row) - - if ct == 9: - break ----- - ----- -['AE000041196', '20200101', 'TMIN', '168', '', '', 'S', ''] -['AE000041196', '20200101', 'PRCP', '0', 'D', '', 'S', ''] -['AE000041196', '20200101', 'TAVG', '211', 'H', '', 'S', ''] -['AEM00041194', '20200101', 'PRCP', '0', '', '', 'S', ''] -['AEM00041194', '20200101', 'TAVG', '217', 'H', '', 'S', ''] -['AEM00041217', '20200101', 'TAVG', '205', 'H', '', 'S', ''] -['AEM00041218', '20200101', 'TMIN', '148', '', '', 'S', ''] -['AEM00041218', '20200101', 'TAVG', '199', 'H', '', 'S', ''] -['AFM00040938', '20200101', 'PRCP', '23', '', '', 'S', ''] -['AFM00040938', '20200101', 'TAVG', '54', 'H', '', 'S', ''] ----- -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project02-q02-sol.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project02-q02-sol.adoc deleted file mode 100644 index a233a1259..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project02-q02-sol.adoc +++ /dev/null @@ -1,10 +0,0 @@ -[source, python] ----- -print(f"There are {df.shape[1]} columns in the DataFrame!") -print(f"There are {df.shape[0]} rows in the DataFrame!") ----- - ----- -There are 8 columns in the DataFrame! -There are 15000000 rows in the DataFrame! 
----

diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project02.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project02.adoc
deleted file mode 100644
index 3b3c1bd91..000000000
--- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project02.adoc
+++ /dev/null
@@ -1,291 +0,0 @@
= STAT 19000: Project 2 -- Spring 2022

**Motivation:** In Python it is very important to understand some of the data types in a little bit more depth than you would in R. Many of the data types in Python will seem very familiar. A `character` in R is similar to a `str` in Python. An `integer` in R is an `int` in Python. A `float` in R is similar to a `float` in Python. A `logical` in R is similar to a `bool` in Python. In addition to all of that, there are some very popular classes introduced in packages like `numpy` and `pandas`. On the other hand, there are some data types in Python like `tuples`, `lists`, `sets`, and `dicts` that diverge from R a little bit more. It is important to understand some of these before jumping too far into everything.

**Context:** This is the second project introducing some basic data types, and demonstrating some familiar control flow concepts, all while digging right into a dataset.

**Scope:** dicts, sets, pandas, matplotlib

.Learning Objectives
****
- Demonstrate the ability to read and write data of various formats using various packages.
- Explain what a dict is and why it is useful.
- Understand how a set works and when it could be useful.
- List the differences between lists & tuples and when to use each.
- Gain familiarity with dict methods.
****

Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].

== Dataset(s)

The following questions will use the following dataset(s):

- `/depot/datamine/data/noaa/2020_sample.csv`

== Questions

=== Question 1

++++

++++

++++

++++

In the previous project, we started to get a feel for how lists and tuples work. As a part of this, we had you use the `csv` package to read in and process data. While this can certainly be useful, and is an efficient way to handle large amounts of data, it takes a _lot_ of work to get the data in a format where you can _use_ it.

As teased in the previous project, Python has a very popular package called `pandas` that is widely used for many data-related tasks. If you need to understand one thing about `pandas`, it is that it provides two key data types that you can take advantage of: the `Series` and the `DataFrame` types. Each of those objects has a _ton_ of built-in _attributes_ and _methods_. We will talk about this more in the future, but you can think of an _attribute_ as a piece of data within the object, and a _method_ as a function closely associated with the object or class. Just know that the attributes and methods provide lots of powerful features!

Please read the fantastic and quick 10 minute introduction to `pandas` https://pandas.pydata.org/docs/user_guide/10min.html[here]. We will be slowly introducing bits and pieces of this package throughout the semester. In addition, we will also start incorporating some plotting questions as the semester goes on.

Read in the dataset: `/depot/datamine/data/noaa/2020_sample.csv` using the `pandas` package, and store it in a variable called `df`.
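Before reading in the full dataset, here is a minimal sketch of the _attribute_ versus _method_ distinction described above, using a tiny, made-up DataFrame (not the NOAA data):

[source,python]
----
import pandas as pd

# a tiny DataFrame purely for illustration
tiny = pd.DataFrame({"station_id": ["A", "B", "C"], "value": [10, 20, 30]})

# "shape" is an attribute -- a piece of data stored on the object (no parentheses)
print(tiny.shape)    # (3, 2)

# "head" is a method -- a function attached to the object (called with parentheses)
print(tiny.head(2))
----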
- -[TIP] -==== -Our dataset doesn't have column headers, but headers are useful. Use the `names` argument to the `read_csv` method to give the dataframe a column header. - -[source,python] ----- -import pandas as pd - -df = pd.read_csv('/depot/datamine/data/noaa/2020_sample.csv', names=["station_id", "date", "element_code", "value", "mflag", "qflag", "sflag", "obstime"]) ----- -==== - -Remember in the previous project how we had you print the first 10 values of a certain column? This time, use the `head` method to print the first 10 rows of data from our dataset. Do you think it was easier or harder than doing something similar using the `csv` package? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -.View solution -[%collapsible.result] -==== -include::book:projects:example$19000-s2022-project02-q02-sol.adoc[] -==== - -=== Question 2 - -++++ - -++++ - -Imagine going back and using the `csv` package to first count the number of rows of data, and then count the number of columns. Seems like a lot of work for just getting a little bit of information about your data, right? Using `pandas` this is much easier. - -Use one of the https://pandas.pydata.org/docs/reference/frame.html#attributes-and-underlying-data[attributes] from your DataFrame in combination with xref:book:python:printing-and-f-strings.adoc[f-strings] to print the following: - ----- -There are 123 columns in the DataFrame! -There are 321 rows in the DataFrame! ----- - -[NOTE] -==== -I'm _not_ asking you to literally print the numbers 123 and 321 -- replace those numbers with the actual values. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -++++ - -++++ - -Dictionaries, often referred to as dicts, are really powerful. There are two primary ways to "get" information from a dict. One is to use the get method, the other is to use square brackets and strings. Test out the following to understand the differences between the two. - -[source,python] ----- -my_dict = {"fruits": ["apple", "orange", "pear"], "person": "John", "vegetables": ["carrots", "peas"]} - -# If "person" is indeed a key, they will function the same way -my_dict["person"] -my_dict.get("person") - -# If the key does not exist, like below, they will not -# function the same way. -my_dict.get("height") # Returns None when key doesn't exist -print(my_dict.get("height")) # By printing, we can see None in this case -my_dict["height"] # Throws a KeyError exception because the key, "height" doesn't exist ----- - -Under the hood, a dict is essentially a data structure called a hash table. https://en.wikipedia.org/wiki/Hash_table[Hash tables] are a data structure with a useful set of properties. The time needed for searching, inserting, or removing a piece of data has a constant average lookup time. This means that no matter how big your hash table grows to be, inserting, searching, or deleting a piece of data will usually take about the same amount of time. (The worst case time increases linearly.) Dictionaries (dict) are used a lot, so it is worthwhile to understand them. - -Dicts can also be useful to solve small tasks here and there. For example, what if we wanted to figure out how many times each of the unique `station_id` value appears? Dicts are a great way to solve this! Use the provided code to extract a list of `station_id` values from our DataFrame. Use the resulting list, a dict, and a loop to figure this out. 
- -[source,python] ----- -import pandas as pd - -station_ids = df["station_id"].dropna().tolist() ----- - -[TIP] -==== -You should get the following results. - -.Results ----- -print(my_dict['US1MANF0058']) # 378 -print(my_dict['USW00023081']) # 1290 -print(my_dict['US10sali004']) # 13 ----- -==== - -[TIP] -==== -If you get a `KeyError` -- don't forget -- you need to initialize the values for each key in the dict to 0 first. To get a unique list of station_id values, we can use the following code. - -[source,python] ----- -unique_ids = list(set(station_ids)) ----- - -`set` is another built-in type in Python. A useful use of `set` is that it can reduce a list to unique values _very_ efficiently. Here, we get the unique values and then convert the `set` back to a `list`. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ - -++++ - -Sets are very useful! I've created a nearly identical copy of our dataset here: `/depot/datamine/data/noaa/2020_sampleB.csv`. The "sampleB" dataset has one key difference -- I've snuck in a fake row of data! There is 1 row in the new dataset that is not in the old -- it can be identified by having a `station_id` that doesn't exist in the original dataset. Print the "intruder" row of data. - -[WARNING] -==== -There are 15000000 rows in the data frame. So this method will take too long, because it requires 15000000 times 15000001 comparisons to find the intruder: - -[source,python] ----- -import pandas as pd - -df_intruder = pd.read_csv('/depot/datamine/data/noaa/2020_sampleB.csv', names=["station_id", "date", "element_code", "value", "mflag", "qflag", "sflag", "obstime"]) -intruder_ids = df_intruder["station_id"].dropna().tolist() - -for i in intruder_ids: - if i not in station_ids: - print(i) ----- - -It would eventually work, but it will take way too long to finish. Same problem will occur here: -The `in` operator is useful for checking if a value is in a list. It is, however, essentially the same as what we tried above; it will be way too slow. - -[source,python] ----- -for ii in intruder_ids: - found = False - for i in station_ids: - if ii == i: - found = True - if not found: - print(ii) ----- -==== - -[TIP] -==== -We need to use our `set` trick from the question 3, so that we can (instead) make 39962 times 39963 comparisons to find the intruder. For example: - -[source,python] ----- -import pandas as pd - -df_intruder = pd.read_csv('/depot/datamine/data/noaa/2020_sampleB.csv', names=["station_id", "date", "element_code", "value", "mflag", "qflag", "sflag", "obstime"]) -intruder_ids = df_intruder["station_id"].dropna().tolist() - -unique_intruder_ids = list(set(intruder_ids)) - -for i in unique_intruder_ids: - if i not in unique_ids: - print(i) ----- - -We can also do this using the `in` operator, and we will get the same result. - -[source,python] ----- -for ii in unique_intruder_ids: - found = False - for i in unique_ids: - if ii == i: - found = True - if not found: - print(ii) ----- -==== - -[TIP] -==== -Check out https://realpython.com/python-sets/#operating-on-a-set[this] great article on sets. -==== - -[TIP] -==== -Now that you found the `station_id` of the intruder -- you will need to use `pandas` indexing to print the entire row of data. https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html[This] documentation should help. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 5 - -++++ - -++++ - -Run the following to see a very simple example of using `matplotlib`. - -[source,python] ----- -import matplotlib.pyplot as plt - -# now you can use it, for example -plt.plot([1,2,3,5],[5,6,7,8]) -plt.show() -plt.close() ----- - -There are a myriad of great https://matplotlib.org/stable/gallery/index.html[examples] and https://matplotlib.org/stable/tutorials/index.html[tutorials] on how to use `matplotlib`. With that being said, it takes a lot of practice to become comfortable creating graphics. - -Read through the provided links and search online. Describe something you would like to plot from our dataset. Use any of the tools you've learned about to extract the data you want and create the described plot. Do your best to get creative, but know that expectations are low -- this is (potentially) the very first time you are using `matplotlib` _and_ we are asking you do create something without guidance. Just do the best you can and post questions in Piazza if you get stuck! The "best" plot will get featured when we post solutions after grades are posted. - -[NOTE] -==== -You could use this as an opportunity to practice with dicts, sets, and lists. You could also try and learn about and use some of the features that we haven't mentioned yet (maybe something from the 10 minute intro to pandas). Have fun with it! -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project03.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project03.adoc deleted file mode 100644 index d3606da82..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project03.adoc +++ /dev/null @@ -1,291 +0,0 @@ -= STAT 19000: Project 3 -- Spring 2022 -:page-mathjax: true - -**Motivation:** We've now been introduced to a variety of core Python data structures. Along the way we've touched on a bit of pandas, matplotlib, and have utilized some control flow features like for loops and if statements. We will continue to touch on pandas and matplotlib, but we will take a deeper dive in this project and learn more about control flow, all while digging into the data! - -**Context:** We just finished a project where we were able to see the power of dictionaries and sets. In this project we will take a step back and make sure we are able to really grasp control flow (if/else statements, loops, etc.) in Python. - -**Scope:** Python, dicts, lists, if/else statements, for loops, break, continue - -.Learning Objectives -**** -- List the differences between lists & tuples and when to use each. -- Explain what is a dict and why it is useful. -- Demonstrate a working knowledge of control flow in python: if/else statements, while loops, for loops, etc. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` - -== Questions - -=== Question 1 - -++++ - -++++ - -Let's begin this project by taking another look at xref:spring2022/19000/19000-s2022-project02.adoc#question-4[question (4) from the previous project]. - -Although we were able to reduce the number of comparisons down _a lot_ (from around 15000000 squared to 40000 squared) -- it is still _terrible_ and very very slow. - -To see just how slow, let's time it! - -[source,python] ----- -from block_timer.timer import Timer -import pandas as pd - -# read in the intruder dataset and get the unique ids -df_intruder = pd.read_csv('/depot/datamine/data/noaa/2020_sampleB.csv', names=["station_id", "date", "element_code", "value", "mflag", "qflag", "sflag", "obstime"]) -intruder_ids = df_intruder["station_id"].dropna().tolist() -unique_intruder_ids = list(set(intruder_ids)) - -# read in the original dataset and get the unique ids -df_original = pd.read_csv('/depot/datamine/data/noaa/2020_sample.csv', names=["station_id", "date", "element_code", "value", "mflag", "qflag", "sflag", "obstime"]) -original_ids = df_original["station_id"].dropna().tolist() -unique_ids = list(set(original_ids)) - -with Timer(): - # compare the two lists - for i in unique_intruder_ids: - if i not in unique_ids: - print(i) ----- - -Yikes! That's really not very good! - -So, what is the better way? To take advantage of the `set` object! Specifically, read the section titled "Operating on a Set" https://realpython.com/python-sets/#operating-on-a-set[here], and think of a better way to get this value! Test out the new method -- how fast was it compared to the method above? - -[NOTE] -==== -On Brown, mine was 958 times faster than the original method! Definitely a worthwhile trick to use! -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -Unlike in R, where traditional loops are rare and typically accomplished via one of the apply functions, in Python, loops are extremely common and important to understand. In Python, any iterator can be looped over. Some common iterators are: tuples, lists, dicts, sets, pandas Series, and pandas DataFrames. - -Let's get started by reading in our dataset and taking a look. - -[source,python] ----- -import pandas as pd - -df = pd.read_csv("/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt", sep=";") ----- - -Use the following code to extract the sales amount in dollars into a list. - -[source,python] ----- -sales_list = df['Sale (Dollars)'].dropna().tolist() ----- - -Write a _loop_ that uses `sales_list` and sums up the total sales, and prints the _average_ sales amount. - -Of course, `pandas` provides a method to iterate over the `Sale (Dollars)` Series as well! It would start as follows. - -[source,python] ----- -for idx, val in df['Sale (Dollars)'].dropna().iteritems(): - # put code here for series loop ----- - -Use this method to calculate the average sales amount. Which is faster? Fill in the following skeleton code to find out. - -[source,python] ----- -from block_timer.timer import Timer - -with Timer(title="List loop"): - # code for list loop - -with Timer(title="Series loop"): - # code for series loop ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 3 - -++++ - -++++ - -You may have been surprised by the fact that iterating through the Series was slower than iterating through a list. https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758[Here] is a good post explaining why it is so slow! - -So why use `pandas`? Well, it starts to be pretty great when you can take advantage of vectorization. - -Let's do a new exercise. Instead of calculating the average sales amount, let's calculate the z-scores of the sales amounts. Just like before, do this using 2 methods. The first is to just use for loops, the `len` function, and the `sum` function. The second is to use `pandas`. I've provided you with the pandas solution. - -How do you calculate a z-score? - -$\frac{x_i - \mu}{\sigma}$ - -Where - -$\sigma = \sqrt{\sum_{i=0}^n{\frac{(x_i - \mu)^{2}}{n}}}$ - -$n$ is the number of elements in the list. - -$x_i$ is the ith element in the list. - -$\mu$ is the mean of the list. - -$\sigma$ is the standard deviation of the list. - -Give it a shot and fill in the code below. What do the results look like? - -[source,python] ----- -import pandas as pd -from block_timer.timer import Timer - -# df = pd.read_csv("/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt", sep=";") -sales_list = df['Sale (Dollars)'].dropna().tolist() - -with Timer(title="Loops"): - - # calculate the mean - mean = sum(sales_list)/len(sales_list) - - # calculate the std deviation - # you can use **2 to square a value and - # **0.5 to square root a value - - # calculate the list of z-scores - - # print the first 5 z-scores - print(zscores[:5]) - -with Timer(title="Vectorization"): - print(((df['Sale (Dollars)'] - df['Sale (Dollars)'].mean())/df['Sale (Dollars)'].std()).iloc[0:5]) ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ - -++++ - -While it is nearly always best to try and vectorize your code when using `pandas`, sometimes it isn't possible to do perfectly, or it just isn't worth the time to do it. For this question, we don't care about vectorization. - -We want to look at `Volume Sold (Gallons)` by `Store Number`. Start by building a dict called `volume_dict` that maps `Store Number` to `Volume Sold (Gallons)`. - -Since we only care about those two columns now, let's remove the rest. - -[source,python] ----- -df = df.loc[:, ('Store Number', 'Volume Sold (Gallons)')] ----- - -You can loop through the DataFrame as follows. - -[source,python] ----- -for idx, row in df.iterrows(): - # print(idx, row) ----- - -There, `idx` contains the row index, and `row` contains a Series object containing the row of data. You could then access either of the column using either `row['Store Number']` or `row['Volume Sold (Gallons)']`. - -Build your `volume_dict`. - -[TIP] -==== -Remember, you will need to instantiate each key in the dict to prevent `KeyError`s. Alternatively, you can use a defaultdict. A defaultdict is a dict that will automatically instantiate a new key to a particular value. You could for example do the following. - -[source,python] ----- -from collections import defaultdict - -volume_dict = defaultdict(int) ----- - -Then, by default, all keys will be instantiated to 0. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ - -++++ - -Great! Now you have your `volume_dict`. 
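As a quick, generic reminder (using a made-up dict, not the project's data), looping over a dict's key/value pairs and skipping some of them with `continue` looks something like this:

[source,python]
----
# a made-up dict mapping keys to counts, purely for illustration
toy_counts = {"a": 5, "b": 150, "c": 42}

for key, value in toy_counts.items():
    # skip small values entirely
    if value < 10:
        continue
    print(f"{key}: {value}")
----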
Write a loop that loops through your `volume_dict` and prints the `Store Number` and `Volume Sold (Gallons)` for each key. If the volume sold is less than 100000 use the `continue` keyword to skip printing anything. If the volumn sold is greater than 149999, print "HIGH: " before the store number, if the volume sold is less than 150000 print "LOW: " before the store number. - -The output should be the following. - -.Output ----- -LOW: 2190.0 -HIGH: 4829.0 -HIGH: 2633.0 -HIGH: 2512.0 -LOW: 3494.0 -LOW: 2625.0 -HIGH: 3420.0 -LOW: 3952.0 -HIGH: 3385.0 -LOW: 3354.0 -LOW: 3814.0 ----- - -[TIP] -==== -The `continue` keyword skips the rest of the code in the loop, and progresses to the next iteration. -==== - -[TIP] -==== -In Python, there is if/elif/else. Elif stands for "else if". -==== - -[TIP] -==== -To iterate through a dictionary, you can use the `items` method. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project04.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project04.adoc deleted file mode 100644 index 524a5413e..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project04.adoc +++ /dev/null @@ -1,240 +0,0 @@ -= STAT 19000: Project 4 -- Spring 2022 - -**Motivation:** Up until this point we've utilized bits and pieces of the pandas library to perform various tasks. In this project we will formally introduce pandas and numpy, and utilize their capabilities to solve data-driven problems. - -**Context:** By now you'll have had some limited exposure to pandas. This is the first in a three project series that covers some of the main components of both the numpy and pandas libraries. We will take a two project intermission to learn about functions, and then continue. - -**Scope:** python, pandas - -.Learning Objectives -**** -- Distinguish the differences between numpy, pandas, DataFrames, Series, and ndarrays. -- Use numpy, scipy, and pandas to solve a variety of data-driven problems. -- Demonstrate the ability to read and write data of various formats using various packages. -- View and access data inside DataFrames, Series, and ndarrays. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/stackoverflow/unprocessed/2021.csv` - -== Questions - -=== Question 1 - -++++ - -++++ - -++++ - -++++ - -The following is an example showing how to time creating a dictionary in two different ways. 
- -[source,python] ----- -from block_timer.timer import Timer - -with Timer(title="Using dict to declare a dict") as t1: - my_dict = dict() - -with Timer(title="Using {} to declare a dict") as t2: - my_dict = {} - -# or if you need more fine-tuned values -print(t1.elapsed) -print(t2.elapsed) ----- - -There are a variety of ways to store, read, and write data. The most common is probably still `csv` data. `csv` data is simple, and easy to understand, however, it is a horrible format to read, write, and store. It is slow to read. It is slow to write. It takes up a lot of space. - -Luckily, there are some other great options! - -Check out the `pandas` documentation showing the various methods used to read and write data: https://pandas.pydata.org/docs/reference/io.html - -Read in the `2021.csv` file into a `pandas` DataFrame called `my_df`. Use the `Timer` to time writing `my_df` out to `/scratch/brown/ALIAS/2021.csv`, `/scratch/brown/ALIAS/2021.parquet`, and `/scratch/brown/ALIAS/2021.feather`. - -[IMPORTANT] -==== -Make sure to replace "ALIAS" with your purdue alias. -==== - -Use f-strings to print how much faster writing the `parquet` format was than the `csv` format, as a percentage. - -Use f-strings to print how much faster writing the `feather` format was than the `csv` format, as a percentage. - -You should now have 3 files in your `$SCRATCH` directory: `2021.csv`, `2021.parquet`, and `2021.feather`. - -Use the `Timer` to time reading in the `2021.csv`, `2021.feather`, and `2021.parquet` files into `pandas` DataFrames called `my_df`. - -Use f-strings to print how much faster reading the `parquet` format was than the `csv` format, as a percentage. - -Use f-strings to print how much faster reading the `feather` format was than the `csv` format, as a percentage. - -Round percentages to 1 decimal place. See https://miguendes.me/73-examples-to-help-you-master-pythons-f-strings#how-to-format-a-number-as-percentage[here] for examples on how to do this using f-strings. - -Finally, how much space does each file take up? Use f-strings to print the size in MB. - -[TIP] -==== -There are a couple of options on how to get file size, https://stackoverflow.com/questions/2104080/how-can-i-check-file-size-in-python[here]. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -If you haven't already, please check out and walk through the https://pandas.pydata.org/docs/user_guide/10min.html#[10 minute intro to pandas]. It is a really great way to get started using `pandas`. - -Also, check out xref:book:python:pandas-indexing.adoc[this] and https://pandas.pydata.org/docs/user_guide/indexing.html[this]. - -A _method_ is a function that is associated with a particular class. For example, `mean` is a method of the `pandas` DataFrame object. - -[source,python] ----- -# myDF is an object of class DataFrame -# mean is a method of the DataFrame class -myDF.mean() ----- - -Typically, when using `pandas`, you will be working with either a DataFrame or a Series. The DataFrame class is what you would normally think of when you think about a data frame. A Series is essentially 1 column or row of data. In `pandas`, both Series and DataFrames have _methods_ that perform various operations. - -Use indexing and the `value_counts` method to get and print the count of `Gender` for survey respondents from Indiana. - -Next, use the `plot` method to generate a plot. 
Use the `rot` option of the `plot` method to rotate the x-labels so they are displayed vertically. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -Let's figure out whether or not `YearsCode` is associated with `ConvertedCompYearly`. Get an array of unique values for the `YearsCode` column. As you will notice, there are some options that are not numeric values! In fact, when we read in the data, because of these values ("Less than 1 year", "More than 50 years", etc.), `pandas` was unable to choose an appropriate data type for that column of data, and set it to "Object". Use the following code to convert the column to a string. - -[source,python] ----- -my_df['YearsCode'] = my_df['YearsCode'].astype("str") ----- - -Great! Now that column contains strings. Use the `replace` method with `regex=True` to replace all non numeric values with nothing! - -[source,python] ----- -my_df["YearsCode"] = my_df['YearsCode'].replace("[^0-9]", "", regex=True) ----- - -Next, use the `astype` method to convert the column to "int64". - -Finally, use the `plot` method to plot the `YearsCode` on the x-axis and `ConvertedCompYearly` on the y-axis. Use the `kind` argument to make it a "scatter" plot and set the `logy=True`, so large salaries don't ruin our plot. - -Write 1-2 sentences with any observations you may have. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ - -++++ - -Check out the `LanguageHaveWorkedWith` column. It contains a semi-colon separated list of languages that the respondent has worked with. Pretty cool. - -How many times is each language listed? If you get stuck, refer to the hints below. What languages have you worked with from this list? - -[TIP] -==== -You can start by converting the column to strings. - -[source,python] ----- -my_df['LanguageHaveWorkedWith'] = my_df['LanguageHaveWorkedWith'].astype(str) ----- -==== - -[TIP] -==== -This function can be used to "flatten" a list of lists. - -[source,python] ----- -def flatten(t): - return [item for sublist in t for item in sublist] - -flatten([[1,2,3],[4,5,6]]) ----- - -.Output ----- -[1, 2, 3, 4, 5, 6] ----- -==== - -[TIP] -==== -You can apply any of the https://www.w3schools.com/python/python_ref_string.asp[Python string methods] to an entire column of strings in `pandas`. For example, I could replace every instance of "hello" with nothing as follows. - -[source,python] ----- -myDF['some_column_of_strings'].str.replace("hello", "") ----- -==== - -[TIP] -==== -Check out the `split` string method. -==== - -[TIP] -==== -You could use a dict to count each of the languages, _or_, since this is a `pandas` project, you could convert the list to a `pandas` Series and use the `value_counts` method! -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ - -++++ - -`pandas` really helps out when it comes to working with data in Python. This is a really cool dataset, use your newfound skills to do a mini-analysis. Your mini-analysis should include 1 or more graphics, along with some interesting observation you made while exploring the data. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.

In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
====

diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project05.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project05.adoc
deleted file mode 100644
index b423ff405..000000000
--- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project05.adoc
+++ /dev/null
@@ -1,269 +0,0 @@
= STAT 19000: Project 5 -- Spring 2022
:page-mathjax: true

**Motivation:** We will pause in our series of `pandas` and `numpy` projects to learn one of the most important parts of writing programs -- functions! Functions allow us to reuse snippets of code effectively. Functions are a great way to reduce the repetition of code and also keep the code organized and readable.

**Context:** We are focusing on learning about writing functions in Python.

**Scope:** python, functions, pandas, matplotlib

.Learning Objectives
****
- Understand what a function is.
- Understand the components of a function in python.
- Differentiate between positional and keyword arguments.
****

Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].

== Dataset(s)

The following questions will use the following dataset(s):

- `/depot/datamine/data/whin/190/stations.csv`
- `/depot/datamine/data/whin/190/observations.csv`

== Questions

[NOTE]
====
We are very lucky to have great partners in the Wabash Heartland Innovation Network (WHIN)! They generously provide us with access to their API (https://data.whin.org/[here]) for educational purposes. You've most likely either used their API in a previous project, or you've worked with a sample of their data to solve some sort of data-driven problem.

In this project, we will be using a slightly modified sample of their dataset to learn more about how to write functions.
====

=== Question 1

++++

++++

First, read both datasets into variables named `stations` and `obs`. Second, take a look at the `head` of both dataframes. You will notice that the `station_id` column in the `obs` dataframe appears to correspond to the `id` column in the `stations` dataframe. This is a fairly common occurrence when data has been _normalized_ for a database. For this project, we will work with a single, combined dataset.

`pandas` has a `merge` method that can be used to join two dataframes based on a common column. Here, the `id` column from the `stations` dataframe matches the `station_id` column in the `obs` dataframe. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html[Here] is the documentation for `merge`.

[TIP]
====
Use `left_on` to specify the name of the column in the "left" dataframe. Use `right_on` to specify the name of the column in the "right" dataframe. Make the "left" dataframe be `obs`. Use the value "left" for the `how` argument to specify a left join.
====

Once merged, you will notice that in the new dataframe, `dat`, the `id` column from the `obs` dataframe is now labeled `id_x`, and the `id` column from the `stations` dataframe is now labeled `id_y`.
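To make the suffix behavior concrete, here is a minimal, generic sketch of a left merge. The dataframes and column names below are made up for illustration -- they are not the WHIN data:

[source,python]
----
import pandas as pd

# two tiny, made-up dataframes that share an "id" column name
right_df = pd.DataFrame({"id": [1, 2], "name": ["station_a", "station_b"]})
left_df = pd.DataFrame({"id": [10, 11, 12], "station_id": [1, 1, 2], "value": [5.0, 7.5, 6.25]})

# a left join keeps every row of the "left" dataframe;
# because both dataframes have an "id" column, pandas adds the
# default suffixes: "id_x" (from the left) and "id_y" (from the right)
merged = left_df.merge(right_df, left_on="station_id", right_on="id", how="left")
print(merged.columns.tolist())  # note the automatic "id_x" and "id_y" suffixes
----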
Use the `pandas` `drop` method to remove the `id_y` column.
Use the `pandas` `rename` method to rename `id_x` to `id`, and `name` to `station_name`.

Great! We have cleaned up our dataframe so it is easier to work with, while learning a variety of useful `pandas` methods.

.Items to submit
====
- Code used to solve this problem.
- Output from running the code.
====

=== Question 2

++++

++++

When looking at the new dataset, you may have noticed a mix of letters and numbers in the `id` column. Below are a few samples of the contents of that column.

.id sample
----
obs_1NnyYGMtAHBFDYWOBlsDlqppzVI
obs_1No0NHuqV4VjOK8p8FguPT02T5B
obs_1NqnftCklLZHBCHyykvcuc8QvE9
obs_1NqpV058q10hGNBNvYOBzzwpqOx
obs_1NqrK3mraUzaj2j7hg6VcB23RjJ
----

These values are a variation on https://github.com/segmentio/ksuid[ksuid] -- a K-sortable globally unique id. Ksuids are beneficial because they are sortable by time and unique (there is a minimal chance that any two ids would be the same). If you are interested, you can read more https://segment.com/blog/a-brief-history-of-the-uuid/[here].

Next, write a function called `get_datetime` that accepts a ksuid (as a string) and returns the `datetime`.

[TIP]
====
You can use the `parse` method to decode a ksuid.

[source,python]
----
from cyksuid import ksuid

mydatetime = ksuid.parse('1NnyYGMtAHBFDYWOBlsDlqppzVI').datetime
----

Don't forget to remove the "obs_" from the beginning of the ksuid.
====

The following code should produce the following output.

[source,python]
----
for k in ksuids:
    print(get_datetime(k))
----

.Output
----
2019-07-10 04:00:00
2019-07-10 04:15:00
2019-07-11 04:00:00
2019-07-11 04:15:00
2019-07-11 04:30:00
----

Let's verify the ordering claim -- that sorting the ksuids puts the observations in chronological order. First, use the `sample` method to get 10 random `id` values from the `dat` dataframe. Second, sort the values, then loop through the sorted list and use your `get_datetime` function to print each datetime.

Can you confirm that sorting the ksuids automatically sorts the observations by datetime?

.Items to submit
====
- Code used to solve this problem.
- Output from running the code.
====

=== Question 3

++++

++++

In this dataset we are given `latitude` and `longitude` values in degrees. We want to convert the degrees to radians. Write a function called `degrees_to_radians` that accepts a latitude or longitude value in degrees, and returns the same value in radians.

The formula to do this is:

$degrees*arctan2(0, -1)/180$

(Note that $arctan2(0, -1)$ is simply $\pi$.)

[TIP]
====
`numpy` has all of the needed functions for this!

[source,python]
----
import numpy as np

np.arctan2()
----
====

[TIP]
====
Make sure to convert your result from a `pandas` Series to a `float`.
====

To test out your function you can use:

[source,python]
----
degrees_to_radians(88.0)
----

.Output
----
1.53588974175501
----

.Items to submit
====
- Code used to solve this problem.
- Output from running the code.
====

=== Question 4

Write a function called `get_distance` that accepts two `pandas` Series, each containing a `latitude` and `longitude` value, and returns the distance between the two points in kilometers.

++++

++++

You can do this by using the https://en.wikipedia.org/wiki/Haversine_formula[Haversine formula].

$2*r*\arcsin\left(\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos(\phi_1)*\cos(\phi_2)*\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$

Where:

- $r$ is the radius of the Earth in kilometers; we can use 6367.4447 kilometers
- $\phi_1$ and $\phi_2$ are the latitude coordinates of the two points
- $\lambda_1$ and $\lambda_2$ are the longitude coordinates of the two points

[TIP]
====
In the formula above, the latitudes and longitudes need to be converted from degrees to radians. Your function from Question 3 will be perfect for this!

You can even put your `degrees_to_radians` function inside the `get_distance` function. Any "nested" function (a function within a function) can be called a "helper" function. If you have code that will be used multiple times, it is beneficial to create a "helper" function.

It is common practice in the Python world to add an underscore as a prefix to helper functions. It is a sign that the function is just for "internal" use and should largely be ignored by the user. Follow this practice and prefix your `degrees_to_radians` function with an underscore.
====

[TIP]
====
`numpy` has all of the needed functions for this!

[source,python]
----
import numpy as np

np.arcsin()
np.cos()
np.sin()
----
====

Test your function on the 2 rows with the following `id` values.

.id sample
----
obs_1amnn4xst3O9VOawmUHFiqBVnCK
obs_1fwlznMZXXS8WBkmyTHRgWnHYYf
----

.Results
----
37.896692299010574
----

.Items to submit
====
- Code used to solve this problem.
- Output from running the code.
====

=== Question 5

++++

++++

Great! Make sure to note these solutions for future use...

Next, write a function called `plot_stations`. `plot_stations` should accept a dataset as an argument and produce a plot with the station locations plotted on a map.

For consistency, we will use `plotly` to produce the plot. https://stackoverflow.com/questions/53233228/plot-latitude-longitude-from-csv-in-python-3-6[This] stackoverflow post shows some examples, and https://plotly.com/python-api-reference/generated/plotly.express.scatter_geo.html[here] is the documentation for the function.

We want to be careful not to plot the same point over and over. To avoid that, reduce the dataset (inside the function) so that each pair of latitude and longitude values is plotted only once.

Set `hover_name` to "station_id" so that hovering over a point displays the station id.

Set `scope` to "usa" to reduce the map to the USA. Be sure to zoom in on the map so you can see the stations within Indiana!

.Items to submit
====
- Code used to solve this problem.
- Output from running the code.
====

[WARNING]
====
_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.

In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project06.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project06.adoc deleted file mode 100644 index 8c439fee6..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project06.adoc +++ /dev/null @@ -1,197 +0,0 @@ -= STAT 19000: Project 6 -- Spring 2022 - -**Motivation:** We will pause in our series of pandas and numpy projects to learn one of the most important parts of writing programs — functions! Functions allow us to reuse snippets of code effectively. Functions are a great way to reduce the repetition of code and also keep the code organized and readable. - -**Context:** We are focusing on learning about writing functions in Python. - -**Scope:** python, functions, pandas, matplotlib - -.Learning Objectives -**** -- Understand what a function is. -- Understand the components of a function in python. -- Differentiate between positional and keyword arguments. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/whin/190/combined.csv` -- `/depot/datamine/data/flights/subset/*.csv` - -== Questions - -=== Question 1 - -++++ - -++++ - -++++ - -++++ - -[WARNING] -==== -When submitting your .ipynb file for this project, if the .ipynb file doesn't render in Gradescope, please export the notebook as a PDF and submit that as well -- you will be helping the graders a lot! -==== - -In project 5, you read in two separate, but related datasets, and used the `pandas` `merge` method to combine them. In this project, we've provided you with a combined dataset already, called `combined.csv`. - -Read in `combined.csv` into a data frame called `dat`. - -Your friend shared the following code with you. - -[source,python] ----- -import plotly.express as px - -def plot_stations(df, *ids): - df = df.groupby("station_id").head(1).loc[df['station_id'].isin(ids), ('station_id', 'latitude', 'longitude')] - fig = px.scatter_geo(df, lat="latitude", lon="longitude", scope="usa", - hover_name="station_id") - fig.update_layout(geo = dict(projection_scale=7, center=dict(lat=df['latitude'].iloc[0], lon=df['longitude'].iloc[0]))) - fig.show(renderer="jpg") ----- - -[IMPORTANT] -==== -In order for your plotly maps to show up properly in Gradescope, you must use the `renderer="jpg"` option. In addition, this removes the ability to zoom in on the map. The update_layout method can be safely copied and pasted into _all_ following functions you write so that the images that are rendered are zoomed in. This is critical so graders can properly see your work -- feel free to post questions in Piazza if you have questions. -==== - -Please do the following: - -- Give a 1-2 sentence explanation of what this function does. -- Use the function to plot 2 or more stations, BUT, use the function in two different ways (with the same result). Use tuple _unpacking_ in 1 call of the `plot_stations` function, and do _not_ in the other. - -[TIP] -==== -I would _highly_ recommend taking the time to read through the entire article https://realpython.com/defining-your-own-python-function/[here]. It is a very detailed article going through all the things you can do with functions in Python. 
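In particular, argument tuple packing and unpacking will come up in this question. As a generic illustration (the function and values below are made up -- this is not the project's `plot_stations`):

[source,python]
----
def show_ids(df_name, *ids):
    # *ids packs any extra positional arguments into a tuple
    print(f"{df_name}: {ids}")

show_ids("dat", "station_1", "station_2")  # packing: ids == ("station_1", "station_2")

my_ids = ("station_1", "station_2")
show_ids("dat", *my_ids)                   # unpacking: the tuple is spread back out into arguments
----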
The section on https://realpython.com/defining-your-own-python-function/#argument-tuple-packing[tuple packing] and https://realpython.com/defining-your-own-python-function/#argument-tuple-unpacking[tuple unpacking] may be particularly useful to you! -==== - -[TIP] -==== -The documentation for the `scatter_geo` function can be found https://plotly.com/python-api-reference/generated/plotly.express.scatter_geo[here]. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -++++ - -++++ - -In project 5, question 5, you wrote a function called `plot_stations` that given a data frame, would plot the locations of the stations on the map. - -Modify your `plot_stations` function so that it has an argument called `weighted` that defaults to `False`. If `weighted` is `True`, then the stations will be plotted with a size proportional to the number observations at each station. - -[source,python] ----- -plot_stations(df) # plots all stations same size -plot_stations(df, weighted=False) # plots all stations same size -plot_stations(df, weighted=True) # plots all stations with size proportional to the number of observations for the station ----- - -[TIP] -==== -You can find the documentation on the `scatter_geo` function https://plotly.com/python-api-reference/generated/plotly.express.scatter_geo[here]. -==== - -[TIP] -==== -https://realpython.com/defining-your-own-python-function/#default-parameters[This] section will review default parameters. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -There are many columns in our data frame with numeric data. Some examples are: `temperature_high`, `temperature_low`, `barometric_pressure`, `wind_speed_high`, etc. Wouldn't it be (kind of) cool to have an option in our `plot_stations` function that would weight the size of the points on the map based on those values instead of the number of observations? - -Modify the function so that it has another argument called `weight_by` that defaults to `None`. If `weight_by` is `None` (and `weighted` is `True`), the points on the plot should be sized by number of observations (like in question 2). Otherwise, `weight_by` can accept a string with the name of the column to base the point sizes on. For example: `plot_stations(dat, weighted=True, weight_by="temperature_high"` would create a plot where the size of the points are based on the _median_ value of `temperature_high` by station. - -[IMPORTANT] -==== -Please note, if weighted is `False`, then points should not be weighted regardless of the value of `weight_by`. -==== - -Of course, not all of the columns in our dataset are appropriate to weight by. Please demonstrate your function works by running the following calls to `plot_stations`. - -[source,python] ----- -plot_stations(dat, weighted=True, weight_by="temperature_high") -plot_stations(dat, weighted=True, weight_by="temperature_low") -plot_stations(dat, weighted=True, weight_by="wind_speed_high") -plot_stations(dat, weighted=False, weight_by="barometric_pressure") -plot_stations(dat, weighted=True, weight_by=None) ----- - -[NOTE] -==== -The wind_speed_high plot will have the most pronounced differences in size, but still rather small. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 4 - -++++ - -++++ - -You've learned a lot about plotting maps in plotly, the `groupby` method (most likely), and hopefully functions as well! - -Check out all of the datasets in the `/depot/datamine/data/flights/subset` directory. Write a function that creates _any_ new plot using some or all of the data in the `subset` directory. The plots could be maps, other plots, anything you want! The goal should be to make the function useful for exploring flight data in the provided format. Take advantage of the tuple packing and unpacking, default arguments, etc. You could even have a function _inside_ another function (a helper function). Do you best to challenge yourself and have fun. Any solid effort will receive full credit. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 (optional, 0 pts) - -++++ - -++++ - -Write a function that accepts the WHIN weather dataset (as a data frame), and an argument _n_. This function should plot the largest _n_ distances between stations on a map. See https://plotly.com/python/lines-on-maps/[here] for examples of plotting lines on a map. - -If you are feeling very adventurous, there is a data structure called a kdtree that you can use to very efficiently find the _n_ closest or furthest points, however, this is probably not necessary as there are not _that_ many distances to calculate for this dataset. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project07.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project07.adoc deleted file mode 100644 index 9a154db1a..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project07.adoc +++ /dev/null @@ -1,505 +0,0 @@ -= STAT 19000: Project 7 -- Spring 2022 - -**Motivation:** `pandas` provides a lot of very useful functionality when working with data. It provides a method for nearly every common pattern found when data wrangling. In this project, we will utilize some of the most popular methods to solve data driven problems and learn. - -**Context:** At this point in the semester, we have a solid grasp on the basics of Python, and are looking to build our skills using `pandas` by using `pandas` to perform some of the most common patterns found when data wrangling. - -**Scope:** pandas, Python - -.Learning Objectives -**** -- Distinguish the differences between numpy, pandas, DataFrames, Series, and ndarrays. -- Demonstrate the ability to use `pandas` and the built in DataFrame and Series methods to perform some of the most common operations used when data wrangling. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/disney/*.csv` - -== Questions - -=== Question 1 - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -In this project, you will be using our Disney dataset. There is a single csv file for each ride, as well as a `metadata.csv` file that contains information relevant to all rides, by day. - -Each ride file has the following four columns: `date`, `datetime`, `SACTMIN`, and `SPOSTMIN`. Each row in the ride files represents a single observation. The `datetime` is the date and time (to the second) of the observation. `SACTMIN` contains the _actual_ wait time (in minutes) of the given ride. `SPOSTMIN` is the posted wait time. `SPOSTMIN` may have a value of -999, which represents a ride being closed. - -While not (on the whole) a particularly large dataset, it can take some work to process the data. - -The first, low-hanging fruit would be to combine all of the ride datasets into a single dataset with the following columns: `datetime`, `SACTMIN`, `SPOSTMIN`, and `ride_name`. - -[NOTE] -==== -The `ride_name` column should be the name of the file without the `.csv` extension. -==== - -[NOTE] -==== -We will expect you to remove the `date` column for now, since that information is contained in the `datetime` column. -==== - -As mentioned earlier, `SPOSTMIN` may have a value of -999, which represents a ride being closed. Instead of combining what is really the posted wait time _and_ and indicator variable indicating the _status_ of the ride, let's represent these two things separately. Create a new column called `status` that has the value `closed` if the value of `SPOSTMIN` is -999, and `open` otherwise. Replace occurences of -999 in the `SPOSTMIN` column with `np.nan`. - -Finally, let's set each column to the appropriate data type, and "reset" the index. - -To summarize the tasks: - -- Combine all of the ride files into a single dataframe, and add a column called `ride_name` with the name of the given ride. -+ -[TIP] -==== -To help you programmatically loop through the files without typing up all of the file names, you could do something like the following. - -[source,python] ----- -import numpy as np -import pandas as pd -from pathlib import Path - -csv_files = Path('/depot/datamine/data/disney').glob('*.csv') - -for csv in csv_files: - # skip metadata and entities files - if csv.name == 'metadata.csv' or csv.name == 'entities.csv': - continue - df = pd.read_csv(csv) - print(csv.name) - print(df.head()) ----- -==== -+ -[TIP] -==== -To create a new column in a dataframe with a single value, is easy. - -[source,python] ----- -some_df['new_col'] = 'some value' ----- - -This will create a new column called `new_col` with the value `some value` in each row. -==== -+ -[TIP] -==== -To help you combine dataframes, you could use the `pandas` concat function. - -See https://pandas.pydata.org/docs/reference/api/pandas.concat.html?highlight=concat#pandas.concat[here]. -==== -+ -- Remove the `date` column. -+ -[TIP] -==== -To remove a column from a dataframe, you can use the https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html?highlight=drop#pandas.DataFrame.drop[`drop`] method. I like to use the `columns` parameter. Remember, if you have `inplace=False` (the default), you will need to reassign the dataframe with the output. Something like the following. 
- -[source,python] ----- -some_df = some_df.drop(columns=["some_column"]) ----- - -If you have `inplace=True`, you can just do the following, and `some_df` will be updated _in place_. - -[source,python] ----- -some_df.drop(columns=["some_column"], inplace=True) ----- -==== -+ -- Create a new column called `status` that has the value `closed` if the value of `SPOSTMIN` is -999, and `open` otherwise. -+ -[TIP] -==== -You could achieve this by first setting all values of your new column `status` to `open` (see the earlier tip about creating a new column). Then, you can use indexing to isolate all values in the `SPOSTMIN` that are -999, and set them to be `closed`. -==== -+ -- Replace -999 in the `SPOSTMIN` column with `np.nan`. -- Set each column to the appropriate data type. -+ -[TIP] -==== -Here is one way to convert each column to the appropriate data type. - -[source,python] ----- -dat["SACTMIN"] = pd.to_numeric(dat["SACTMIN"]) -dat["SPOSTMIN"] = pd.to_numeric(dat["SPOSTMIN"]) -dat["datetime"] = pd.to_datetime(dat["datetime"]) -dat["ride_name"] = dat["ride_name"].astype("category") -dat["status"] = dat["status"].astype("category") ----- -==== -+ -- Reset the index by running `dat.reset_index(drop=True, inplace=True)`. -+ -[TIP] -==== -Resetting the index will set your index to 0 for row 1, 1 for row 2, etc. This is important to do after combining dataframes that have different indices. Otherwise, using `.loc` may cause unexpected errors since `.loc` is _label_ based. -==== - -[TIP] -==== -The following is some output to help you determine if you have done this correctly. - -[source,python] ----- -print(dat.dtypes) ----- - -.Output ----- -datetime datetime64[ns] -SACTMIN float64 -SPOSTMIN float64 -ride_name category -status category -dtype: object ----- - -[source,python] ----- -print(dat.shape) ----- - -.Output ----- -(3443445, 5) ----- - -[source,python] ----- -dat.sort_values("datetime").head() ----- - -.Output ----- - datetime SACTMIN SPOSTMIN ride_name status -1209236 2015-01-01 07:45:15 NaN 10.0 soarin open -2019300 2015-01-01 07:45:15 NaN 5.0 spaceship_earth open -1741791 2015-01-01 07:46:22 NaN 5.0 rock_n_rollercoaster open -1484006 2015-01-01 07:47:26 NaN 5.0 kilimanjaro_safaris open -2618179 2015-01-01 07:47:26 NaN 5.0 expedition_everest open ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -Wow, question 1 was a lot to do. You will find that a lot of up front work spent cleaning up the dataset will pay dividends in the future. - -The purpose of this project is to get used to the `pandas` library, and perform tasks that you will likely run into in your data science career. Let's take some time getting a feel for our data with some summaries. - -[TIP] -==== -In a browser, pull up the https://pandas.pydata.org/docs/reference/index.html[`pandas API reference (click here)`]. The `pandas` library is pretty large and not easy to memorize. I find it very worthwhile to pull up the API documentation, and use its search feature. - -By looking in the `DataFrame` section, you can see all of the methods that are available to you when working with the DataFrame. - -By looking in the `Series` section, you can see all of the methods that are available to you when working with a Series (a column in your DataFrame). -==== - -Use the `describe` method to get a quick summary of the data. While that output is useful, perhaps it would be more useful to see that information broken down by `ride_name`. 
Use the `groupby` method to first group by the `ride_name` and _then_ use the `describe` method. - -The `groupby` method is powerful. By providing a list of column names, `pandas` will group the data by those columns. Any further chained methods will then be applied to the data at that _group_ level. For example, if you had vaccination data that looks similar to the following. - -.Data sample ----- -person_id,state,vaccine_type,age,date_given -1,OH,Hepatitis A,22,2015-01-01 -1,OH,Hepatitis B,22,2015-01-01 -2,IN,Chicken Pox,12,2015-01-01 -3,IN,Hepatitis A,35,2015-01-01 -4,IN,Hepatitis B,18,2015-01-01 -3,IN,COVID-19,35,2015-01-01 ----- - -Using `pandas`, we could get the vaccination count by state as follows. - -[source,python] ----- -dat.groupby("state").count() ----- - -Or, we could get the average vaccination age by state as follows. - -[source,python] ----- -dat.groupby("state")["age"].mean() ----- - -If it makes sense, we can group by multiple columns at once. For instance, if we wanted to get the count of `vaccination_type` by `state` and `age`, we could do the following. - -[source,python] ----- -dat.groupby(["state", "age"])["vaccination_type"].count() ----- - -Chain some `pandas` methods together, to get the mean `SACTMIN` and `SPOSTMIN` by `ride_name`, sorted from from highest mean `SACTMIN` to lowest. - -[TIP] -==== -The `groupby`, `mean`, and `sort_values` methods from `pandas` are what you need to solve this problem. Check out the arguments for the https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html[`sort_values`] method to figure out how to sort from largest to smallest. In general, opening up the documentation and looking at the arguments is a good practice -- you never know what useful feature a method may have! -==== - -[NOTE] -==== -When I say "chain" `pandas` methods, I mean that you can continue to call methods on the result of the previous method. For example, something like: `dat.groupby(["some_column"]).mean()`. This would first group by the "some_column" column, calculate the mean values for each column for each group. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -One key part of working with data is visualizing your data. `pandas` provides some nice built in methods to create plots. Create both a single `matplotlib` bar plot and the equivalent `plotly` plot of the median `SPOSTMIN` for each ride. - -[NOTE] -==== -You can create plots with 2 small 1-line `pandas` method chains. Search for "plot" in the `pandas` API documentation to find the appropriate methods and arguments. -==== - -[IMPORTANT] -==== -Make sure to use the "jpg" renderer for the plotly plot. This would be similar to the following. - -[source,python] ----- -fig = dat.groupby().mean().plot(...) -fig.show(renderer="jpg") ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ - -++++ - -Another really powerful feature in `pandas` is the `apply` method. The `apply` method allows you to apply a function to each element of a Series or DataFrame. Each element of a DataFrame is a Series containing either a row or column of data (depending on the value of the `axis` argument). - -In the previous two projects, you learned a lot about writing Python functions. 
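Before writing your own, here is a minimal, self-contained sketch of `apply` on a toy dataframe, just to make the `axis` argument concrete (the column names below are made up for illustration).

[source,python]
----
import pandas as pd

toy = pd.DataFrame({"ride_a": [30.0, 90.0], "ride_b": [60.0, 120.0]})

# axis=0 (the default): the function receives each *column* as a Series
print(toy.apply(lambda col: col / 60))

# axis=1: the function receives each *row* as a Series
print(toy.apply(lambda row: row.max(), axis=1))
----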
Write a simple Python function that when applied to the dataframe that contains the median `SPOSTMIN` and `SACTMIN` values for each ride, returns the same dataframe but the wait time is shown in hours not minutes. Next, use the `query` method to return only the rides where the `SPOSTMIN` is 1 hour or more. - -[NOTE] -==== -You may or may not have noticed that the result of this questions solution and the previous questions solution are similar in that they both have `ride_name` as the _index_ of our dataframe rather than a column. This is fine for a lot of work, but it is important to be at the very least _aware_ that it is an index. To make `ride_name` a column again, you could do two different things. - -[source,python] ----- -dat.groupby("some_column").mean().reset_index() - -# or - -dat.groupby("some_column", as_index=False).mean() ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ - -++++ - -In the "tidyverse" in R, there is a very common pattern of writing code that creates new columns based on existing columns. Of course, this is easy to do in `pandas`, for example, given the following sample of data, you could create a new column that is the result of adding two existing columns together. - -.Data sample ----- -person_id,birth,death,state -1,1923,2001,IN -2,1930,1977,IN -3,1922,2017,IN -4,1922,2006,OH -5,1922,1955,OH -6,1940,2000,MO ----- - -[source,python] ----- -dat["age"] = dat["death"] - dat["birth"] ----- - -Not only is that easy, but it is very fast, and vectorized. However, let's say that instead, we want to create an `age_by_state` column that is the average age at death by state. Of course, this could be accomplished using `groupby`. - -[source,python] ----- -dat.groupby("state")["age"].mean() ----- - -With that being said, this results in multiple extra columns and the data is no longer on a 1 person per row basis. In the "tidyverse" in R, we could easily produce the following dataset as follows. - -.Data sample to produce ----- -person_id,birth,death,state,age_by_state -1,1923,2001,IN, 73.3 -2,1930,1977,IN, 73.3 -3,1922,2017,IN, 73.3 -4,1922,2006,OH, 58.5 -5,1922,1955,OH, 58.5 -6,1940,2000,MO, 60.0 ----- - -[source,r] ----- -library(tidyverse) - -dat %>% - group_by(state) %>% - mutate(age_by_state = mean(death - birth)) ----- - -How would we accomplish this using `pandas`? We would do so as follows. - -[source,python] ----- -dat.assign(age = lambda df: df['death'] - df['birth'], - age_by_state = lambda df: df.groupby('state')['age'].transform("mean"))\ - .drop(columns="age") ----- - -As you can see, this is not nearly as ergonomic in Python using `pandas` as it is using `tidyverse` in R. - -[NOTE] -==== -You may have noticed some weird "lambda" thing. This is called a lambda function -- in other languages it is sometimes called an anonymous function. It is a function that is defined without a name. It is useful for creating small functions. If instead of lambda functions, we used regular functions, our code would have looked like the following. - -[source,python] ----- -def first(df): - return df['death'] - df['birth'] - -def second(df): - return df.groupby('state')['age'].transform("mean") - -dat.assign(age = first, age_by_state = second).drop(columns="age") ----- -==== - -Create four new columns in the dataframe: `mean_wait_time_act`, `mean_wait_time_post`, `median_wait_time_act`, and `median_wait_time_post`. 
They should each contain the mean or median posted or actual wait time by `ride_name`. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 6 (optional, 0 pts) - -Heatmaps can be hit or miss when considering their usefulness. Create a heatmap that visualizes the median `SACTMIN` for each ride by day of the week. - -[TIP] -==== -In the same way you can, in a vectorized way, perform string methods on a Series containing strings using the `.str` attribute (for example `dat["my_column_of_strings"].str.replace("$", "")` would remove the "$" from all of the strings in the column) -- you can do the same with datetimes using the `.dt` attribute. Check out the methods available to operate by searching for "Series.dt" in the `pandas` API documentation. https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.day_name.html#pandas.Series.dt.day_name[This] method will be particularly useful. -==== - -[TIP] -==== -https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html?highlight=pivot#pandas.DataFrame.pivot[This] is the documentation on the `pivot` method, which is a powerful method to reshape a dataset. -==== - -[TIP] -==== -Once you have a new column (let's call it `day`), you want to reshape the dataframe so that the `ride_name` is the row index, `day` is the column index, and the values in the cells are the median `SACTMIN` for the given `day` and `ride_name` combination. This can be achieved using `pivot`. - -In `pivot` the `index` argument is the name of the column of data that you want to be the row index. The `columns` argument is the name of the column of data that you want to be the column index. The `values` argument is the name of the column of data that you want to be the values in the cell. -==== - -[IMPORTANT] -==== -Make sure to use `groupby` and `median` to first group by both `ride_name` and `day`, then calculate the median for each of those combinations. Directly after calling `median`, make sure to call `reset_index` so the `day` and `ride_name` indices become columns again (before calling `pivot`). -==== - -[TIP] -==== -Once you have your pivoted data, you can plot the heatmap as follows. - -[source,python] ----- -import plotly.express as px - -fig = px.imshow(pivoted_data, aspect="auto") -fig.show(renderer="jpg") ----- -==== - -Look at your resulting heatmap. It is not particularly useful, althought you can see that flight of passage is super busy and spaceship earth, not so much. This doesn't really give us a good idea of how busy a _day_ is though, does it? - -What if we normalized the median `SACTMIN` by ride? That would let us compare how busy a ride is on a given day compared to how busy that same ride is on all other days. - -Normalize your pivoted data by ride. Do this by using the `apply` method. - -[source,python] ----- -def normalize(ride): - def _normalize(val, mi, ma): - return (val-mi)/(ma-mi) - - return(ride.apply(_normalize, mi=ride.min(), ma=ride.max())) - -pivoted_data.apply(normalize, axis=1) ----- - -Replot your heatmap with the normalized data. What day looks the most busy (anecdotally)? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project08.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project08.adoc deleted file mode 100644 index 66b325553..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project08.adoc +++ /dev/null @@ -1,201 +0,0 @@ -= STAT 19000: Project 8 -- Spring 2022 - -**Motivation:** Learning how to wrangle and clean up data using `pandas` is extremely useful. It takes lots of practice to start to feel comfortable. - -**Context:** At this point in the semester, we have a solid grasp on the basics of Python, and are looking to build our skills using `pandas` by using `pandas` to solve data-driven problems. - -**Scope:** Python, pandas - -.Learning Objectives -**** -- Distinguish the differences between numpy, pandas, DataFrames, Series, and ndarrays. -- Demonstrate the ability to use pandas and the built in DataFrame and Series methods to perform some of the most common operations used when data wrangling. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/stackoverflow/unprocessed/2011.csv` - -== Questions - -=== Question 1 - -++++ - -++++ - -Take a look at the datasets in `/depot/datamine/data/stackoverflow/unprocessed/`. There are a variety of ways this dataset could be cleaned up. In this project, we will figure out how to clean these datasets up a bit, using `pandas`. - -Read in `2011.csv`. This is a comma-separated dataset. If you want a given csv file to be easy to parse through using a variety of tools, you should first make sure that the delimiter is a comma, and that there is exactly 1 less comma in each row than there is columns. - -Print the columns of the dataset that have commas in their content. Which columns have commas in their content, and how many commas are in each column, total? - -Results should look like: - -.Output ----- -Which best describes the size of your company?: 821.0 -Which of the following best describes your occupation?: 210.0 -Unnamed: 18: 1940.0 -... ----- - -[TIP] -==== -Remember, you can use string methods on a column with string data using the `.str` attribute in pandas. See https://pandas.pydata.org/docs/reference/api/pandas.Series.str.html?highlight=str#pandas.Series.str[here] for more information. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -It looks like there are a lot of commas in a lot of columns in our dataset. This _could_ make it more difficult to parse this dataset than necessary. - -[NOTE] -==== -For example, skip to question 3 and read the first paragraph. Not all analysis is done in R or Python! -==== - -Perform the same operations in question (1), but instead of looking for commas in the content of each column, look for semi-colons. 
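[TIP]
====
A minimal sketch of one way to count a delimiter's occurrences per column is shown below -- it assumes the data has been read into a dataframe called `df`, and the column selection and print format are entirely up to you.

[source,python]
----
import pandas as pd

df = pd.read_csv("/depot/datamine/data/stackoverflow/unprocessed/2011.csv")

# look at the string (object) columns only and total up the chosen delimiter
for col in df.select_dtypes(include="object").columns:
    total = df[col].str.count(";").sum()
    if total > 0:
        print(f"{col}: {total}")
----
====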
- -Given the fact that we want our dataset easy to parse, and given what we know about the usage of commas and semi-colons, what would you suggest we do to clean up this dataset, and why? - -Hopefully, your answer was to convert instances of "internal" semi-colons to commas, so there are no remaining "internal" semi-colons, and only commas. This way, you can export the entire dataset to a dsv (delimiter-separated value) with semi-colons as the delimiter instead of a comma. - -Then, convert all instances of semi-colons to commas. - -[TIP] -==== -Double check by re-running your code that checks for semi-colons, to make sure they no longer exist. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -++++ - -++++ - -You may have noticed some other low-hanging fruit that could be cleaned up. You will notice that some columns have "Unnamed: X" (where X is a number) and essentially either contain some value, or are empty. These columns represent a potential answer to the most previous column not named "Unnamed: X" (again, where X is a number). - -Instead of having a separate column for each potential answer (as it currently is), _instead_, it would be much better to have a single column where each row could contain one of the categorical values that was originally shown in each of the "Unnamed: X" columns. Let's do the following for each set of "Unnamed: X" columns. - -. For each column, if the value is not `pd.NA`, then append it to a comma-separated list of values in the original question column. -+ -[TIP] -==== -So, for example, given the following example, we would want the following result. - ----- -Original question?; Unnamed: 1; Unnamed: 2; Unnamed: 3 -answerA;answerB;answerC;answerD -answerA; NA; NA; answerD ----- - ----- -Original question? -answerA,answerB,answerC,answerD -answerA,answerD ----- - -However, we would expect _all_ such columns to be combined, for each set of columns where the potential answer is broken into multiple columns. -==== -+ -[IMPORTANT] -==== -Remove commas in the given columns prior to pasting them together with commas. -==== -+ -. After, and _only after_ the columns have been combined, remove the "Unnamed: X" columns. That data is now redundant. - -[TIP] -==== -The original question column will be where the rest of the columns data is stored. Don't forget to include the value in the original question column in the final list of answers. -==== - -[TIP] -==== -You could use whether or not "unname" is in the column name to find and combine data as described above. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 (optional, 0 pts) - -++++ - -++++ - -In the previous questions, you were able to greatly simplify the dataset. This is great, however, let's try and automate this process in case we were to ever receive a dataset like this, but with different column names and values. Assume things would be in the same format, so a question with multiple choice answers will have columns called "Unnamed: X", immediately following the column with the actual question. - -Write a function called `fix_columns` that accepts a `pandas` DataFrame as an argument, changes all instances of semi colons to a comma within the "Unnamed: X" columns, and changes the column names as described above (including the eventual removal of the "Unnamed: X" columns). - -.Items to submit -==== -- Code used to solve this problem. 
-- Output from running the code. -==== - -=== Question 5 - -++++ - -++++ - -Calculate a breakdown of the column "Which languages are you proficient in?". Create a graphic using the plotting package of your choice, showing the number of people who are proficient in the top 10 named languages (in order of most to least). Create this graphic using the cleaned up 2011 data. - -[WARNING] -==== -Remember, if you are using `plotly`, be sure to set `renderer="jpg"` so that your image appears in the notebook in Gradescope. If you notebook does not appear in Gradescope, you will not receive full credit. -==== - -[TIP] -==== -. You can now use string methods on that column to get the languages. -. There is a special `Counter` dict that could be useful. - -[source,python] ----- -from collections import Counter - -my_counter = Counter(['first', 'second', 'third', 'third', 'third']) -my_counter.update(['first', 'first', 'second']) -my_counter ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project09.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project09.adoc deleted file mode 100644 index 52fe3d160..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project09.adoc +++ /dev/null @@ -1,115 +0,0 @@ -= STAT 19000: Project 9 -- Spring 2022 - -**Motivation:** Learning how to wrangle and clean up data using pandas is extremely useful. It takes lots of practice to start to feel comfortable. - -**Context:** At this point in the semester, we have a solid grasp on the basics of Python, and are looking to build our skills using `pandas` by using `pandas` to solve data-driven problems. - -**Scope:** Python, pandas - -.Learning Objectives -**** -- Distinguish the differences between numpy, pandas, DataFrames, Series, and ndarrays. -- Demonstrate the ability to use pandas and the built in DataFrame and Series methods to perform some of the most common operations used when data wrangling. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/disney/total.parquet` - -== Questions - -=== Question 1 - -++++ - -++++ - -Let's start by reading in the cleaned up and combined dataset. This is just the cleaned up dataset -- essentially the same thing you got as a result from much of your processing from project 7. - -How many rows of data are there for each ride? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -++++ - -++++ - -Recall that a single row of data either has a value for `SPOSTMIN` or `SACTMIN`, but not both. How many rows of data are there in total? How many non-null rows for `SPOSTMIN`? How many non-null rows for `SACTMIN`? Create a new dataframe called `reduced` where: - -- Each row has a value for both `SPOSTMIN` and `SACTMIN`. 
The value in the `SPOSTMIN` column is the value for the closest `SPOSTMIN` value in seconds from the datetime shown for the `SACTMIN` value. -- There is a new column called `time_diff` that is the difference (in seconds) between the `SACTMIN` value and associated closest `SPOSTMIN` value. - -[TIP] -==== -This is the toughest question for this project. So it is OK if it takes you a bit more time to think of a solution. -==== - -[TIP] -==== -Check out the `shift` method in the `pandas` documentation. You _could_ write a function that operates on a single dataframe (think a dataframe for a single ride), and adds a variety of columns to the dataset using the `shift` method, and systematically sets the `SPOSTMIN` values and `time_diff` values accordingly. This method could then be applied using the `groupby` method. This is one potential way to solve the problem! - -Don't worry _too_ much about edge cases -- as long as you are close, you will get full credit. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -How many fewer rows does `reduced` have than the original dataset? What does the `time_diff` column look like? - -In project 7 you calculated the median `SPOSTMIN` and `SACTMIN` by `ride_name`. Perform the same operation on `reduced`. Are the `SACTMIN` and `SPOSTMIN` medians closer or further away than our not-cleaned data from project 7? - -Do you think that, overall, the data in `reduced` is close enough (by time) to be able to draw comparisons? Why or why not? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ - -++++ - -Any observation where the (absolute) `time_diff` is greater than an hour is probably not very high quality. Remove said observations from `reduced`. How many rows are left in `reduced`? - -Finally, explore the refined dataset, `reduced`, more. Write a question you would like to have answered down, what you think the answer will be, and do your best to used the dataset to answer your question. - -Your analysis should include: a question, your hypothesis, at least 1 graphic, any and all code you used, and your conclusions. You will not be graded on whether or not you are correct, but rather the effort you put into your analysis. Any good effort including the requirements will receive full credit. Have fun! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project10.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project10.adoc deleted file mode 100644 index d9743ed66..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project10.adoc +++ /dev/null @@ -1,413 +0,0 @@ -= STAT 19000: Project 10 -- Spring 2022 - -**Motivation:** We’d be remiss spending almost an entire semester solving data driven problems in python without covering the basics of classes. 
Whether or not you will ever choose to use this feature in your work, it is best to at least understand some of the basics so you can navigate libraries and other code that does use it. - -**Context:** We’ve spent nearly the entire semester solving data driven problems using Python, and now we are going to learn about one of the primary features in Python: classes. Python is an object oriented programming language, and as such, much of Python, and the libraries you use in Python are objects which have attributes and methods. In this project we will explore some of the terminology and syntax relating to classes. This is the first in a series of 3 projects focused on reading and writing classes in Python. - -**Scope:** Python, classes - -.Learning Objectives -**** -- Use classes to solve a data-driven problem. -- Understand and identify attributes and methods of a class. -- Differentiate between class attributes and instance attributes. -- Differentiate between instance methods, class methods, and static methods. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/` - -== Questions - -=== Question 1 - -++++ - -++++ - -++++ - -++++ - -Carefully read through https://thedatamine.github.io/the-examples-book/python.html#p-classes[this] quick walkthrough of classes in Python. In https://thedatamine.github.io/the-examples-book/projects.html#p12-190[previous] 190 projects, students built classes to represent decks of cards. Now, we've provided you with a couple of classes below. - -[source,python] ----- -class Card: - - _value_dict = {"2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8":8, "9":9, "10": 10, "j": 11, "q": 12, "k": 13, "a": 14} - def __init__(self, number, suit): - if str(number).lower() not in [str(num) for num in range(2, 11)] + list("jqka"): - raise Exception("Number wasn't 2-10 or J, Q, K, or A.") - else: - self.number = str(number).lower() - if suit.lower() not in ["clubs", "hearts", "diamonds", "spades"]: - raise Exception("Suit wasn't one of: clubs, hearts, spades, or diamonds.") - else: - self.suit = suit.lower() - - def __str__(self): - return(f'{self.number} of {self.suit.lower()}') - - def __repr__(self): - return(f'Card(str({self.number}), "{self.suit}")') - - def __eq__(self, other): - if self.number == other.number: - return True - else: - return False - - def __lt__(self, other): - if self._value_dict[self.number] < self._value_dict[other.number]: - return True - else: - return False - - def __gt__(self, other): - if self._value_dict[self.number] > self._value_dict[other.number]: - return True - else: - return False - - def __hash__(self): - return hash(self.number) - - -class Deck: - brand = "Bicycle" - _suits = ["clubs", "hearts", "diamonds", "spades"] - _numbers = [str(num) for num in range(2, 11)] + list("jqka") - - def __init__(self): - self.cards = [Card(number, suit) for suit in self._suits for number in self._numbers] - - def __len__(self): - return len(self.cards) - - def __getitem__(self, key): - return self.cards[key] - - def __setitem__(self, key, value): - self.cards[key] = value ----- - -There is a _lot_ to unpack here, but don't worry, we will cover it! - -_Instantiate_, or create an instance of, the class `Card` called `my_card`. Do the same for the `Deck` class, calling the created _object_ `my_deck`. Run the following. 
- -[source,ipython] ----- -print(my_card) ----- - -[source,ipython] ----- -my_card ----- - -What are the differences in the output? Which parts of the `Card` class controls the appearance of the outputs? What are those two special methods called (_something_ methods)? - -Now run the following. - -[source,ipython] ----- -print(my_deck) ----- - -What is printed? Modify the `Deck` class so that it prints "A bicycle deck.", where "bicycle" would be changed to "copag" if the _brand_ was changed to "copag". - -Make sure that your modification works! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -++++ - -++++ - -Okay great! You've already learned about one of the key types of methods in Python, and modified a class to fit your printing needs. Your friend is using your code at his company to track decks of cards their company uses. Previously, all of the decks of cards were Bicycle, however, they recently switched to Copag. Write a single line of code so that the brand is changed from "Bicycle" to "Copag" for _both_ decks. - -[source,python] ----- -deck1 = Deck() -deck2 = Deck() -print(deck1) -print(deck2) ----- - -[source,python] ----- -# add code here -print(deck1) -print(deck2) ----- - -.expected output ----- -A copag deck. -A copag deck. ----- - -Once you have that working as intended, explain what is going on. What type of attribute is `brand`? What happens if you did the same thing for the following code? - -[source,python] ----- -deck1 = Deck() -deck2 = Deck() -deck1.brand = "Aviator" -# add code to change both decks to "Copag" -print(deck1) -print(deck2) ----- - -Why does `deck1` now remain as "Aviator" and `deck2` as "Copag"? - -[TIP] -==== -https://stackoverflow.com/questions/58312396/why-does-updating-a-class-attribute-not-update-all-instances-of-the-class[This] stackoverflow post may be useful? -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -++++ - -++++ - -Okay, you are now going to create a new class called a `Player`. This class will be used to represent a player in a game. A player should have the following features: - -- A deck to draw from. -- A _hand_ of cards. -- A _name_ of the player. -- A _draw_ method that draws a card from the deck and adds it to the hand. - -Start by implementing the name attribute. Should the name attribute be a class attribute or an instance attribute? Why? - -Next, implement the very important, `__init__` method. What arguments should be passed to the `__init__` method, and why? - -[TIP] -==== -There should be 3 arguments passed to the `__init__` method. -==== - -As long as the following code runs properly and gives you the expected output (of course, the second two outputs just need to be _consistent_; they don't need to match our results), you are done with this problem. Great work! - -[source,python] ----- -my_deck = Deck() -# create player 1 here -player1 = ... -print(player1) ----- - -.expected output ----- -Chen Chen - -Top 5 cards: [Card(str(2), "clubs"), Card(str(3), "clubs"), Card(str(4), "clubs"), Card(str(5), "clubs"), Card(str(6), "clubs")] ----- - -[source,python] ----- -import random -# create player 2 here -player2 = ... 
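# both players were given the same my_deck object, so shuffling it below
# changes the top cards that both players will see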
-random.shuffle(my_deck) -print(player2) ----- - -.expected output ----- -Amy Sue - -Top 5 cards: [Card(str(q), "hearts"), Card(str(7), "diamonds"), Card(str(5), "spades"), Card(str(4), "diamonds"), Card(str(7), "spades")] ----- - -[source,python] ----- -print(player1) ----- - -.expected output ----- -Chen Chen - -Top 5 cards: [Card(str(q), "hearts"), Card(str(7), "diamonds"), Card(str(5), "spades"), Card(str(4), "diamonds"), Card(str(7), "spades")] ----- - -[NOTE] -==== -We shuffled `my_deck` it makes sense that both players should then have a deck that is equivalently shuffled! -==== - -[IMPORTANT] -==== -Make sure as you are updating the `Player` class, that you are running the code with the new updates to the class before using it. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ - -++++ - -Fantastic! Two common patterns that are important to be able to quickly recognize in many gin rummy games are sets and runs. - -A set is a group of cards with different suits but the same value. In order to qualify as a set, there must be 3 or more cards. - -A run is a group of cards with the same suit with sequential values. In order to qualify as a run, there must be 3 or more cards. - -Before we can write code to see if a given player has a set or a run, we need to modify our `Player` class so our players have a `hand` attribute. For now, the hand attribute can just be a Python list. When the `draw` method is called, a card is removed from the "top" of the deck and appended to the `hand` list. - -In addition, we need to write our first instance method -- `draw`! This method doesn't need to accept any arguments other than `self`, and it should simply remove one card from the deck and add it to the player's hand. Not too bad! Make sure that the following code works. - -[TIP] -==== -The following code may be useful when trying to figure out how to remove a card from a deck. - -[source,python] ----- -print(len(my_deck)) -card = my_deck.cards.pop(0) -print(card) -print(len(my_deck)) ----- -==== - -[source,python] ----- -import random - -fresh_deck = Deck() - -player1 = Player("Dr Ward", fresh_deck) - -# shuffle cards -random.shuffle(fresh_deck) - -player1.draw() -print(player1.hand) - -player1.draw() -print(player1.hand) - -player1.draw() -print(player1.hand) - -print(len(fresh_deck)) ----- - -.expected output ----- -[Card(str(a), "diamonds")] -[Card(str(a), "diamonds"), Card(str(9), "clubs")] -[Card(str(a), "diamonds"), Card(str(9), "clubs"), Card(str(k), "clubs")] -49 ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ - -++++ - -Okay, great! - -Add a new instance method to the `Player` class. It should be called `has_set` and should return `True` if the player has a set (in their hand), and `False` otherwise. - -In the next project, we will discuss some ways to improve the functionality and implement more important features. For now, make sure that the following examples work. - -Run the following code as many times as needed until the result is `True`. Once the result is `True`, print the hand to verify that the player has a set: `print(player1.hand)`. - -[source,python] ----- -import random - -my_deck = Deck() -random.shuffle(my_deck) -player1 = ... 
# create player 1 here -for _ in range(10): # player draws 10 cards from the deck - player1.draw() - -player1.has_set() ----- - -.expected output (eventually) ----- -True ----- - -[source,python] ----- -print(player1.hand) ----- - -.expected output ----- -At least 3 cards with the same _value_. ----- - -[TIP] -==== -The `Counter` function from the `collections` module may be useful here. For example. - -[source,python] ----- -from collections import Counter - -my_list = [1, 1, 2, 3, 4] -my_result = Counter(my_list) - -for key, value in my_result.items(): - print(key, value) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project11.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project11.adoc deleted file mode 100644 index 543139b24..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project11.adoc +++ /dev/null @@ -1,276 +0,0 @@ -= STAT 19000: Project 11 -- Spring 2022 - -**Motivation:** We’d be remiss spending almost an entire semester solving data driven problems in python without covering the basics of classes. Whether or not you will ever choose to use this feature in your work, it is best to at least understand some of the basics so you can navigate libraries and other code that does use it. - -**Context:** We’ve spent nearly the entire semester solving data driven problems using Python, and now we are going to learn about one of the primary features in Python: classes. Python is an object oriented programming language, and as such, much of Python, and the libraries you use in Python are objects which have attributes and methods. In this project we will explore some of the terminology and syntax relating to classes. This is the second in a series of 3 projects focused on reading and writing classes in Python. - -**Scope:** Python, classes - -.Learning Objectives -**** -- Use classes to solve a data-driven problem. -- Understand and identify attributes and methods of a class. -- Differentiate between class attributes and instance attributes. -- Differentiate between instance methods, class methods, and static methods. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Questions - -=== Question 1 - -++++ - -++++ - -[source,python] ----- -from collections import Counter - -class Player: - def __init__(self, name, deck): - self.name = name - self.deck = deck - self.hand = [] - - def __str__(self): - return(f""" - {self.name}\n - Top 5 cards: {self.deck[:5]} - """) - - def draw(self): - card = self.deck.cards.pop(0) - self.hand.append(card) - - def has_set(self): - summarizedhand = Counter(self.hand) - for key, value in summarizedhand.items(): - if value >= 3: - return True - return False - - -class Card: - _value_dict = {"2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8":8, "9":9, "10": 10, "j": 11, "q": 12, "k": 13, "a": 14} - def __init__(self, number, suit): - if str(number).lower() not in [str(num) for num in range(2, 11)] + list("jqka"): - raise Exception("Number wasn't 2-10 or J, Q, K, or A.") - else: - self.number = str(number).lower() - if suit.lower() not in ["clubs", "hearts", "diamonds", "spades"]: - raise Exception("Suit wasn't one of: clubs, hearts, spades, or diamonds.") - else: - self.suit = suit.lower() - - def __str__(self): - return(f'{self.number} of {self.suit.lower()}') - - def __repr__(self): - return(f'Card(str({self.number}), "{self.suit}")') - - def __eq__(self, other): - if self.number == other.number: - return True - else: - return False - - def __lt__(self, other): - if self._value_dict[self.number] < self._value_dict[other.number]: - return True - else: - return False - - def __gt__(self, other): - if self._value_dict[self.number] > self._value_dict[other.number]: - return True - else: - return False - - def __hash__(self): - return hash(self.number) - - -class Deck: - brand = "Bicycle" - _suits = ["clubs", "hearts", "diamonds", "spades"] - _numbers = [str(num) for num in range(2, 11)] + list("jqka") - - def __init__(self): - self.cards = [Card(number, suit) for suit in self._suits for number in self._numbers] - - def __len__(self): - return len(self.cards) - - def __getitem__(self, key): - return self.cards[key] - - def __setitem__(self, key, value): - self.cards[key] = value - - def __str__(self): - return f"A {self.brand.lower()} deck." - ----- - -Recall from the previous project the following. - -Two common patterns that are important to be able to quickly recognize in many gin rummy games are sets and runs. - -A set is a group of cards with different suits but the same value. In order to qualify as a set, there must be 3 or more cards. - -A run is a group of cards with the same suit with sequential values. In order to qualify as a run, there must be 3 or more cards. - -In the final question from the previous project we wrote a method (a function for a class) called `has_set` which returned `True` if the given `Player` had a set or not. This is useful, sure, but not as useful as it could be! - -Write another method called `get_sets` which returns a list of lists, where each nested list contains the cards of a complete set. The results should look something like the following, feel free to run the code many times to see if it looks as if it is working. - -[source,python] ----- -import random - -deck = Deck() -player1 = Player("Alice", deck) -random.shuffle(deck) -for _ in range(20): - player1.draw() - -sets = player1.get_sets() -sets ----- - -.output ----- -[[Card(str(5), "clubs"), Card(str(5), "spades"), Card(str(5), "hearts")], - [Card(str(6), "diamonds"), Card(str(6), "clubs"), Card(str(6), "spades")]] ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 2 - -++++ - -++++ - -++++ - -++++ - -Runs are a bit more complicated to figure out than sets. In order to make things slightly easier, let's write a method called `hand_as_df` that takes a player's hand and converts it into a pandas dataframe with the following columns: `suit`, `numeric_value`, `card`. The first column is just a column with the strings: "spades", "hearts", "diamonds", or "clubs". The second is the numeric value of a given card: 1 through 13. - -[IMPORTANT] -==== -You may want to change your `Card` class so that the value isn't 2-14 but 1-13, where ace is low (1) and only low. -==== - -The final column is the `Card` object itself! - -The following should result in a dataframe. - -[source,python] ----- -import random - -deck = Deck() -player1 = Player("Alice", deck) -random.shuffle(deck) -for _ in range(20): - player1.draw() - -sets = player1.hand_as_df() -sets ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -Okay, now for the more challenging part. Write a method called `get_runs` that returns a list of lists where each list contains the cards of the given run. Note that runs of more than 3 should be in the same list. If a run is 6 or more, it should be represented in a single list, not 2 lists of 3 or more. - -You can run the following code until you can see that your method is working as intended. - -[source,python] ----- -import random - -deck = Deck() -player1 = Player("Alice", deck) -random.shuffle(deck) -for _ in range(20): - player1.draw() - -runs = player1.get_runs() -runs ----- - -.example output ----- -[[Card(str(j), "hearts"), Card(str(q), "hearts"), Card(str(k), "hearts")], - [Card(str(a), "spades"), - Card(str(2), "spades"), - Card(str(3), "spades"), - Card(str(4), "spades"), - Card(str(5), "spades")]] ----- - -Since this question is more challenging than normal, this is the last question. Try to solve this puzzle before looking at the tips below! - -[TIP] -==== -Grouping by `suit` would be a good way to isolate cards of a certain suit. Remember runs can only be with cards of the same suit. - -To group by suit and loop through the groups, you can use the `groupby` method. - -[source,python] ----- -for idx, group in my_df.groupby("suit"): - print(idx) # an index - print(group) # a dataframe with only cards from the same suit - print(group.shape) # note that all the regular data frame methods are available to use ----- -==== - -[TIP] -==== -Think about the following values. Consider the `numeric_value` column, and consider how useful the `difference` column is in our situation. Maybe we could do something with that? - -.values ----- -some_column, numeric_value, difference -1, 1, 0 -2, 2, 0 -3, 3, 0 -4, 5, -1 -5, 6, -1 -6, 8, -2 -7, 9, -2 -7, 9, -2 ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project12.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project12.adoc deleted file mode 100644 index e2bc9dda8..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project12.adoc +++ /dev/null @@ -1,845 +0,0 @@ -= STAT 19000: Project 12 -- Spring 2022 - -**Motivation:** We’d be remiss spending almost an entire semester solving data driven problems in python without covering the basics of classes. Whether or not you will ever choose to use this feature in your work, it is best to at least understand some of the basics so you can navigate libraries and other code that does use it. - -**Context:** We’ve spent nearly the entire semester solving data driven problems using Python, and now we are going to learn about one of the primary features in Python: classes. Python is an object oriented programming language, and as such, much of Python, and the libraries you use in Python are objects which have attributes and methods. In this project we will explore some of the terminology and syntax relating to classes. This is the third in a series of about 3 projects focused on reading and writing classes in Python. - -**Scope:** Python, classes, composition vs. inheritance - -.Learning Objectives -**** -- Use classes to solve a data-driven problem. -- Understand and identify attributes and methods of a class. -- Differentiate between class attributes and instance attributes. -- Differentiate between instance methods, class methods, and static methods. -- Identify when composition is being used or inheritance is being used. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -++++ - -++++ - -In this project, we are going to try something a little bit new. Instead of writing lots of code to generate data and work with cards, we will provide you with code and ask you to read, run, and understand code, and have some fun "simulating" card game information. The only reason we have projects based around classes in Python is because you run into them in the wild, and knowing how to navigate such code will benefit you greatly. It may sound tough, but I promise it is not bad at all, and if it is, post on Piazza and we will adjust or add more hints to help. - -Read the introduction parts of https://realpython.com/inheritance-composition-python/[this] excellent article on composition and inheritance. Read enough of the sections to have a general idea about the difference between inheritance and composition -- no need to read the entire article unless you are interested! - -In your opinion, does the snippet of code provided below make use of techniques closer to inheritance or composition? Explain in 1-2 sentences why. - -[NOTE] -==== -Both inheritance and composition can be extremely useful when coding. There are tradeoffs to both, however, composition will most likely be more useful in the data science world. 
-==== - -[source,python] ----- -import random -from collections import Counter -import pandas as pd -import numpy as np - -class Player: - - def __init__(self, name, strategy): - self.name = name - self.hand = [] - self.strategy = strategy - - def __str__(self): - return self.name - - def draw(self): - self.strategy.draw() - - def discard(self): - self.strategy.discard() - - def can_end_game(self): - return self.strategy.can_end_game(self) - - def should_end_game(self): - return self.strategy.should_end_game(self) - - def make_move(self, game): - return self.strategy.make_move(self, game) - - def get_best_hand(self): - return self.strategy.get_best_hand(self) ----- - -.Items to submit -==== -- Explain in 1-2 sentences why the snippet of code provided makes use of techniques closer to inheritance or composition. -==== - -=== Question 2 - -++++ - -++++ - -Below is the code you may use for this project. There is a lot to unpack, but this is okay, you don't need a perfect understanding of it all to solve the questions in this project. - -[NOTE] -==== -This code is _not_ optimal _at all_. There are tons and tons of improvements that could be made both in design, documentation, readability, etc. It _is_ however, a decent set of code to try and understand. It is inconsistent in comments and docstrings, which is really great to practice on! - -It is also possible that there are bugs -- if you find one, post in Piazza! Your instructors will be super pleased! -==== - -Recall in the previous question's https://realpython.com/inheritance-composition-python/#whats-composition[article] that composition follows a "has a" relationship vs. inheritance's "is a" relationship. - -Take a peek at each of the classes in the code below, and identify 3 "has a" relationships between classes. One example would be a `Game` has a `Deck`. This is shown in the `__init__` function in `Game` where we set `self.deck` to be some provided `deck` object. Why is this pattern useful? Well, if for some reason we wanted to change our game to use a double deck (two decks of cards combined), this would be trivial. We could simply pass that an object that has twice the cards, and make sure that the methods that a regular `Deck` has are also implemented for the supposed `DoubleDeck`, and things should work fine! - -Okay, list out 3 other "has a" relationships between classes. 
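Before digging into the full listing below, here is a minimal, self-contained sketch of the two relationships using toy classes -- these are _not_ the project's classes, they only illustrate the "is a" versus "has a" distinction.

[source,python]
----
# toy classes purely for illustration -- not part of the project code

class SimpleDeck:
    def __init__(self, n_copies=1):
        self.cards = list(range(52)) * n_copies

# inheritance ("is a"): a DoubleDeck IS A SimpleDeck and reuses its behavior
class DoubleDeck(SimpleDeck):
    def __init__(self):
        super().__init__(n_copies=2)

# composition ("has a"): a SimpleGame HAS A deck; anything with a `cards`
# attribute can be passed in, including a DoubleDeck
class SimpleGame:
    def __init__(self, deck):
        self.deck = deck

print(len(SimpleGame(SimpleDeck()).deck.cards))   # 52
print(len(SimpleGame(DoubleDeck()).deck.cards))   # 104
----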
- -[source,python] ----- -import random -from collections import Counter -import pandas as pd -import numpy as np - - -class Card: - _value_dict = {"2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8":8, "9":9, "10": 10, "j": 11, "q": 12, "k": 13, "a": 1} - _gin_value_dict = {"2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8":8, "9":9, "10": 10, "j": 10, "q": 10, "k": 10, "a": 1} - def __init__(self, number, suit): - if str(number).lower() not in [str(num) for num in range(2, 11)] + list("jqka"): - raise Exception("Number wasn't 2-10 or J, Q, K, or A.") - else: - self.number = str(number).lower() - if suit.lower() not in ["clubs", "hearts", "diamonds", "spades"]: - raise Exception("Suit wasn't one of: clubs, hearts, spades, or diamonds.") - else: - self.suit = suit.lower() - - def __str__(self): - return(f'{self.number} of {self.suit.lower()}') - - def __repr__(self): - return(f'Card(str({self.number}), "{self.suit}")') - - def __eq__(self, other): - if self.number == other.number: - return True - else: - return False - - def __lt__(self, other): - if self._value_dict[self.number] < self._value_dict[other.number]: - return True - else: - return False - - def __gt__(self, other): - if self._value_dict[self.number] > self._value_dict[other.number]: - return True - else: - return False - - def __hash__(self): - return hash(self.number) - - -class Deck: - brand = "Bicycle" - _suits = ["clubs", "hearts", "diamonds", "spades"] - _numbers = [str(num) for num in range(2, 11)] + list("jqka") - - def __init__(self): - self.cards = [Card(number, suit) for suit in self._suits for number in self._numbers] - - def __len__(self): - return len(self.cards) - - def __getitem__(self, key): - return self.cards[key] - - def __setitem__(self, key, value): - self.cards[key] = value - - def __str__(self): - return f"A {self.brand.lower()} deck." 
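# The Player class below is constructed with a name and a strategy object; it
# keeps its cards in self.hand and delegates draw, discard, can_end_game,
# should_end_game, make_move, and get_best_hand to that strategy object, while
# hand_as_df, get_sets, and get_runs are implemented on the player directly.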
- - -class Player: - - def __init__(self, name, strategy): - self.name = name - self.hand = [] - self.strategy = strategy - - def __str__(self): - return self.name - - def draw(self): - self.strategy.draw() - - def discard(self): - self.strategy.discard() - - def can_end_game(self): - return self.strategy.can_end_game(self) - - def should_end_game(self): - return self.strategy.should_end_game(self) - - def make_move(self, game): - return self.strategy.make_move(self, game) - - def get_best_hand(self): - return self.strategy.get_best_hand(self) - - def hand_as_df(self, my_cards=None): - if not my_cards: - my_cards = self.hand - - data = {'suit': [], 'numeric_value': [], 'card': []} - for card in my_cards: - data['suit'].append(card.suit) - data['numeric_value'].append(card._value_dict[card.number]) - data['card'].append(card) - - return pd.DataFrame(data=data) - - def get_sets(self, my_cards=None): - - if not my_cards: - my_cards = self.hand - - def _flatten(t): - return [item for sublist in t for item in sublist] - - def _get_cards_with_value(card_with_value, my_cards): - return [card for card in my_cards if card == card_with_value] - - summarized = Counter(my_cards) - sets = [] - for key, value in summarized.items(): - if value > 2: - sets.append(_get_cards_with_value(key, my_cards)) - - set_tuples = [(x._value_dict[x.number], x.suit) for x in _flatten(sets)] - remaining_cards = list(filter(lambda x: (x._value_dict[x.number], x.suit) not in set_tuples, my_cards)) - - return remaining_cards, sets - - def get_runs(self, my_cards=None): - - if not my_cards: - my_cards = self.hand - - def _flatten(t): - return [item for sublist in t for item in sublist] - - # get the hand as a pandas df - df = self.hand_as_df(my_cards) - - # to store complete runs - runs = [] - - # loop through cards by suit - for _, group in df.groupby("suit"): - - # sort the sub dataframe, group, by numeric value - sorted_values = group.sort_values(["numeric_value"]) - - # this is the key. create an auxilliary column that - # is the difference between a column containing a count, - # for example, 1, 2, 3, 4, 5, and the corresponding - # numeric_values. This gives us a value that we can group by - # containing all of the values in a run! - sorted_values['aux'] = np.arange(len(sorted_values['numeric_value'])) - sorted_values['numeric_value'] - - # sub groups here, subdf, will only contain runs now - for _, subdf in sorted_values.groupby('aux'): - - # if the run is more than 2 - if subdf.shape[0] > 2: - - # add the card objects to our list of lists - runs.append(subdf['card'].tolist()) - - run_tuples = [(x._value_dict[x.number], x.suit) for x in _flatten(runs)] - - remaining_cards = list(filter(lambda x: (x._value_dict[x.number], x.suit) not in run_tuples, my_cards)) - - return remaining_cards, runs - - -class Ruleset: - - @staticmethod - def deal(game): - """ - This implementation of deal we will deal - 10 cards each, alternating, starting - with player1. - - Note: We are _not_ using our strategy to - draw cards, but rather just drawing 10 cards - each from the game's deck. - """ - for _ in range(10): - card = game.deck.cards.pop(0) - game.player1.hand.append(card) - - card = game.deck.cards.pop(0) - game.player2.hand.append(card) - - @staticmethod - def first_move(game): - """ - This implementation of first move - will randomly choose a player to start, - that player will draw, discard, etc. - - Afterwords, it will return two values. The - first is a boolean indicating whether or not - to end the game. 
The second is the player object. - - If the boolean indicates to end the game the player - is the player ending the game, otherwise, it is - the player whose turn is next. - """ - player_to_start = random.choice((game.player1, game.player2)) - return player_to_start.make_move(game) - - -class Strategy: - - @staticmethod - def get_best_hand(player): - - def _flatten(t): - return [item for sublist in t for item in sublist] - - # this strategy is to get the runs then sets in that order, - # count the remaining card values, then reverse the process, - # get the sets then runs in that order, then count remaining - # card values - remaining_1 = player.hand - remaining_1, runs1 = player.get_runs() - remaining_1, sets1 = player.get_sets(remaining_1) - - remaining_card_value_1 = 0 - for card in remaining_1: - remaining_card_value_1 += card._gin_value_dict[card.number] - - remaining_2 = player.hand - remaining_2, sets2 = player.get_sets() - remaining_2, runs2 = player.get_runs(remaining_2) - - remaining_card_value_2 = 0 - for card in remaining_2: - remaining_card_value_2 += card._gin_value_dict[card.number] - - if remaining_card_value_1 <= remaining_card_value_2: - return (remaining_1, _flatten(runs1 + sets1)) - else: - return (remaining_2, _flatten(runs2 + sets2)) - - @staticmethod - def draw(player, game): - # strategy to just always draw the face down card - drawn_card = game.deck.cards.pop(0) - player.hand.append(drawn_card) - - @staticmethod - def discard(self, player, game): - # strategy to discard the highest value card not - # part of a set or a run - - # NOTE: This is a strategy that could be improved. - # What if the highest value card is a king of spades, - # and we also have another remaining card that is the - # king of clubs? - - # NOTE: Another way to improve things would be using "deque" - # https://docs.python.org/3/library/collections.html#collections.deque - # prepending to a list is not efficient. - remaining_cards, complete_cards = self.get_best_hand(player) - remaining_cards = sorted(remaining_cards, reverse=True) - - to_discard = remaining_cards.pop(0) - game.discard_pile.insert(0, to_discard) - - # remove from the player's hand - for idx, card in enumerate(player.hand): - if (card._value_dict[card.number], card.suit) == (to_discard._value_dict[to_discard.number], to_discard.suit): - player.hand.pop(idx) - - @staticmethod - def can_end_game(player): - """ - The rules of gin (our version) state that in order to end the game - the value of the non-set, non-run cards must be at most 10. - """ - remaining_cards, _ = player.get_best_hand() - - remaining_value = 0 - for card in remaining_cards: - remaining_value += card._gin_value_dict[card.number] - - return remaining_value <= 10 - - @staticmethod - def should_end_game(player): - """ - Let's say our strategy is to knock as soon as possible. - - NOTE: Maybe a better strategy would be to knock as soon as - possible if only so many turns have occurred? - """ - - if player.can_end_game(): - return True - else: - return False - - def make_move(self, player, game): - """ - A move always consistents of the same operations. - A players draws, discards, decides whether or not - to end the game. - - This function returns two values. The first is a - boolean value that says whether or not the game - should be ended. The second is the player object - of the individual playing the game. If the player - is not ending the game, the player returned is the - player whose turn it is now. 
- """ - # first, we must draw a card - self.draw(player, game) - - # then, we should discard - self.discard(self, player, game) - - # next, we should see if we should end the game - if player.should_end_game(): - # then, we end the game - return True, player - else: - # otherwise, return the player with the next turn - return False, (set(game.get_players()) - set((player,))).pop() - - -class Scorecard: - def __init__(self, player1, player2): - self.player1 = player1 - self.player2 = player2 - self.score = pd.DataFrame(data={"winner": [], f"points": []}) - - def __str__(self): - return f'{self.score.groupby("winner").sum()}' - - def stats(self): - pass - - -class Game: - def __init__(self, scorecard, deck, ruleset, player1, player2): - self.scorecard = scorecard - self.deck = deck - self.discard_pile = [] - self.ruleset = ruleset - self.player1 = player1 - self.player2 = player2 - - # shuffle deck - random.shuffle(self.deck) - - def get_players(self): - return (self.player1, self.player2,) - - def play(self): - """ - Play the game until a player ends the game. - """ - # deal cards according to ruleset - self.ruleset.deal(self) - - # first_move should bring the game's state - # to a consistent state. - - # Example 1: use the rule where the most - # recent loser deals 11 cards to the other player - # and the other player begins by discarding 1 card - - # Example 2: use another variant of the "normal" rule where each player - # is dealt 10 cards and then the remaining cards are - # placed face down and the first card is flipped up - # into the discard pile. A player is chosen at random - # and they can start the game by drawing and then discarding - end_game, player = self.ruleset.first_move(self) - - if end_game: - self.end_game(player) - - while not end_game: - if len(self.deck.cards) <= 2: - # reset game in draw - self.reset_game() - - end_game, player = player.make_move(self) - - self.end_game(player) - - - def end_game(self, game_ender): - """ - Ending a game involves the following process: - - 1. If the player ending the game if "going gin", that player - gets 25 points plus the value of the other players remaining - cards. - 2. The other player can add their remaining cards to any of the game ender's sets or runs. - 3. Now, the value of the remaining cards for the player - ending the game are compared to those of the other player, - after the other player has potentially reduced their remaining - cards in step 2. - 4. If the player ending the game has strictly fewer points, - the player ending the game receives the difference between - their remaining cards and the other players remaining cards. - 5. If the player ending the game has equal to or more points, - the player ending the game has been undercut. The other player - receives 25 points plus the difference between their remaining - cards and the other players remaining cards. 
- """ - - def _flatten(t): - return [item for sublist in t for item in sublist] - - def _get_rid_of_deadwood(game_ender, other_player): - remaining_cards, complete_cards = game_ender.get_best_hand() - other_remaining, other_complete = other_player.get_best_hand() - - combined_remaining1 = other_remaining + complete_cards - combined_remaining1, runs1 = other_player.get_runs(combined_remaining1) - combined_remaining1, sets1 = other_player.get_sets(combined_remaining1) - - combined_remaining2 = other_remaining + complete_cards - combined_remaining2, runs2 = other_player.get_runs(combined_remaining2) - combined_remaining2, sets2 = other_player.get_sets(combined_remaining2) - - remaining_card_value_1 = 0 - for card in combined_remaining1: - remaining_card_value_1 += card._gin_value_dict[card.number] - - remaining_card_value_2 = 0 - for card in combined_remaining2: - remaining_card_value_2 += card._gin_value_dict[card.number] - - if remaining_card_value_1 <= remaining_card_value_2: - # remove the cards used in a set or run from other_remaining - melds = [(x._value_dict[x.number], x.suit) for x in _flatten(runs1) + _flatten(sets1)] - updated_other_remaining = list(filter(lambda x: (x._value_dict[x.number], x.suit) not in melds, other_remaining)) - return updated_other_remaining - else: - melds = [(x._value_dict[x.number], x.suit) for x in _flatten(runs1) + _flatten(sets1)] - updated_other_remaining = list(filter(lambda x: (x._value_dict[x.number], x.suit) not in melds, other_remaining)) - return updated_other_remaining - - # get the "other player" - other_player = (set(self.get_players()) - set((game_ender,))).pop() - - # get both players best hands - remaining_cards, complete_cards = game_ender.get_best_hand() - other_remaining, other_complete = other_player.get_best_hand() - - # is the game ender "going gin"? - if not remaining_cards: - winner = game_ender - points = 25 - for card in other_remaining: - points += card._gin_value_dict[card.number] - - else: - # let the other_player play any deadwood/remaining cards - # they have on the game ender's sets/runs - other_remaining = _get_rid_of_deadwood(game_ender, other_player) - - # compare deadwood - enders_deadwood = 0 - for card in remaining_cards: - enders_deadwood += card._gin_value_dict[card.number] - - other_deadwood = 0 - for card in other_remaining: - other_deadwood += card._gin_value_dict[card.number] - - if enders_deadwood < other_deadwood: - winner = game_ender - points = other_deadwood - enders_deadwood - else: - winner = other_player - points = 25 + (enders_deadwood - other_deadwood) - - # tally score - self.scorecard.score = self.scorecard.score.append({"winner": str(winner), "points": points}, ignore_index=True) - - # get a fresh shuffled deck and clear out hands - self.reset_game() - - def reset_game(self): - # get a fresh shuffled deck and clear out hands - self.deck = Deck() - self.discard_pile = [] - self.player1.hand = [] - self.player2.hand = [] ----- - -.Items to submit -==== -- Identify 3 "has a" relationships between classes, and give a brief explanation (say, 1 sentence) about each of these 3 "has a" relationships. -==== - -=== Question 3 - -++++ - -++++ - -Use the provided code to create the following objects: - -- A `Strategy` object that `player1` and `player2` (see below) will use. -- A `Deck` object for the game. -- A `Ruleset` object for the game. -- A `Player` object called `player1` that represents the first player. -- A `Player` object called `player2` that represents the second player. 
-- A `Scorecard` object for the game between these two players. -- A `Game` object that uses the objects you've created. - -Once you have your `Game` created, go ahead and play a game using the `play` method! After you've played a game, print the `Scorecard` object you created. Typically Gin is played over and over until one player gets 100 points. Play another game using the `play` method. Print the `Scorecard` object again -- did it change as you would expect? - -.Items to submit -==== -- Show the game play as you test the code. -==== - -=== Question 4 - -++++ - -++++ - -Typically, the way Gin works is you would play a "game" with the other player. The winner would get points. These points are tracked until the first player gets to 100 points. Once that happens, the winner would get a single "set point". You could then track these "set points" over many days/months/years to keep track of who wins the most, etc. Or, you could agree to play until the first person gets to 3 (or any other arbitrary rule). - -If you were to `play` many games of Gin from the previous question, you would notice that the scorecard would just grow and grow. Currently there is not logic added that keeps track of whether or a player has won a set, winning a "set point". - -Write code that simulates a game of Gin that goes until one of the players gets to 3 "set points". Print the final `Scorecard` after each won "set point". Make sure to create a fresh game with a fresh `Scorecard` between each won "set point" (or, if you have another way you'd like to tackle this problem, feel free!). At the end of the simulation, print the final score, for example: - ----- -#... -print(scorecard) -# points -# winner -# David 26.0 -# Kali 50.0 -#... -# code to print final score... -# Final score: -# David: 2 -# Kali: 3 ----- - -This definition of `game_over` might be useful for your work: - ----- -def game_over(scorecard): - winning_scoreboard = scorecard.score.groupby("winner").sum().reset_index().loc[scorecard.score.groupby("winner").sum().reset_index()['points'] >= 100.0, :] - return winning_scoreboard['winner'], winning_scoreboard.shape[0] > 0.0 ----- - -[TIP] -==== -You can access the scorecard as a dataframe by `scorecard.score`. -==== - -.Items to submit -==== -- Show the game play as you test the code. -==== - -=== Question 5 - -++++ - -++++ - -Composition allows us to do one very powerful thing with the code that we've written -- it allows us to quickly adopt and test out different playing strategies. The following is the `Strategy` we provided for you. 
- -[source,python] ----- -class Strategy: - - @staticmethod - def get_best_hand(player): - - def _flatten(t): - return [item for sublist in t for item in sublist] - - # this strategy is to get the runs then sets in that order, - # count the remaining card values, then reverse the process, - # get the sets then runs in that order, then count remaining - # card values - remaining_1 = player.hand - remaining_1, runs1 = player.get_runs() - remaining_1, sets1 = player.get_sets(remaining_1) - - remaining_card_value_1 = 0 - for card in remaining_1: - remaining_card_value_1 += card._gin_value_dict[card.number] - - remaining_2 = player.hand - remaining_2, sets2 = player.get_sets() - remaining_2, runs2 = player.get_runs(remaining_2) - - remaining_card_value_2 = 0 - for card in remaining_2: - remaining_card_value_2 += card._gin_value_dict[card.number] - - if remaining_card_value_1 <= remaining_card_value_2: - return (remaining_1, _flatten(runs1 + sets1)) - else: - return (remaining_2, _flatten(runs2 + sets2)) - - @staticmethod - def draw(player, game): - # strategy to just always draw the face down card - drawn_card = game.deck.cards.pop(0) - player.hand.append(drawn_card) - - @staticmethod - def discard(self, player, game): - # strategy to discard the highest value card not - # part of a set or a run - - # NOTE: This is a strategy that could be improved. - # What if the highest value card is a king of spades, - # and we also have another remaining card that is the - # king of clubs? - - # NOTE: Another way to improve things would be using "deque" - # https://docs.python.org/3/library/collections.html#collections.deque - # prepending to a list is not efficient. - remaining_cards, complete_cards = self.get_best_hand(player) - remaining_cards = sorted(remaining_cards, reverse=True) - - to_discard = remaining_cards.pop(0) - game.discard_pile.insert(0, to_discard) - - # remove from the player's hand - for idx, card in enumerate(player.hand): - if (card._value_dict[card.number], card.suit) == (to_discard._value_dict[to_discard.number], to_discard.suit): - player.hand.pop(idx) - - @staticmethod - def can_end_game(player): - """ - The rules of gin (our version) state that in order to end the game - the value of the non-set, non-run cards must be at most 10. - """ - remaining_cards, _ = player.get_best_hand() - - remaining_value = 0 - for card in remaining_cards: - remaining_value += card._gin_value_dict[card.number] - - return remaining_value <= 10 - - @staticmethod - def should_end_game(player): - """ - Let's say our strategy is to knock as soon as possible. - - NOTE: Maybe a better strategy would be to knock as soon as - possible if only so many turns have occurred? - """ - - if player.can_end_game(): - return True - else: - return False - - def make_move(self, player, game): - """ - A move always consistents of the same operations. - A players draws, discards, decides whether or not - to end the game. - - This function returns two values. The first is a - boolean value that says whether or not the game - should be ended. The second is the player object - of the individual playing the game. If the player - is not ending the game, the player returned is the - player whose turn it is now. 
- """ - # first, we must draw a card - self.draw(player, game) - - # then, we should discard - self.discard(self, player, game) - - # next, we should see if we should end the game - if player.should_end_game(): - # then, we end the game - return True, player - else: - # otherwise, return the player with the next turn - return False, (set(game.get_players()) - set((player,))).pop() ----- - -Copy and paste the code above to create your own class `MyStrategy`. Modify the code in `MyStrategy` to do something different. Try to not modify the method arguments or return types, as this will cause the need for more modification. Here are some examples of changes you could make: - -- Modify the `draw` method to check if the top card (card at index 0 of the `discard_pile`) would create a new set or run, and if so, choose to draw from the `discard_pile` instead of the `deck`. -- Modify the `discard` method to not discard a partial set or run -- a set or run of two cards, where you just need 1 more to complete it. -- Modify the `should_end_game` method to only end the game if the player has "deadwood" under a certain value. -- Modify the `should_end_game` method to only end the game if the player has 0 "deadwood" (i.e. if they have Gin, or can "go Gin"). -- Modify the strategy to give a player perfect memory -- i.e. they can remember all of the `discard_pile`, and use this to change the strategy (harder). - -This is really cool, because you could test out, computationally, many different strategies to see what increases your odds of winning! For now, simulate a full game (like in the previous question) where one player has the default strategy, and the other has your new `MyStrategy`. Did the player with your new strategy end up winning? In the next project we will experiment more with your new strategy. - -[IMPORTANT] -==== -Gin is not hard to learn, and it is a game of skill (meaning the odds of winning are not the same for someone with skill as someone without skill, not a game of purely luck. - -https://www.gamecolony.com/gin_rummy_game_online.shtml?done[This] site has a pretty short 1 page explanation of the rules. Here is a quick breakdown of the version we've implemented. - -- Each player is dealt 10 cards. -- A random player is chosen to start the game. -- The first player makes their move. With the default strategy, the player draws the facedown card from the `Deck`. -- The player discards a card face up in the `discard_pile`. -- The next player draws a card from either the `deck` or the `discard_pile` -- the default strategy is to always draw from the `deck`. -- The player then discards a card. -- This repeats until a player decides to end the game. -- A player can end the game by knocking or going gin. -- In order to "go gin", a player must be able to make full sets and/or runs from all 10 cards (note that the 11th card is _always_ discarded). -- If a player goes gin, they get 25 points _plus_ the value of the opponent's cards that _don't_ belong to a run or a set. -- Otherwise a player may choose to end the game by knocking. -- In order to knock, a player must have cards with total value less than or equal to 10 points that _are not_ a part of a set or run (again, with a final of only 10 cards -- we always discard the 11th card before ending the game or at the end of each turn). All other cards must be a part of a set or run. These "remaining cards", or cards that are not a part of a set or a run are called "deadwood". 
-- The opponent gets the opportunity to add their deadwood onto the knocking player's complete runs or sets. Any deadwood added to the knocker's runs and/or sets is no longer deadwood. -- If the player that knocks has a total value of deadwood less than (strictly) than the total value of the opponent's deadwood, they win the amount of points in the difference between the total value of their deadwood and their opponent's. -- If the player that knocks has equal to or greater total value of deadwood than their opponent, the knocker got _undercut_. The opponent then wins 25 points _plus_ the difference between their deadwood and the knockers. -- The first player to get to 100 points wins a "set point". -- Rinse and repeat until a player has 3 "set points" or until some other predetermined criteria is met. -==== - -.Items to submit -==== -- Show your modified strategy, discuss what you changed, and show how it works. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project13.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project13.adoc deleted file mode 100644 index 7abb6a116..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project13.adoc +++ /dev/null @@ -1,295 +0,0 @@ -= STAT 19000: Project 13 -- Spring 2022 - -**Motivation:** We covered a _lot_ this year! When dealing with data driven projects, it is useful to explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance, in this project we are going to practice using some of the skills you've learned, and review topics and languages in a generic way. - -**Context:** We are on the final stretch of two projects where there will be an assortment of "random" questions that may involve various datasets (and languages/tools). We _may_ even ask a question that asks you to use a tool you haven't used before -- but don't worry, if we do, we will provide you with extra guidance. - -**Scope:** Python, R, bash, unix, computers - -.Learning Objectives -**** -- Use the cumulative knowledge you've built this semester to answer a variety of data-driven questions. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/flights/subset/2008.parquet` -- `/depot/datamine/data/coco/unlabeled2017/000000000008.jpg` -- `/depot/datamine/data/movies_and_tv/imdb.db` - -== Questions - -[IMPORTANT] -==== -Answer the questions below using the language of your choice (R, Python, bash, awk, etc.). Don't feel limited by one language, you can use different languages to answer different questions (maybe one language would be easier to do something in). If you are feeling bold, you can also try answering the questions using all languages! 
-==== - -=== Question 1 - -++++ - -++++ - -Read in the file `2008.parquet` into a `pandas` dataframe and convert the column `DepTime` to a datetime. Print the first 50 converted values. They should match the following. - -.results ----- -0 13:43:00 -1 11:25:00 -2 20:09:00 -3 09:03:00 -4 14:23:00 -5 20:24:00 -6 17:53:00 -7 06:22:00 -8 19:44:00 -9 14:53:00 -10 20:30:00 -11 07:08:00 -12 17:49:00 -13 12:17:00 -14 09:54:00 -15 17:58:00 -16 22:10:00 -17 07:40:00 -18 10:11:00 -19 16:12:00 -20 11:24:00 -21 08:24:00 -22 21:12:00 -23 06:41:00 -24 17:13:00 -25 14:14:00 -26 19:15:00 -27 09:29:00 -28 12:14:00 -29 13:18:00 -30 17:35:00 -31 09:04:00 -32 15:55:00 -33 08:07:00 -34 09:26:00 -35 15:44:00 -36 19:31:00 -37 22:06:00 -38 06:58:00 -39 12:54:00 -40 13:43:00 -41 00:13:00 -42 NaT -43 08:40:00 -44 20:58:00 -45 10:35:00 -46 15:29:00 -47 17:13:00 -48 06:29:00 -49 13:15:00 -Name: DepTime, dtype: object ----- - -[TIP] -==== -The `apply` method from the `pandas` library can be useful here. You can also use the string method `zfill` to zero-pad a string. -==== - -[TIP] -==== -The `strptime` codes are here: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -Use the `PIL` package to loop through the pixels in this image, `/depot/datamine/data/coco/unlabeled2017/000000000008.jpg`, and use f-strings to report the percentage of pixels that are "mostly green", "mostly blue", and "mostly red". The output should look like the following. - -.results ----- -red: 2.66% -green: 9.88% -blue: 87.46% ----- - -[TIP] -==== -To view the image: - -[source,python] ----- -from IPython.display import Image -Image("/depot/datamine/data/coco/unlabeled2017/000000000008.jpg") ----- -==== - -[TIP] -==== -These links should be helpful: - -https://www.nemoquiz.com/python/loop-through-pixel-data/ - -https://stackoverflow.com/questions/6444548/how-do-i-get-the-picture-size-with-pil - -https://datagy.io/python-f-strings/ -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -List the number of titles by year `premiered` in the `imdb.db` database. Don't know SQL? That is 100% fine! Read the documentation https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_table.html#pandas.read_sql_table[here], and work with `pandas` dataframes. - -[TIP] -==== -Can't figure out how to go through all of the data without having the kernel crash? That's okay! If you don't want to do this right now, it is okay to simply give the results for the first 10k movies: - -.sample of expected results for first 10k ----- - type -premiered -1892 3 -1893 1 -1894 6 -1895 19 -1896 104 -1897 37 -1898 45 -1899 47 -1900 82 -1901 35 -1902 36 -1903 57 -1904 21 -1905 32 -1906 41 -1907 49 -1908 157 -1909 306 -1910 362 -1911 508 -1912 600 -1913 978 -1914 1225 -1915 1465 -1916 1235 -1917 1200 -1918 1015 -1919 307 -1920 15 -1921 5 -1922 2 -1925 4 -1936 1 ----- -==== - -[TIP] -==== -If you want to process the entire table of the database, great! The key is to use the chunksize argument. This returns an _iterator_ -- something you can loop over. If you set `chunksize=10000`, in each iteration of your loop, the value you are using in your loop will be equal to a dataframe with 10000 rows! Simply _group by_ `premiered`, and count the values. Use `pd.concat`, and sum! 
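-
-For example, a minimal sketch of this chunked approach (assuming the `titles` table and its `premiered` column shown above -- adjust the query and column names as needed for your own solution) might look like the following.
-
-[source,python]
-----
-# read the premiered column of the titles table in chunks of 10,000 rows,
-# count titles per year within each chunk, then combine the partial counts
-import sqlite3
-import pandas as pd
-
-con = sqlite3.connect("/depot/datamine/data/movies_and_tv/imdb.db")
-
-partial_counts = []
-for chunk in pd.read_sql("SELECT premiered FROM titles", con, chunksize=10000):
-    partial_counts.append(chunk.groupby("premiered").size())
-
-# concatenate the per-chunk counts and sum them by year
-result = pd.concat(partial_counts).groupby(level=0).sum()
-print(result)
-----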
The results (a sample, at least): - -.sample of results ----- -premiered -1874.0 1.0 -1877.0 1.0 -1878.0 2.0 -1881.0 1.0 -1883.0 1.0 - ... -2024.0 66.0 -2025.0 14.0 -2026.0 9.0 -2027.0 6.0 -2028.0 3.0 ----- -==== - -[TIP] -==== -Want to use SQL? Okay! You can run sql queries on this database from within a Jupyter Notebook cell. For example: - -[source,ipython] ----- -%load_ext sql -%sql sqlite:////depot/datamine/data/movies_and_tv/imdb.db ----- - -[source,ipython] ----- -%%sql - -SELECT * FROM titles LIMIT 5; ----- - -https://the-examples-book.com/book/sql/aggregate-functions#group-by[This] section will be helpful! -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ - -++++ - -Check out the following two datasets: - -- `/depot/datamine/data/okcupid/filtered/users.csv` -- `/depot/datamine/data/okcupid/filtered/questions.csv` - -How many men (as defined by the `gender2` column) believe and don't believe in ghosts? How about women (as defined by the `gender2` column)? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ - -++++ - -Get the total dollar amount of liquor sold in the `/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt` dataset. - -[NOTE] -==== -This dataset is about 3.5 GB in size -- this is more than you will be able to load in our Jupyter Notebooks in a `pandas` data frame. You'll have to explore a different strategy to solve this! -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project14.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project14.adoc deleted file mode 100644 index 25783e5b1..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-project14.adoc +++ /dev/null @@ -1,187 +0,0 @@ -= STAT 19000: Project 14 -- Spring 2022 - -**Motivation:** We covered a _lot_ this year! When dealing with data driven projects, it is useful to explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance, in this project we are going to practice using some of the skills you've learned, and review topics and languages in a generic way. - -**Context:** We are on the final stretch of two projects where there will be an assortment of "random" questions that may involve various datasets (and languages/tools). We _may_ even ask a question that asks you to use a tool you haven't used before -- but don't worry, if we do, we will provide you with extra guidance. - -**Scope:** Python, R, bash, unix, computers - -.Learning Objectives -**** -- Use the cumulative knowledge you've built this semester to answer a variety of data-driven questions. 
-**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/airbnb/**/reviews.csv.gz` -- `/depot/datamine/data/election/itcont2022.txt` -- `/depot/datamine/data/death_records/DeathRecords.csv` - -== Questions - -=== Question 1 - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -Scan through the `reviews.csv.gz` files in `/depot/datamine/data/airbnb/*` and find the 10 most common `reviewer_name` values. - -[TIP] -==== -The `pathlib` library will be particularly useful here. - -In particular, check out the example(s) in the https://docs.python.org/3/library/pathlib.html#basic-use[basic use] section. The `**` part of `**/*.py` means roughly "every directory and subdirectory, recursively". -==== - -[TIP] -==== -You can read `csv` files directly from `.gz` files using the argument `compression="gzip"`. -==== - -[TIP] -==== -The following is an example of one way you could sum the values of a dictionary. - -[source,python] ----- -d1 = {'a': 1, 'b': 2, 'c': 3} -d2 = {'c': 4, 'd': 5, 'a': 6} - -result = {key: d1.get(key, 0) + d2.get(key, 0) for key in set(d1) | set(d2)} ----- -==== - -[TIP] -==== -Test your code on a few of the `reviews.csv.gz` files before running it for all files. Running it for all files will take around 3 or so minutes. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -After completing question 1, it is likely you have a solid understanding on how the data is organized. Add some logic to your code from question 1 to instead print the 5 most common names _for each country_. - -If your `$HOME` country (haha) is in the list -- do the names sound about right? What kind of bias does this data likely show? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -Checkout the newest set of election data `/depot/datamine/data/election/itcont2022.txt`. Let's say we are interested in all entries (rows) that have the word "purdue" in it (of course, this may include entries that don't relate to Purdue University, but we are okay with that error). - -This is around 5 GB of data, and only a small fraction of that has relevant information. In `pandas`, there is not an ergonomic way to check if a row of data has a string in it. This is where knowing how to use _multiple_ tools will come in handy! - -There is a tool called `grep` that can _very_ quickly search large text files for certain text. We will learn more about `grep` (and other useful command line utilities) in STAT 29000. With that being said, why not figure out how to use `grep` to create a subset of data to read into `pandas` that is _already_ filtered -- it isn't too bad! - -Use `grep` to create a subset of data called `my_election_data.txt`. `my_election_data.txt` should contain only the rows that have the word "purdue" in it. `my_election_data.txt` should live in your `$HOME` directory: `/home/purduealias/my_election_data.txt`. - -. Use grep to find only rows with the word "purdue" in them (case _insensitive_). Use _redirection_ to save the output to `$HOME/my_election_data.txt`. -+ -[TIP] -==== -You can use the `-i` flag to make your `grep` search case insensitive -- this means that rows with "Purdue" or "purdue" or "PuRdUe" would be found. 
-==== -+ -[TIP] -==== -You can run `grep` from within Jupyter Notebooks using the `%%bash` magic. For example, the following would find the word "apple" in a dataset and create a new file called "my_new_file.csv" in my `$HOME` directory. - -[source,python] ----- -%%bash - -grep "apple" /depot/datamine/data/yelp/data/json/yelp_academic_dataset_review.json > $HOME/my_new_file.csv ----- -==== -+ -[TIP] -==== -In order to insert the header line into your newly created file, you can run the following `sed` command directly after your `grep` command. - -[source,bash] ----- -sed -i '1 i\CMTE_ID|AMNDT_IND|RPT_TP|TRANSACTION_PGI|IMAGE_NUM|TRANSACTION_TP|ENTITY_TP|NAME|CITY|STATE|ZIP_CODE|EMPLOYER|OCCUPATION|TRANSACTION_DT|TRANSACTION_AMT|OTHER_ID|TRAN_ID|FILE_NUM|MEMO_CD|MEMO_TEXT|SUB_ID' $HOME/my_election_data.txt ----- -==== -+ -. Use `pandas` to read in your newly created, _much smaller_ dataset, `$HOME/my_election_data.txt`. - -Finally, print the `EMPLOYER`, `NAME`, `OCCUPATION`, and `TRANSACTION_AMT`, for the top 10 donations (by size). - -You may notice that each row represents a single donation. Group the data by the `NAME` column to get the total amount of donation per individual. What is the `NAME` of the top donor? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ - -++++ - -What is the average age of death for individuals who were married, single, divorced, widowed, or unknown? - -Further split the data by `Sex` -- do the same patterns hold? Dig in a bit and notice that _how_ we look at the data can make a _very_ big difference! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ - -++++ - -It has been a fun year. We hope that you learned something new! - -- Write down 3 (or more) of your least favorite topics and/or projects from this past year (for STAT 19000). -- Write down 3 (or more) of your favorite projects and/or topics you wish you were able to learn _more_ about. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-projects.adoc b/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-projects.adoc deleted file mode 100644 index 2440d975e..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/19000/19000-s2022-projects.adoc +++ /dev/null @@ -1,41 +0,0 @@ -= STAT 19000 - -== Project links - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$19000-s2022-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. 
We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -=== Sign up - -https://piazza.com/purdue/fall2021/stat19000[https://piazza.com/purdue/fall2021/stat19000] - -=== Link - -https://piazza.com/purdue/fall2021/stat19000/home[https://piazza.com/purdue/fall2021/stat19000/home] - -== Syllabus - -See xref:spring2022/logistics/s2022-syllabus.adoc[here]. diff --git a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project01.adoc b/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project01.adoc deleted file mode 100644 index 436ab0d78..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project01.adoc +++ /dev/null @@ -1,270 +0,0 @@ -= STAT 29000: Project 1 -- Spring 2022 - -**Motivation:** Extensible Markup Language or XML is a very important file format for storing structured data. Even though formats like JSON, and csv tend to be more prevalent, many, many legacy systems still use XML, and it remains an appropriate format for storing complex data. In fact, JSON and csv are quickly becoming less relevant as new formats and serialization methods like parquet and protobufs are becoming more common. - -**Context:** This is the first project in a series of 5 projects focused on web scraping in Python, with a focus on XML. - -**Scope:** Python, XML - -.Learning Objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Match XML terms to sections of XML demonstrating a working knowledge of the format. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/otc/hawaii.xml` - -== Questions - -=== Question 1 - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -[WARNING] -==== -Please review our updated xref:book:projects:submissions.adoc[submission guidelines] before submitting your project. -==== - -[TIP] -==== -For this project, you may find the questions and solutions of an old project found https://thedatamine.github.io/the-examples-book/projects.html#p01-290[here] useful. -==== - -[TIP] -==== -You should read through the small xref:book:data:xml.adoc[XML section] of the book. -==== - -[TIP] -==== -You should read through the https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html[10 minute Pandas tutorial]. -==== - -One of the challenges of XML is that it can be hard to get a feel for how the data is structured -- especially in a large XML file. A good first step is to find the name of the root node. Use the `lxml` package to find and print the name of the root node. - -Interesting! 
If you took a look at the previous project, you _probably_ weren't expecting the extra `{urn:hl7-org:v3}` part in the root node name. This is because the previous project's dataset didn't have a namespace! Namespaces in XML are a way to prevent issues where a document may have multiple sets of node names that are identical but have different meanings. The namespaces allow them to exist in the same space without conflict. - -Practically what does this mean? It makes XML parsing ever-so-slightly more annoying to perform. Instead of being able to enter XPath expressions and return elements, we have to define a namespace as well. This will be made more clear later. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -XML can be nested -- there can be elements that contain other elements that contain other elements. In the previous question, we identified the root node AND the namespace. Just like in the previous Spring 290 project 1 (linked in the "tip" in question 1), we would like you to find the names of the next "tier" of elements. - -This will not be a copy/paste of the previous solution. Why? Because of the namespace! - -First, try to use the same method from question (2) from https://thedatamine.github.io/the-examples-book/projects.html#p01-290[this project] to find the next tier of names. What happens? - -[source,python] ----- -hawaii.xpath("/document") # won't work -hawaii.xpath("{urn:hl7-org:v3}document") # still won't work with the namespace there ----- - -How do we fix this? We must define our namespace, and reference it in our XPath expression. For example, the following will work. - -[source,python] ----- -hawaii.xpath("/ns:document", namespaces={'ns': 'urn:hl7-org:v3'}) ----- - -Here, we are passing a dict to the namespaces argument. The key is whatever we want to call the namespace, and the value is the namespace itself. For example, the following would work too. - -[source,python] ----- -hawaii.xpath("/happy:document", namespaces={'happy': 'urn:hl7-org:v3'}) ----- - -So, unfortunately, _every_ time we want to use an XPath expression, we have to prepend `namespace:` before the name of the element we are looking for. This is a pain, and unfortunately we cannot just define it once and move on. - -Okay, given this new information, please find the next "tier" of elements. - -[TIP] -==== -There should be 8. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -Okay, lucky for you, this XML file is not so big! Use your UNIX skills you refined last semester to print the content of the XML file. You can print the entirety in a `bash` cell if you wish. - -You will be able to see that it contains information about a drug of some sort. - -Knowing now that there are `ingredient` elements in the XML file. Write Python code, and an XPath expression to get a list of all of the `ingredient` elements. Print the list of elements. - -[NOTE] -==== -When we say "print the list of elements", we mean to print the list of elements. For example, the first element would be: - ----- - - - - DIBASIC CALCIUM PHOSPHATE DIHYDRATE - - ----- -==== - -To print an `Element` object, see the following. - -[source,python] ----- -print(etree.tostring(my_element, pretty_print=True).decode('utf-8')) ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 4 - -++++ - -++++ - -++++ - -++++ - -At this point in time you may be wondering how to actually access the bits and pieces of data in the XML file. - -There is data between tags. - -[source,xml] ----- -DIBASIC CALCIUM PHOSPHATE DIHYDRATE ----- - -To access such data from the "name" `Element` (which we will call `my_element` below) you would do the following. - -[source,python] ----- -my_element.text # DIABASIC CALCIUM PHOSPHATE DIHYDRATE ----- - -There is also data tucked away in a tag's attributes. - -[source,xml] ----- - ----- - -To access such data from the "name" `Element` (which we will call `my_element` below) you would do the following. - -[source,python] ----- -my_element.attrib['code'] # O7TSZ97GEP -my_element.attrib['codeSystem'] # 2.16.840.1.113883.4.9 ----- - -The aspect of XML that we are interested in learning about are XPath expressions. XPath expressions are a clear and effective way to extract elements from an XML document (or HTML document -- think extracting data from a webpage!). - -In the previous question you used an XPath expression to find all of the `ingredient` elements, regardless where they were or how they were nested in the document. Let's practice more. - -If you look at the XML document, you will see that there are a lot of `code` attributes. Use `lxml` and XPath expressions to first extract all elements with a `code` attribute. Print all of the values of the `code` attributes. - -Repeat the process, but modify your XPath expression so that it only keeps elements that have a `code` attribute that starts with a capital "C". Print all of the values of the `code` attributes. - -[TIP] -==== -You can use the `.attrib` attribute to access the attributes of an `Element`. It is a dict-like object, so you can access the attributes similarly to how you would access the values in a dictionary. -==== - -[TIP] -==== -https://stackoverflow.com/questions/6895023/how-to-select-xml-element-based-on-its-attribute-value-start-with-heading-in-x/6895629[This] link may help you when figuring out how to select the elements where the `code` attribute must start with "C". -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ - -++++ - -The `quantity` element contains a `numerator` and a `denominator` element. Print all of the quantities in the XML file, where a quantity is defined as the value of the `value` attribute of the `numerator` element divided by the value of the `value` attribute of the corresponding `denominator` element. Lastly, print the `unit` (part of the `numerator` element afterwards. - -[TIP] -==== -The results should read as follows: - ----- -1.0 1 -5.0 g -7.6 mg -5.0 g -4.0 g -230.0 mg -4.0 g ----- -==== - -[TIP] -==== -You may need to use the `float` function to convert the string values to floats. -==== - -[TIP] -==== -You can use the `xpath` method on an `Element` object. When doing so, if you want to limit the scope of your XPath expression, make sure to start the xpath with ".//ns:" this will start the search from within the element instead of searching the entire document. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project02.adoc b/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project02.adoc deleted file mode 100644 index b2a4557ef..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project02.adoc +++ /dev/null @@ -1,170 +0,0 @@ -= STAT 29000: Project 2 -- Spring 2022 - -**Motivation:** Web scraping is is the process of taking content off of the internet. Typically this goes hand-in-hand with parsing or processing the data. Depending on the task at hand, web scraping can be incredibly simple. With that being said, it can quickly become difficult. Typically, students find web scraping fun and empowering. - -**Context:** In the previous project we gently introduced XML and xpath expressions. In this project, we will learn about web scraping, scrape data from a news site, and parse through our newly scraped data using xpath expressions. - -**Scope:** Python, web scraping, XML - -.Learning Objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Use the requests package to scrape a web page. -- Use the lxml package to filter and parse data from a scraped web page. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -You will be extracting your own data from online in this project -- there is no provided dataset. - -== Questions - -=== Question 1 - -++++ - -++++ - -The Washington Post is a very popular news site. Open a modern browser (preferably Firefox or Chrome), and navigate to https://www.washingtonpost.com. - -[NOTE] -==== -Throughout this project, I will be referencing text and tools from Firefox. If you want the easiest experience, I'd recommend using Firefox for at least this project. -==== - -By the end of this project you will be able to scrape some data from this website! The first step is to explore the structure of the website. - -To begin exploring the website structure right click on the webpage and select "View Page Source". This will pull up a page full of HTML. This is the HTML used to render the page. - -Alternatively, if you want to focus on a single element on the web page, for example, an article title, right click on the title and select "Inspect". This will pull up an inspector that allows you to see portions of the HTML. - -Click around on the website and explore the HTML however you like. - -Open a few of the articles shown on the front page of the paper. Note how many of the articles start with some key information like: category, article title, picture, picture caption, authors, article datetime, etc. - -For example: - -https://www.washingtonpost.com/health/2022/01/19/free-n95-masks/ - -image::figure33.webp[Article components, width=792, height=500, loading=lazy, title="Article components"] - -Copy and paste the `header` element that is 1 level nested in the `main` element. Include _just_ the tag with the attributes -- don't include the elements nested within the `header` element. 
- -Do the same for the `article` element that is 1 level nested in the `main` element (after the `header` element). - -Put the pasted elements in a markdown cell and surround the content with a markdown html code block. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -In question (1) we copied two elements of an article. When scraping data from a website, it is important to continually consider the patterns in the structure. Specifically, it is important to consider whether or not the defining characteristics you use to parse the scraped data will continue to be in the same format for new data. What do I mean by defining characterstic? I mean some combination of tag, attribute, and content from which you can isolate the data of interest. - -For example, given a link to a new Washington Post article, do you think you could isolate the article title by using the `class` attribute, `class="b-l br-l mb-xxl-ns mt-xxs mt-md-l pr-lg-l col-8-lg mr-lg-l"`? Maybe, or maybe not. It looks like those classes are used to structure the size, font, and other parts of the article. In a different article those may change, or maybe they wouldn't be _unique_ within the page (for example, if another element had the same set of classes in the same page). - -Write an XPath expression to isolate the article title, and another XPath expression to isolate the article summary or sub headline. - -[IMPORTANT] -==== -You do _not_ need to test your XPath expression yet, we will be doing that shortly. -==== - -[NOTE] -==== -Remember the goal of the XPath expression is to write it in such a way that we can take _any_ Washington Post article and extract the data we want. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -Use the `requests` package to scrape the web page containing our article from questions (1) and (2). Use the `lxml.html` package and the `xpath` method to test out the XPath expressions you created in question (2). Use the expressions to extract the element, then print the _contents_ of the elements (what is between the tags). Did they work? Print the element contents to confirm. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Use your newfound knowledge of XPath expressions, `lxml`, and `requests` to write a function called `get_article_links` that scrapes the home page for The Washington Post, and returns 5 article links in a list. - -There are a variety of ways to do this, however, make sure it is repeatable, and _only_ gets article links. - -[TIP] -==== -The `data-*` attributes are particularly useful for this problem. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ - -++++ - -++++ - -++++ - - -Write a function called `get_article_info` that accepts a link to an article as an argument, and prints the information in the following format: - -.Example output ----- -Category: Health -Title: White House to distribute 400 million free N95 masks starting next week -Authors: Lena H. Sun, Dan Diamond -Time: Yesterday at 5:00 a.m. EST ----- - -[IMPORTANT] -==== -Of course, the Time section may change. Just use whatever text they use in the article. It could be an actual date or something like in the example where it said "Yesterday...". 
-==== - -In a loop, test out the `get_article_info` function with the links that are returned by your `get_article_links` function. - -[TIP] -==== -For the authors part, you may find it very difficult to find a single, repeatable way to extract the authors. The reason is that the authors are listed twice within the article. Consider first finding both groups of authors and _then_ use the first of the 2 resulting elements as a starting point to find the authors. You can use the ".//" to search the current element. -==== - -[TIP] -==== -For the time part, you may find it difficult to print both the "Yesterday at" and the actual time portion. Use a similar trick that you used for the authors. First find the node with "Yesterday at" text, then use that node as a starting point to find the next node that contains the time info. If you are stumped -- make a post in Piazza! -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. - -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project03.adoc b/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project03.adoc deleted file mode 100644 index f3e8351e5..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project03.adoc +++ /dev/null @@ -1,330 +0,0 @@ -= STAT 29000: Project 3 -- Spring 2022 - -**Motivation:** Web scraping takes practice, and it is important to work through a variety of common tasks in order to know how to handle those tasks when you next run into them. In this project, we will use a variety of scraping tools in order to scrape data from https://zillow.com. - -**Context:** In the previous project, we got our first taste at actually scraping data from a website, and using a parser to extract the information we were interested in. In this project, we will introduce some tasks that will require you to use a tool that let's you interact with a browser, selenium. - -**Scope:** python, web scraping, selenium - -.Learning Objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Use the requests package to scrape a web page. -- Use the lxml package to filter and parse data from a scraped web page. -- Use selenium to interact with a browser in order to get a web page to a desired state for scraping. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -++++ - -++++ - -++++ - -++++ - -Pop open a browser and visit https://zillow.com. Many websites have a similar interface -- a bold and centered search bar for a user to interact with. - -First, in your browser, type in `34474` into the search bar and press enter/return. There are two possible outcomes of this search, depending on the computer you are using and whether or not you've been browsing zillow. The first is your search results. The second a page where the user is asked to select which type of listing they would like to see. 
-
-This second option may or may not consistently pop up. For this reason, we've included a simplified version of the relevant HTML below, for your convenience.
-
-[source,html]
-----
-<div>
-  <!-- simplified structure; the class and data-* attributes from the real page are omitted here -->
-  <h2>What type of listings would you like to see?</h2>
-  <ul>
-    <li><button>For sale</button></li>
-    <li><button>For rent</button></li>
-    <li><button>Skip this question</button></li>
-  </ul>
-</div>
----- - -[TIP] -==== -Remember that the _value_ of an element is the text that is displayed between the tags. For example, the following element has "happy" as its value. - -[source,html] ----- -
-<div>happy</div>
----- - -You can use XPath expressions to find elements by their value. For example, the following XPath expression will find all `div` elements with the value "happy". - ----- -//div[text()='happy'] ----- -==== - -Use `selenium`, and write Python code that first finds the search bar `input` element. Then, use `selenium` to emulate typing the zip code `34474` into the search bar followed by a press of the enter/return button. - -Confirm your code works by printing the current URL of the page _after_ the search has been performed. What happens? Well, it is likely that the URL is unchanged. Remember, the "For sale", "For rent", "Skip this question" page may pop up, and this page has the _same_ URL. To confirm this, instead of printing the URL, instead print the HTML after the search. - -[TIP] -==== -To print the HTML of an element using `selenium`, you can use the following code. - -[source,python] ----- -element = driver.find_element_by_xpath("//some_xpath") -print(element.get_attribute("outerHTML")) ----- - -If you don't know what HTML to expect, the `html` element is a safe bet. - -[source,python] ----- -element = driver.find_element_by_xpath("//html") -print(element.get_attribute("outerHTML")) ----- - -Of course, please only print a sample of the HTML -- we don't want to print it all -- that would be a lot! -==== - -[TIP] -==== -Remember, in the background, `selenium` is actually launching a browser -- just like a human would. Sometimes, you need to wait for a page to load before you can properly interact with it. It is highly recommended you use the `time.sleep` function to wait 5 seconds after a call to `driver.get` or `element.send_keys`. -==== - -[TIP] -==== -One downside to selenium is it has some more boilerplate code than, `requests`, for example. Please use the following code to instantiate your `selenium` driver on Brown. - -[source,python] ----- -from selenium import webdriver -from selenium.webdriver.firefox.options import Options -from selenium.webdriver.common.desired_capabilities import DesiredCapabilities -import uuid - -firefox_options = Options() -firefox_options.add_argument("window-size=1920,1080") -# Headless mode means no GUI -firefox_options.add_argument("--headless") -firefox_options.add_argument("start-maximized") -firefox_options.add_argument("disable-infobars") -firefox_options.add_argument("--disable-extensions") -firefox_options.add_argument("--no-sandbox") -firefox_options.add_argument("--disable-dev-shm-usage") -firefox_options.add_argument('--disable-blink-features=AutomationControlled') - -# Set the location of the executable Firefox program on Brown -firefox_options.binary_location = '/depot/datamine/bin/firefox/firefox' - -profile = webdriver.FirefoxProfile() - -profile.set_preference("dom.webdriver.enabled", False) -profile.set_preference('useAutomationExtension', False) -profile.update_preferences() - -desired = DesiredCapabilities.FIREFOX - -# Set the location of the executable geckodriver program on Scholar -uu = uuid.uuid4() -driver = webdriver.Firefox(log_path=f"/tmp/{uu}", options=firefox_options, executable_path='/depot/datamine/bin/geckodriver', firefox_profile=profile, desired_capabilities=desired) ----- - -Please feel free to "reset" your driver (for example, if you've lost track of "where" it is or you aren't getting results you expected) by running the following code, followed by the code shown above. 
- -[source,python] ----- -driver.quit() - -# instantiate driver again ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -Okay, let's go forward with the assumption that we will always see the "For sale", "For rent", and "Skip this question" page. We need our code to handle this situation and click the "Skip this question" button so we can get our search results! - -Write Python code that uses `selenium` to find the "Skip this question" button and click it. Confirm your code works by printing the current URL of the page _after_ the button has been clicked. - -[TIP] -==== -Don't forget, it may be best to put a `time.sleep(5)` after the `click()` method call -- _before_ printing the current URL. -==== - -Uh oh! If you did this correctly, it is likely that the URL is not quite right -- something like: `https://www.zillow.com/homes/_rb/`. By default, this URL will place the nearest city in the search bar -- this is _not_ what we wanted. On the bright side, we _did_ notice (when doing this search manually) that the URL _should_ look like: `https://www.zillow.com/homes/34474_rb/` -- we can just insert our zip code directly in the URL and that should work without any fuss, _plus_ we save some page loads and clicks. Great! - -[NOTE] -==== -If you are paying close attention -- you will find that this is an inconsistency between using a browser manually and using `selenium`. `selenium` isn't saving the same data (cookies and local storage) as your browser is, and therefore doesn't "remember" the zip code you are search for after that intermediate "For sale", "For rent", and "Skip this question" step. Luckily, modifying the URL works better anyways. -==== - -Test out (using `selenium`) that simply inserting the zip code in the URL works as intended. Finding the `title` element and printing the contents should verify quickly that it works as intended. - -[source,python] ----- -element = driver.find_element_by_xpath("//title") -print(element.get_attribute("outerHTML")) ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -Okay great! Take your time to open a browser to `https://www.zillow.com/homes/34474_rb/` and use the Inspector to figure out how the web page is structured. For now, let's not worry about any of the filters. The main useful content is within the cards shown on the page. Price, number of beds, number of baths, square feet, address, etc., is all listed within each of the cards. - -What non `li` element contains the cards in their entirety? Use `selenium` and XPath expressions to extract those elements from the web page. Print the value of the `id` attributes for all of the cards. How many cards was there? (this _could_ vary depending on when the data was scraped -- that is ok) - -[TIP] -==== -You can use the `id` attribute in combination with the `starts-with` XPath function to find these elements, because each `id` starts with the same 4-5 letter prefix. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ - -++++ - -++++ - -++++ - -Write code to print the mean price of each of the cards on the page, as well as the mean square footage. Print the values. - -[CAUTION] -==== -Uh oh! Once again, something is not working right. If you were to dig in, you'd find that only about 10 or so cards contain their data. 
This is because the cards are lazy-loaded. What this means is that you must _scroll_ in order for the rest of the info to show up. You can verify this if you scroll super fast. You'll notice even if the page was loaded for 10 seconds, that content at the bottom will take a second to load after scrolling fast. - -To fix this problem -- we need to scroll! Try the following code. Of course, fill in the `find_element_by_xpath` method call with the correct XPath expression (for both calls). You'll notice that _before_ we scroll the 10th element will not contain the data we are looking for, but _after_ our scrolling it will! Super cool! - -[source,python] ----- -from selenium.common.exceptions import StaleElementReferenceException - -cards = driver.find_elements_by_xpath("...") -print(cards[30].get_attribute("outerHTML")) - -# Let's load every 2 cards or so at a time -for idx, card in enumerate(cards): - if idx % 2 == 0: - try: - driver.execute_script('arguments[0].scrollIntoView();', card) - time.sleep(2) - - except StaleElementReferenceException: - # every once in a while we will get a StaleElementReferenceException - # because we are trying to access or scroll to an element that has changed. - # this probably means we can skip it because the data has already loaded. - continue - -cards = driver.find_elements_by_xpath("...") -print(cards[30].get_attribute("outerHTML")) ----- -==== - -[TIP] -==== -Your project writer is mean. Of course not every card contains a house -- some of it is land. Unfortunately, land doesn't have a square footage on the website! Do something similar to the following to skip over those annoying plots of land. (and don't forget to fill in the xpaths) - -[source,python] ----- -from selenium.common.exceptions import NoSuchElementException -import sys -import re - -prices = [] -sq_ftgs = [] -for ct, card in enumerate(cards): - try: - sqft = card.find_element_by_xpath("...").text - sqft = re.sub('[^0-9.]', '', sqft) - - # if there isn't any sq footage skip the card entirely - if sqft == '': - continue - - price = card.find_element_by_xpath("...").text - price = re.sub('[^0-9.]', '', price) - - # if there isn't any price skip the card entirely - if price == '': - continue - - sq_ftgs.append(float(sqft)) - prices.append(float(price)) - - except NoSuchElementException: - # verify that it is a plot of land, if not, panic - is_lot = 'land' in card.find_element_by_xpath(".//ul[@class='list-card-details']/li[2]").text.lower() - if not is_lot: - print("NOT LAND") - print(card.find_element_by_xpath(".//ul[@class='list-card-details']/li[2]").text) - sys.exit(0) - else: - continue - -print(sum(prices)/len(prices)) -print(sum(sq_ftgs)/len(sq_ftgs)) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ - -++++ - -Update your code from question (4) to first filter the homes by the number of bedrooms and bathrooms. Let's look at some bigger homes. Filter to get houses with 4+ bedrooms and 3+ bathrooms. Recalculate the mean price and square footage for said houses. Print the values. - -[TIP] -==== -To apply said filters, you will need to emulate 3 clicks. One to activate the menu of filters, another to select the number of bedrooms, and another to select the number of bathrooms. You should be able to use a combination of element type (div/button/span/etc.) and attributes to accomplish this. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 6 (optional, 0 pts) - -Package your code up into a function that let's you choose the zip code, number of bedrooms, and number of bathrooms. Experiment with the function for different combinations and print your results. If you really want to have some fun create an interesting graphic to show your results. - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project04.adoc b/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project04.adoc deleted file mode 100644 index 192c77530..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project04.adoc +++ /dev/null @@ -1,302 +0,0 @@ -= STAT 29000: Project 4 -- Spring 2022 - -**Motivation:** Learning to scrape data can take time. We want to make sure you get comfortable with it! For this reason, we will continue to scrape data from Zillow to answer various questions. This will allow you to continue to get familiar with the tools, without having to re-learn everything about the website of interest. - -**Context:** This is the second to last project on web scraping, where we will continue to focus on honing our skills using `selenium`. - -**Scope:** Python, web scraping, selenium, matplotlib/plotly - -.Learning Objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Use the requests package to scrape a web page. -- Use the lxml package to filter and parse data from a scraped web page. -- Use the beautifulsoup4 package to filter and parse data from a scraped web page. -- Use selenium to interact with a browser in order to get a web page to a desired state for scraping. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -++++ - -++++ - -++++ - -++++ - -[TIP] -==== -If you struggled with Project 3, be sure to check out the solutions for Project 3! They will be posted early Monday, February 7, at the end of each problem, and may be helpful. -==== - -Before we get started, the following is the boiler plate code that we provided you in the previous project to help you get started. You may use this again for this project. 
- -[source,python] ----- -from selenium import webdriver -from selenium.webdriver.firefox.options import Options -from selenium.webdriver.common.desired_capabilities import DesiredCapabilities -import uuid - -firefox_options = Options() -firefox_options.add_argument("window-size=1920,1080") -# Headless mode means no GUI -firefox_options.add_argument("--headless") -firefox_options.add_argument("start-maximized") -firefox_options.add_argument("disable-infobars") -firefox_options.add_argument("--disable-extensions") -firefox_options.add_argument("--no-sandbox") -firefox_options.add_argument("--disable-dev-shm-usage") -firefox_options.add_argument('--disable-blink-features=AutomationControlled') - -# Set the location of the executable Firefox program on Brown -firefox_options.binary_location = '/depot/datamine/bin/firefox/firefox' - -profile = webdriver.FirefoxProfile() - -profile.set_preference("dom.webdriver.enabled", False) -profile.set_preference('useAutomationExtension', False) -profile.update_preferences() - -desired = DesiredCapabilities.FIREFOX - -# Set the location of the executable geckodriver program on Scholar -uu = uuid.uuid4() -driver = webdriver.Firefox(log_path=f"/tmp/{uu}", options=firefox_options, executable_path='/depot/datamine/bin/geckodriver', firefox_profile=profile, desired_capabilities=desired) ----- - -In addition, the following is a function that will create a zillow link from a given search text. This is almost like using the search bar on the home page. This function accepts a search string and returns a link to the results. - -[source,python] ----- -def search_link(text: str) -> str: - """ - Given a string of search text, return a link - that is essentially the same result as - using Zillow search. - """ - return f"https://www.zillow.com/homes/{text.replace(' ', '-')}_rb/" ----- - -In this project, when we say "search for 47906", we mean start scraping using `selenium` in the following manner. - -[source,python] ----- -driver.get(search_link("47906")) ----- - -Write a function called `next_page` that will accept your `driver` and returns a string with the URL to the next page if it exists, and `False` otherwise. - -Test your function in the following way. Results should closely match the results below, but may change as listings are constantly being updated. - -[source,python] ----- -driver.get("https://www.zillow.com/austin-tx/17_p/") -print(next_page(driver)) - -driver.get(next_page(driver)) -print(next_page(driver)) ----- - -.Output ----- -https://www.zillow.com/austin-tx/18_p/ -False ----- - -[IMPORTANT] -==== -There may be more or fewer pages of results when you do this project. Change the starting page from "17_p" to the second to last page of results so that we can test that your function returns `False` when there is not a next page. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -Search for 47906 and return the median price of the default listings that appear. Unlike in the previous project where we found the mean, this time, be sure to include _all_ pages of listings. The function you wrote from the previous question could be useful! - -[TIP] -==== -There are a _lot_ of ways you can solve this problem. -==== - -[TIP] -==== -Don't forget to scroll so that the cards load up properly! You can use the following function to make sure the driver scrolls through the page so cards are loaded up. 
- -[source,python] ----- -from selenium.common.exceptions import StaleElementReferenceException - -def load_cards(driver): - """ - Given the driver, scroll through the cards - so that they all load. - """ - cards = driver.find_elements_by_xpath("//article[starts-with(@id, 'zpid')]") - for idx, card in enumerate(cards): - if idx % 2 == 0: - try: - driver.execute_script('arguments[0].scrollIntoView();', card) - time.sleep(2) - - except StaleElementReferenceException: - # every once in a while we will get a StaleElementReferenceException - # because we are trying to access or scroll to an element that has changed. - # this probably means we can skip it because the data has already loaded. - continue ----- -==== - -[TIP] -==== -On 2/2/2022, the result was $152000. -==== - -[TIP] -==== -To get the median of a list of values, you can use: - -[source,python] ----- -import statistics -statistics.median(list_of_values) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -Compare median values for (each of) 3 different locations, and use `plotly` to create a plot showing the 3 median prices in these 3 locations. Make sure your plot is well-labeled. - -[TIP] -==== -It may help to pack the solution to the previous question into a clean function. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ - -++++ - -You may or may not have noticed, however, you can access the home or plot of land details by appending the `zpid` at the end of the URL. For example, if the card had a `zpid` of `50630217`, we could navigate to https://www.zillow.com/homedetails/50630217_zpid/ and be presented with the details of the property with that `zpid`. - -You can extract the `zpid` from the `id` attribute of the cards. - -Write a function called `get_history` that accepts the driver and a `zpid` (like 50630217) and returns a `pandas` DataFrame with a column `date` and column `price`, with a single row entry for each item in the "Price history" section on Zillow. - -The following is an example of the expected output -- if your solution doesn't match exactly, that is okay and could be the result of the house changing. - -[source,python] ----- -get_history(driver, '2900086') ----- - -.Output ----- -date price -0 2022-01-05 1449000.0 -1 2021-12-08 1499000.0 -2 2021-07-27 1499000.0 -3 2021-04-16 1499000.0 -4 2021-02-12 1599000.0 -5 2006-05-22 NaN -6 1999-06-04 NaN ----- - -To help get you started, here is a skeleton function for you to fill in. - -[source,python] ----- -def get_history(driver, zpid: str): - """ - Given the driver and a zpid, return a - pandas dataframe with the price history. 
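    Assumes `time`, `re`, and `pandas` (as `pd`) have already been
    imported in your notebook, since the body below uses all three.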
- """ - # get the details page and wait for 5 seconds - driver.get(f"https://www.zillow.com/homedetails/{zpid}_zpid/") - time.sleep(5) - - # get the price history table -- it is always the first table - price_table = driver.find_element_by_xpath("//table") - - # get the dates - dates = price_table.find_elements_by_xpath(".//FILL HERE") - dates = [d.text for d in dates] - - # get the prices - prices = price_table.find_elements_by_xpath(".//FILL HERE") - - # remove extra percentage data, remove non numeric data from prices - prices = [re.sub("[^0-9]","", p.text.split(' ')[0]) for p in prices] - - # create the dataframe and convert types - dat = pd.DataFrame(data={'date': dates, 'price': prices}) - dat['price'] = pd.to_numeric(dat['price']) - dat['date'] = dat['date'].astype('datetime64[ns]') - - return dat ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ - -++++ - -Write a function called `show_me_plots` that accepts the driver and a "search string" and displays a `plotly` plot with 4 subplots, each containing a plot of the price history of 4 random properties from the _complete_ search results (meaning any 4 properties from all of the pages of results could potentially be plotted. - -Test out your function on a couple of search strings! - -[TIP] -==== -https://plotly.com/python/subplots/ has some examples of subplots using plotly. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project05.adoc b/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project05.adoc deleted file mode 100644 index 1aef6f5d9..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project05.adoc +++ /dev/null @@ -1,230 +0,0 @@ -= STAT 29000: Project 5 -- Spring 2022 - -**Motivation:** This project we are continuing to focus on your web scraping skills, introducing you to some common deceptions, and give you the chance to apply what you've learned to something you are interested in. - -**Context:** This is the last project focused on web scraping. -We have created a few problmatic situations that can come up when you are first learning to scrape data. Our goal is to share some tips on how to get around the issue. - -**Scope:** Python, web scraping - -.Learning Objectives -**** -- Use the requests package to scrape a web page. -- Use the lxml/selenium package to filter and parse data from a scraped web page. -- Learn how to step around header-based filtering. -- Learn how to handle rate limiting. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -++++ - -++++ - -We have setup 4 special URLs for you to use: https://static.tdm.wiki, https://ip.tdm.wiki, https://header.tdm.wiki, https://sheader.tdm.wiki. 
Each website uses different methods to _rate limit_ traffic. - -Rate limiting is an issue that comes up often when scraping data. When a website notices that the pages are being navigated faster than humans are able to, the speed causes the website to throw a red flag to the website host that some one could be scraping their content. This is when the website will introduce a rate limit to prevent web scraping. - -If you are able to open a browser and navigate to https://sheader.tdm.wiki. Once there you should be presented with some basic information about the request. - - Now let us check to see what happens if you open up your Jupyter notebook, import the `requests` package and scrape the webpage. - -You _should_ be presented with HTML that indicates your request was blocked. - -https://sheader.tdm.wiki is designed to block all requests where the User-Agent header has "requests" in it. By default, the `requests` package will use the User-Agent header with a value of "python-requests/2.26.0", which has "requests" in it. - -Backing up a little bit, _headers_ are part of your _request_. In general, you can think of headers as some extra data that gives the server or client some context about the request. You can read about headers https://developer.mozilla.org/en-US/docs/Glossary/Request_header[here]. You can find a list of the various headers https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers[here]. - -Each header has a purpose. One common header is called https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent[User-Agent]. A user-agent is something like: - ----- -User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:86.0) Gecko/20100101 Firefox/86.0 ----- - -From the Mozilla link, this header is a string that "lets servers and network peers identify the application, operating system, and browser that is requesting a resource." Basically, if you are browsing the internet with a browser like Firefox or Chrome, the server will know which browser you are using. In the provided example, we are using Firefox 86 from Mozilla, on a Mac running Mac OS 10.16 with an Intel processor. - -When making a request using the `requests` package, the following is what the headers look like. - -[source,python] ----- -import requests - -response = requests.get("https://sheader.tdm.wiki") -print(response.request.headers) ----- - -.Output ----- -{'User-Agent': 'python-requests/2.26.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'} ----- - -As you can see, the User-Agent header has the word "requests" in it, so it will be blocked. - -You can set the headers to be whatever you'd like using the `requests` package. Simply pass a dictionary containing the headers you'd like to use to the `headers` argument of `requests.get`. Modify the headers so you are able to scrape the response. Print the response using the following code. - -[source,python] ----- -my_response = requests.get(...) -print(my_response.text) ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -Navigate to https://header.tdm.wiki. Refresh the page a few times, as you do you will notice how the "Cf-Ray" header is changes. -Write a function called `get_ray` that accepts a url as an argument, and scrapes the _value_ of the Cf-Ray header and return the text. - -Run the following code. - -[source,python] ----- -for i in range(6): - print(get_ray('https://header.tdm.wiki')) ----- - -What happens then? 
Now pop open the webpage in a browser and refresh the page 6 times in rapid succession. What do you see?

Run the following code again, but this time use a different header.

[source,python]
----
for i in range(6):
    print(get_ray('https://header.tdm.wiki', headers={...}))
----

This website is designed to adapt and block requests if they have the same header and make requests too quickly. Create a https://github.com/tamimibrahim17/List-of-user-agents[list] of valid user agents and modify your code to use them to get 50 "Cf-Ray" values rapidly (in a loop).

[TIP]
====
You may want to modify `get_ray` to accept a `headers` argument.
====

.Items to submit
====
- Code used to solve this problem.
- Output from running the code.
====

=== Question 3

++++

++++

Navigate to https://ip.tdm.wiki. This page is designed to allow only 5 requests every minute from a single IP address. To verify that this is true, go ahead and rapidly refresh the page 6 times in a row, then (without wifi) try to load the page on your cell phone immediately after. You will notice that the cell phone loads, but the browser doesn't.

IP blocking is one of the most common ways to block traffic. Websites will monitor web activity and use complicated algorithms to block IP addresses that appear to be scraping data. The solution is either to scrape content at a slower pace, or to figure out a way to use different IP addresses.

Simply slowing down the pace of your scraping will not always work. Unfortunately, even if we randomize the amount of time between requests, the detection algorithms are clever.

The best way to bypass IP blocking is to use a different IP address. We can accomplish this by using a proxy server. A proxy server is another computer that passes the request on for you. The relayed request is then made from behind the proxy server's IP address.

The following code attempts to scrape some free proxy servers.

[source,python]
----
import requests
import lxml.html

def get_proxies():
    url = "https://www.sslproxies.org/"
    resp = requests.get(url)
    root = lxml.html.fromstring(resp.text)
    trs = root.xpath("//tr")

    # each row of the table holds an ip address and a port
    proxies_aux = []
    for e in trs[1:]:
        ip = e.xpath(".//td")[0].text
        port = e.xpath(".//td")[1].text
        proxies_aux.append(f"{ip}:{port}")

    # format the first 25 proxies the way the requests `proxies` argument expects
    proxies = []
    for proxy in proxies_aux[:25]:
        proxies.append({'http': f'http://{proxy}', 'https': f'http://{proxy}'})

    return proxies
----

Play around with the code and test proxy servers out until you find one that works. The following code should help get you started.

[source,python]
----
p = get_proxies()
resp = requests.get("https://ip.tdm.wiki", proxies=p[0], verify=False, headers={'User-Agent': f"{my_user_agents[0]}"}, timeout=15)
print(resp.text)
----

A couple of notes:

- `timeout` is set to 15 seconds, because it is likely the proxy will not work if it takes longer than 15 seconds to respond.
- We set a user-agent header so some proxy servers won't automatically block our requests. Here, `my_user_agents` is the list of user agents you put together in question (2).

You can stop once you receive and print a successful response. As you will see, unless you pay for a working set of proxy servers, it is very difficult to combat having your IP blocked.

.Items to submit
====
- Code used to solve this problem.
- Output from running the code.
====

=== Question 4

++++

++++

Test out https://static.tdm.wiki. This page is designed to only allow x requests per period of time, regardless of the IP address or headers.
- -Write code that scrapes 50 Cf-Ray values from the page. If you attempt to scrape them too quickly, you will get an error. Specifically, `response.status_code` will be 429 instead of 200. - -[source,python] ----- -resp = requests.get("https://static.tdm.wiki") -resp.status_code # will be 429 if you scrape too quickly ----- - -Different websites have different rules, one way to counter this defense is by exponential backoff. Exponential backoff is a system whereby you scrape a page until you receive some sort of error, then you wait x seconds before scraping again. Each time you receive an error, the wait time increases exponentially. - -There is a really cool package that does this for us! Use the https://pypi.org/project/backoff/[backoff] package to accomplish this task. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ - -++++ - -For full credit you can choose to do either option 1 or option 2. - -**Option 1:** Figure out how many requests (_r_) per time period (_p_) you can make to https://static.tdm.wiki. Keep in mind that the server will only respond to _r_ requests per time period (_p_) -- this means that fellow students requests will count towards the quota. Figure out _r_ and _p_. Answers do not need to be exact. - -**Option 2:** Use your skills to scrape data from a website we have not yet scraped. Once you have the data create something with it, you can create a graphic, perform some sort of analysis etc. The only requirement is that you scrape at least 100 "units". A "unit" is 1 thing you are scraping. For example, if scraping baseball game data, I would need to scrape the height of 100 players, or the scores of 100 games, etc. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project06.adoc b/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project06.adoc deleted file mode 100644 index 5522d2a0d..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project06.adoc +++ /dev/null @@ -1,116 +0,0 @@ -= STAT 29000: Project 6 -- Spring 2022 - -**Motivation:** Being able to analyze and create good visualizations is a skill that is invaluable in many fields. It can be pretty fun too! In this project, we are going to dive into plotting using `matplotlib` with an open project. - -**Context:** We've been working hard all semester and learning a lot about web scraping. In this project we are going to ask you to examine some plots, write a little bit, and use your creative energies to create good visualizations about the flight data using the go-to plotting library for many, `matplotlib`. In the next project, we will continue to learn about and become comfortable using `matplotlib`. - -**Scope:** Python, matplotlib, visualizations - -.Learning Objectives -**** -- Demonstrate the ability to create basic graphs with default settings. -- Demonstrate the ability to modify axes labels and titles. 
-- Demonstrate the ability to customize a plot (color, shape/linetype). -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/flights/subset/*.csv` - -== Questions - -=== Question 1 - -++++ - -++++ - -[WARNING] -==== -When submitting your .ipynb file for this project, if the .ipynb file doesn't render in Gradescope, please export the notebook as a PDF and submit that as well -- you will be helping the graders a lot! -==== - -http://stat-computing.org/dataexpo/2009/posters/[Here] is the description of the 2009 Data Expo poster competition. The object of the competition was to visualize interesting information from the flights dataset. - -The winner of the competition were: - -- First place: https://llc.stat.purdue.edu/airline/wicklin-allison.pdf[Congestion in the sky: Visualising domestic airline traffic with SAS (pdf, 550k)] Rick Wicklin and Robert Allison, SAS Institute. - -- Second place: https://llc.stat.purdue.edu/airline/hofmann-cook.pdf[Delayed, Cancelled, On-Time, Boarding ... Flying in the USA (pdf, 13 meg)] Heike Hofmann, Di Cook, Chris Kielion, Barret Schloerke, Jon Hobbs, Adam Loy, Lawrence Mosley, David Rockoff, Yuanyuan Huang, Danielle Wrolstad and Tengfei Yin, Iowa State University. - -- Third place: https://llc.stat.purdue.edu/airline/wickham.pdf[A tale of two airports: An exploration of flight traffic at SFO and OAK. (pdf, 770k)] Charlotte Wickham, UC Berkeley. - -- Honourable mention: https://llc.stat.purdue.edu/airline/dey-phillips-steele.pdf[Minimizing the Probability of Experiencing a Flight Delay (pdf, 7 meg)] Tanujit Dey, David Phillips and Patrick Steele, College of William & Mary. - -The other posters were: - -- https://llc.stat.purdue.edu/airline/kane-emerson.pdf[The Airline Data Set... What's the big deal? (pdf, 80k)] Michael Kane and Jay Emerson, Yale. - -- https://llc.stat.purdue.edu/airline/sun.pdf[Make a Smart Choice on Booking Your Flight! (pdf, 2 meg)] Yu-Hsiang Sun, Case Western Reserve University. - -- https://llc.stat.purdue.edu/airline/crotty.pdf[Airline Data for Raleigh-Durham International] Michael T. Crotty, SAS Institute Inc. - -- https://llc.stat.purdue.edu/airline/jiang.pdf[What Airlines Would You Avoid for Your Next Flight?] Haolai Jiang and Jung-Chao Wang, Western Michigan University. - -Examine all 8 posters and write a single sentence for each poster with your first impression(s). An example of an impression that will not get full credit would be: "My first impression is that this poster is bad and doesn't look organized.". An example of an impression that will get full credit would be: "My first impression is that the author had a good visualization-to-text ratio and it seems easy to follow along.". - -.Items to submit -==== -- 8 bullets, each containing a sentence with the first impression of the 8 visualizations. Order should be "first place", to "honourable mention", followed by "other posters" in the given order. Or, label which graphic each sentence is about. -==== - -=== Question 2 - -https://www.amazon.com/dp/0985911123/[Creating More Effective Graphs] by Dr. Naomi Robbins and https://www.amazon.com/dp/0963488414/[The Elements of Graphing Data] by Dr. William Cleveland at Purdue University, are two excellent books about data visualization. 
Read the following excerpts from the books (respectively), and list 2 things you learned, or found interesting from each book. - -- https://thedatamine.github.io/the-examples-book/files/CreatingMoreEffectiveGraphs.pdf[Excerpt 1] -- https://thedatamine.github.io/the-examples-book/files/ElementsOfGraphingData.pdf[Excerpt 2] - -.Items to submit -==== -- Two bullets for each book with items you learned or found interesting. -==== - -=== Question 3 - -Of the 7 posters with at least 3 plots and/or maps, choose 1 poster that you think you could improve upon or "out plot". Create 4 plots/maps that either: - -. Improve upon a plot from the poster you chose, or -. Show a completely different plot that does a good job of getting an idea or observation across, or -. Ruin a plot. Purposefully break the best practices you've learned about in order to make the visualization misleading. (limited to 1 of the 4 plots) - -For each plot/map where you choose to do (1), include 1-2 sentences explaining what exactly you improved upon and how. Point out some of the best practices from the 2 provided texts that you followed. - -For each plot/map where you choose to do (2), include 1-2 sentences explaining your graphic and outlining the best practices from the 2 texts that you followed. - -For each plot/map where you choose to do (3), include 1-2 sentences explaining what you changed, what principle it broke, and how it made the plot misleading or worse. - -While we are not asking you to create a poster, please use Jupyter notebooks to keep your plots, code, and text nicely formatted and organized. The more like a story your project reads, the better. In this project, we are restricting you to use `matplotlib` in Python. While there are many interesting plotting packages like `plotly` and `plotnine`, we really want you to take the time to dig into `matplotlib` and learn as much as you can. - -.Items to submit -==== -- All associated Python code you used to wrangling the data and create your graphics. -- 4 plots (and the Python code to produce the plots). -- 1-2 sentences per plot explaining what exactly you improved upon, what best practices from the texts you used, and how. If it is a brand new visualization, describe and explain your graphic, outlining the best practices from the 2 texts that you followed. If it is the ruined plot you chose, explain what you changed, what principle it broke, and how it made the plot misleading or worse. -==== - -=== Question 4 - -Now that you've been exploring data visualization, copy, paste, and update your first impressions from question (1) with your updated impressions. Which impression changed the most, and why? - -.Items to submit -==== -- 8 bullets with updated impressions (still just a sentence or two) from question (1). -- A sentence explaining which impression changed the most and why. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project07.adoc b/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project07.adoc deleted file mode 100644 index 3bf64a1bc..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project07.adoc +++ /dev/null @@ -1,138 +0,0 @@ -= STAT 29000: Project 7 -- Spring 2022 - -**Motivation:** Being able to analyze and create good visualizations is a skill that is invaluable in many fields. It can be pretty fun too! In this project, we will create plots using the `plotly` package, as well as do some data manipulation using `pandas`. - -**Context:** This is the second project focused around creating visualizations in Python. - -**Scope:** plotly, Python, pandas - -.Learning Objectives -**** -- Demonstrate the ability to create basic graphs with default settings. -- Demonstrate the ability to modify axes labels and titles. -- Demonstrate the ability to customize a plot (color, shape/linetype). -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/disney/total.parquet` -- `/depot/datamine/data/disney/metadata.csv` - -== Questions - -=== Question 1 - -++++ - -++++ - -Read in the data from the parquet file `/depot/datamine/data/disney/total.parquet` and store it in a variable called `dat`. Do the same for `/depot/datamine/data/disney/metadata.csv` and call it `meta`. - -Plotly express makes it really easy to create nice, clean graphics, and it integrates with `pandas` superbly. You can find links to all of the plotly express functions on https://plotly.com/python/plotly-express/[this] page. - -Let's start out simple. Create a bar chart for the total number of observations for each ride. Make sure your plot has labels for the x axis, y axis, and overall plot. - -[WARNING] -==== -While the default plotly plots look amazing and have great interactivity, they won't render in your notebook well in Gradescope. For this reason, please use `fig.show(renderer="jpg")` for all of your plots, otherwise they will not show up in gradescope and you will not get full credit. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -Great! Wouldn't it be interesting to see how the total number of observations changes over time, by ride? - -Create a single plot that contains a bar plot for all of the rides, with the total number of observations for each ride by year. - -[TIP] -==== -https://plotly.com/python/bar-charts/[This] page has a good example of making facetted subplots. -==== - -[TIP] -==== -First, create a new column called `year` based on the `year` from the `datetime` column. - -Next, group by both the `ride_name` and `year` columns. Use the `count` method to get the total number of observations for each combination of `ride_name` and `year`. After that, use the `reset_index` method so that both `ride_name` and `year` become columns again (instead of indices). - -The x axis should be the `year`, y axis could be `datetime` (which actually contains the _count_ of observations), the color argument should be `year`, `facet_col` should be `ride_name`, and you can limit the number of plots per column by specifying `facet_col_wrap` to be 4 (for 4 plots per row). 
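Putting those steps together, a minimal sketch might look something like the code below. It assumes `dat` is the dataframe you read in for question (1), that its `datetime` column already has a datetime type, and that the label and title strings are just placeholders for you to adjust.

[source,python]
----
import plotly.express as px

# year of each observation
dat['year'] = dat['datetime'].dt.year

# number of observations for each ride/year combination --
# reset_index turns ride_name and year back into regular columns
counts = dat.groupby(['ride_name', 'year']).count().reset_index()

fig = px.bar(
    counts,
    x='year',
    y='datetime',      # after count(), this column holds the number of observations
    color='year',
    facet_col='ride_name',
    facet_col_wrap=4,  # 4 subplots per row
    labels={'datetime': 'number of observations'},
    title='Number of observations per ride, by year',
)

# render as a static jpg so the plot shows up in Gradescope (see the warning in question 1)
fig.show(renderer="jpg")
----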
-==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -Create a plot that shows the association between the average `SPOSTMIN` and `WDWMAXTEMP` for a ride of your choice. Some notes to help you below. - -. Create a new column in `dat` called `day` that is the date. Make sure to use `pd.to_datetime` to convert the date to the correct type. -. Use the `groupby` method to group by both `ride_name` and `day`. Get the average by using the `mean` method on your grouped data. In order to make `ride_name` and `day` columns instead of indices, call the `reset_index` method. Finally, use the `query` method to subset your data to just be data for your given ride. -. Convert the `DATE` column in `meta` to the correct type using `pd.to_datetime`. -. Use the `merge` method to merge your grouped data with the metadata on `day` (from the grouped data) and `DATE` (from `meta`). -. Make the scatterplot. - -Is there an obvious relationship between the two variables for your chosen ride? Did you expect the results? - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ - -++++ - -This is an extremely rich dataset with lots and lots of plotting potential! In addition, there are a lot of interesting questions you could ask about wait times and rides that could actually be useful if you were to visit Disney! - -Create a graphic using a plot we have not yet used from https://plotly.com/python/plotly-express/[this] webpage. Make sure to use proper labels, and make sure the graphic shows some sort of _potentially_ interesting relationship. Write 1-2 sentences about why you decided to create this plot. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ - -++++ - -Ask yourself a question regarding the information in the dataset. For example, maybe you think that certain events from the `meta` dataframe will influence a certain ride. Perhaps you think the time the park opens is relevant to the time of year? Write down the question you would like to answer using a plot. Choose the type of plot you are going to use, and write 1-2 sentences explaining your reasoning. Create the plot. What were the results? Was the plot an effective way to answer your question? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project08.adoc b/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project08.adoc deleted file mode 100644 index 903da22cc..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project08.adoc +++ /dev/null @@ -1,409 +0,0 @@ -= STAT 29000: Project 8 -- Spring 2022 - -**Motivation:** Python is an https://www.geeksforgeeks.org/internal-working-of-python/[interpreted language] (as opposed to a compiled language). 
In a compiled language, you are (mostly) unable to run and evaluate a single instruction at a time. In Python (and R -- also an interpreted language), we can run and evaluate a line of code easily using a https://en.wikipedia.org/wiki/Read-eval-print_loop[repl]. In fact, this is the way you've been using Python to date -- selecting and running pieces of Python code. Other ways to use Python include creating a package (like numpy, pandas, and pytorch), and creating scripts. You can create powerful CLI's (command line interface) tools using Python. In this project, we will explore this in detail and learn how to create scripts that accept options and input and perform tasks. - -**Context:** This is the first (of two) projects where we will learn about creating and using Python scripts. - -**Scope:** Python - -.Learning Objectives -**** -- Write a python script that accepts user inputs and returns something useful. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/coco/unlabeled2017/*.jpg` - -== Questions - -=== Question 1 - -++++ - -++++ - -Up until this point, we have been using Python from within the Jupyter Lab environment. It is pretty straightforward to use -- type in some code into a cell and click run. This is similar to using a REPL (read eval print loop), in that you are essentially running one or more lines of Python code at once. Another way to run Python code is in a script form. A Python script is a `.py` file with Python code written inside of it that performs a series of actions. - -Python scripts are easy to use and run. For example, if you had a script called `my_script.py`, you could run the script by executing the following in a terminal. - -[source,bash] ----- -python3 my_script.py ----- - -This would use the first `python3` interpreter in your `$PATH` environment variable to run the script. - -Read https://realpython.com/python-main-function/[this article] about the `main` function in Python scripts. Also, please read https://realpython.com/run-python-scripts/#using-the-script-filename[this section], paying special attention to the shebang. Lastly, read xref:book:unix:scripts.adoc#shebang[this section]. - -In a Python cell in your notebook, determine which Python interpreter is being used by running the following code. - -[source,python] ----- -import sys -print(sys.executable) ----- - -[NOTE] -==== -If you want to find where Python looks for packages, you can use the following code. - -[source,python] ----- -import sys -print(sys.path) ----- - -Python will look for packages in those locations, in order. -==== - -Your output should read: `/scratch/brown/kamstut/tdm/apps/jupyter/kernels/f2021-s2022/.venv/bin/python`. This is the absolute path to the Python interpreter we use with this course. - -Create a Python script called `question01.py` in your `$HOME` directory with the following content. - -.question01.py -[source,python] ----- -import sys - -def main(): - print(f"{sys.executable}") - -if __name__ == "__main__": - main() ----- - -Open a terminal in Jupyter Lab, and run the following command to give execute permissions to the script. - -[source,bash] ----- -chmod +x $HOME/question01.py ----- - -Finally, in the terminal, execute the script by running the following command. 
- -[source,bash] ----- -python3 $HOME/question01.py ----- - -In addition, execute the same command in a bash cell in your notebook. - -[source,ipython] ----- -%%bash - -python3 $HOME/question01.py ----- - -What is the output? Was this expected? Write 1-2 sentences explaining why the output was expected or not. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- The `question01.py` script. -==== - -=== Question 2 - -++++ - -++++ - -++++ - -++++ - -In the previous question, we ran the script, `question01.py`, by passing it as _input_ to our `python3` interpreter. We learned that, depending on the first `python3` interpreter in our `$PATH` environment variable, the output would differ. There is another way to run Python scripts that don't pass our script as an input. The way to do this is with the shebang. You should have read about the shebang in the linked articles in quesiton (1). - -If a shebang is included in the first line of your Python script, there is no longer a need to pass the script as input. Instead, you can simply execute the script by running `./question01.py`. This tells the terminal to execute `question01.py` in the current directory. If you were to simply type `question01.py` in your terminal, you would get an error saying that the command "question01.py" was not found. The reason is that your shell will look in the directories in your `$PATH` environment variable in order to find commands. Since `question01.py` is not in your `$PATH`, you receive that error. By prepending the `./`, this tells the shell to instead look in my current working directory and execute `question01.py`. - -How does the shell know _how_ to execute the script? The shebang! For example, try executing the following script by running `./question02.py` (or `$HOME/question02.py`) in the terminal. - -.question02.py -[source,python] ----- -#!/usr/bin/python3 - -import sys - -def main(): - print(f"{sys.executable}") - -if __name__ == "__main__": - main() ----- - -[source,bash] ----- -chmod +x $HOME/question02.py ----- - -You'll notice that the output matches the shebang! Now here is the real test, execute the script from within a bash cell in your notebook. - -[source,ipython] ----- -%%bash - -$HOME/question02.py ----- - -Aha! Even though we are in our notebook, the shell respected the shebang and used `/usr/bin/python3` to execute the script. - -[IMPORTANT] -==== -If you were to run `python3 question02.py` from within a bash cell, the output would rever to our course interpreter at `/scratch/brown/kamstut/tdm/apps/jupyter/kernels/f2021-s2022/.venv/bin/python` -- when passing the script to a particular interpreter, the shebang is ignored. -==== - -Okay, in this project, since the focus is on writing Python scripts, it is a good opportunity to have some fun and use some powerful, pre-built models to do fun things. The following code will use an image classification model to identify the content of a photo. 
- -.question02_2.py -[source,python] ----- -#!/usr/bin/python3 - -from transformers import ViTFeatureExtractor, ViTForImageClassification -from PIL import Image -import requests - - -def main(): - - image = Image.open("/depot/datamine/data/coco/unlabeled2017/000000000008.jpg") - - feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224') - model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224') - - inputs = feature_extractor(images=image, return_tensors="pt") - outputs = model(**inputs) - logits = outputs.logits - - predicted_class_idx = logits.argmax(-1).item() - print("Predicted class:", model.config.id2label[predicted_class_idx]) - - -if __name__ == "__main__": - main() ----- - -[source,bash] ----- -chmod +x $HOME/question02_2.py ----- - -In a bash cell, execute the script. - -[source,ipython] ----- -%%bash - -$HOME/question02_2.py ----- - -What happens? Did you expect this? Write 1-2 sentences explaining why you expected the output to be different. Finally, correct the script so that it runs correctly. In another bash cell, run the updated script. - -[IMPORTANT] -==== -Please ignore any red warnings you receive as a part of your output from running the corrected `question02_2.py` script. -==== - -[TIP] -==== -If you want to see whether or not the results are accurate, you can display the image with the following code inside a code cell. - -[source,python] ----- -from IPython import display -display.Image("/depot/datamine/data/coco/unlabeled2017/000000000008.jpg") ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- `question02.py` script. -- `question02_2.py` script. -==== - -=== Question 3 - -++++ - -++++ - -I hope that the previous two questions gave you a pretty solid understanding that if we want packages to be available when running a Python script, we need to use the appropriate shebang or Python interpreter. - -Right now, our script, `question02_2.py` is not very useful. No matter what we do it will continue to load up and analyze the same old image. That is pretty boring. One of the primary things you can do with scripts is read _arguments_ passed to the script and do something. For example, we passed the argument `question01.py` to the `python3` program. In the same way, we could, for example, pass an absolute path to an image to our script and have it print out the output for the given image! Our script would be much more useful. We could do things like: - -[source,bash] ----- -./my_script.py /depot/datamine/data/coco/unlabeled2017/000000000008.jpg -./my_script.py /depot/datamine/data/coco/unlabeled2017/000000000013.jpg ----- - -Copy your `question02_2.py` script to a new script called `question03.py`. Modify `question03.py` to accept a single argument, an absolute path to an image, and use that argument in place of the default `000000000008.jpg` image. Use `sys.argv` to accomplish this. - -Test it out from within a bash cell in your notebook. - -[source,ipython] ----- -%%bash - -$HOME/question03.py /depot/datamine/data/coco/unlabeled2017/000000000008.jpg -$HOME/question03.py /depot/datamine/data/coco/unlabeled2017/000000000013.jpg -$HOME/question03.py ----- - -[IMPORTANT] -==== -If no argument is passed to `question03.py`, use the `000000000008.jpg` image as a default. -==== - -[IMPORTANT] -==== -Make sure to import `sys` so you have access to the `sys.argv` variable. 
- -[source,python] ----- -import sys ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ - -++++ - -Typically, scripts or CLIs (command line interfaces) have a bunch of options. For example, you read about the various options of a tool like `grep` or `awk` by running the following in a terminal. - -[source,bash] ----- -man awk -man grep ----- - -For example, `grep` has the option, `-i` which makes the search case insensitive. - -[source,bash] ----- -grep -i 'ok' something.txt ----- - -In addition, often times options have both a long form or short form. For example `grep -f somefile.txt` is the same as `grep --file=somefile.txt`. - -Create a new script called `question04.py` that accepts a single argument called `--detailed` or `-d`, in either short form or long form. If the flag is present, instead of using the "google/vit-base-patch16-224" model, which outputs 1 of 1000 classes, it will instead use the "microsoft/beit-base-patch16-224-pt22k-ft22k" model (see https://huggingface.co/microsoft/beit-base-patch16-224-pt22k-ft22k[here]) that will output 1 of the 21841 classes. Some examples of how the script should work are below. - -[IMPORTANT] -==== -Make sure to give your script executable permissions. - -[source,bash] ----- -chmod +x $HOME/question04.py ----- -==== - -[source,ipython] ----- -%%bash - -./question04.py /depot/datamine/data/coco/unlabeled2017/000000000008.jpg --detailed -./question04.py /depot/datamine/data/coco/unlabeled2017/000000000008.jpg -d -./question04.py --detailed /depot/datamine/data/coco/unlabeled2017/000000000008.jpg -./question04.py -d /depot/datamine/data/coco/unlabeled2017/000000000008.jpg ----- - -.Output ----- -Predicted class: surfing, surfboarding, surfriding -Predicted class: surfing, surfboarding, surfriding -Predicted class: surfing, surfboarding, surfriding -Predicted class: surfing, surfboarding, surfriding ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -As you can imagine, adding flags and options can blow up the logic and size of your script pretty quickly! Luckily, there is the `argparse` package. This package takes care of parsing and handling program input. https://docs.python.org/3/library/argparse.html[Here] are the official docs and https://docs.python.org/3/howto/argparse.html[here] is a decent tutorial on how to use it. - -Use the `argparse` package to parse your arguments instead of using `sys.argv`. As long as the new version of your script accepts both the long and short forms of the `--detailed` flag, and uses `argparse`, you will receive full credit. However, please do you best to make the script as robust as possible! In addition, you _can_ change the behavior of the arguments as long as the details flags work. - -In bash cells, show at least 3 examples using your new script, `question05.py`, which uses `argparse` instead of `sys.argv`. - -[TIP] -==== -If you want a solid template, you may use the following code to start. You will need to tweak it in order to make it work, however, it is a decent place to start. 
- -[source,python] ----- -import argparse -import pandas as pd - -def main(): - parser = argparse.ArgumentParser() - subparsers = parser.add_subparsers(help="possible commands", dest="command") - some_parser = subparsers.add_parser("something", help="") - some_parser.add_argument("-o", "--output", help="directory to output file(s) to") - - if len(sys.argv) == 1: - parser.print_help() - sys.exit(1) - - args = parser.parse_args() - - if args.command == "something": - something() - -if __name__ == "__main__": - main() ----- - -An accompanying set of tests would be: - -[source,ipython] ----- -%%bash - -./question05.py classify /depot/datamine/data/coco/unlabeled2017/000000000008.jpg -./question05.py classify /depot/datamine/data/coco/unlabeled2017/000000000008.jpg --detailed -./question05.py classify /depot/datamine/data/coco/unlabeled2017/000000000008.jpg -d ----- - -.Output ----- -Predicted class: surfing, surfboarding, surfriding -Predicted class: seashore, coast, seacoast, sea-coast -Predicted class: seashore, coast, seacoast, sea-coast ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project09.adoc b/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project09.adoc deleted file mode 100644 index 0d4c53556..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project09.adoc +++ /dev/null @@ -1,142 +0,0 @@ -= STAT 29000: Project 9 -- Spring 2022 - -**Motivation:** Python is an interpreted language (as opposed to a compiled language). In a compiled language, you are (mostly) unable to run and evaluate a single instruction at a time. In Python (and R — also an interpreted language), we can run and evaluate a line of code easily using a repl. In fact, this is the way you’ve been using Python to date — selecting and running pieces of Python code. Other ways to use Python include creating a package (like numpy, pandas, and pytorch), and creating scripts. You can create powerful CLI’s (command line interface) tools using Python. In this project, we will explore this in detail and learn how to create scripts that accept options and input and perform tasks. - -**Context:** This is the second (of two) projects where we will learn about creating and using Python scripts. - -**Scope:** Python, argparse - -.Learning Objectives -**** -- Write a python script that accepts user inputs and returns something useful. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/*` - -== Questions - -=== Question 1 - -Scripts are _really_ powerful. It is cool to think about baking tons of powerful functionality into a single file, sharing it with others, and enabling them to use what you built. 
When you build a script, CLI (command line interface), or some other Python tool that accepts inputs and options, `argparse` is _really_ the go to package. It hides away lots of the complex logic that can be involved with programs that have lots of inputs or lots of options. We want to encourage you to use and learn more about this package. - -The argparse documentation is a bit rough, and can take some time to navigate, understand, and figure out how to do what you _want_ to do. https://docs.python.org/3/library/argparse.html[This] is the official documentation, and https://docs.python.org/3/howto/argparse.html#id1[this] is the official `argparse` tutorial. - -The template we provided from the previous project is a pretty great starting point. The upside to the template is that it is flexible enough to handle most scenarios for a given a script. The downside is that it may be _more_ than what is needed for a given script, and as a result, involve an unnecessary extra command (for example `something`, if you look at the template. Instead of `./myscript something some_other_input` you _could_ have `./myscript some_other_input`.). The same template can be found below. - -[source,python] ----- -import argparse -import pandas as pd - -def main(): - parser = argparse.ArgumentParser() - subparsers = parser.add_subparsers(help="possible commands", dest="command") - some_parser = subparsers.add_parser("something", help="") - some_parser.add_argument("-o", "--output", help="directory to output file(s) to") - - if len(sys.argv) == 1: - parser.print_help() - sys.exit(1) - - args = parser.parse_args() - - if args.command == "something": - something() - -if __name__ == "__main__": - main() ----- - -At the heart of the template script is the subparsers. This is what is responsible for both the flexibility, and potential to have an extra, not necessary argument. https://docs.python.org/3/library/argparse.html#sub-commands[This] links to the section in the documentation on subparsers. - -Let's imagine that you were tasked to write a script for a media company. This company does the weather, news, and sports (among other imaginary things). In as much detail as you can muster, explain _why_ using subparsers for such a script would be useful. Assume that this script would fulfill special tasks related to all 3 categoriesof media. Explain why or why not it would be easily possible to bake functionality for all 3 categories of media into a single script _without_ using subparsers. All answers showing a strong effort will receive full credit. If you created any experimental scripts to help understand subparsers, please include those in a code cell in your notebook! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -A project based around writing a script is the perfect opportunity to get creative! If you are part of Corporate Partners, you could use this project as a chance to implement something useful for your team! If you aren't, you could write a script to help automate some task that you find you need to repeat a lot. Some ideas of scripts would be: - -- Write a script that scrapes recent sports data about one or more teams and prints the output to the screen, neat and formatted (and maybe even https://github.com/Textualize/rich[colored] if you want). -- Write a script that processes and cleans up a specifically formatted type of data and inserts it into a database. 
-- Write a script that processes images or text and outputs some sort of summary or analysis (you could use the `transformers` package we used in the previous project for this! Just make sure the script is unique and not like the script from the previous project). -- Write a script that accepts an image and prints out the ascii art version of the image. -- Write a script that accepts an image and determines whether or not your face is in the image. -- Write a script that solves a sudoku or other puzzle (wordle?). -- Write a script that accepts images and creates a collage or mosaic of the images. - -All of the examples above have the capability of having any number of flags or options that could slightly change the programs behavior. For example, the ascii art program could have an option that randomly colors the characters, or the collage script may have an option that makes the images black and white, etc. - -Write a script, in two stages. The script could be more complicated or less complicated than any/all of the provided examples -- we just want a strong effort. - -[CRITICAL] -==== -- Make sure to document both stages using a combination of code and markdown cells in your jupyter notebook. The readers should be able to read about what your script does, why, and then proceed to look at examples to see it in action. -- Make sure to include your (final) script (as a `.py` file) in your submission. -- Make sure you run your script (with all sorts of combinations of flags and options that shows off the capabilities) from code cells in your notebook. The reader should be able to see and understand what your code does based on the output from the code cells. Since your script may use data we do not have access to, the script does _not_ need to work for anyone, but the full capabilities should be demonstrated with the output from code cells. -==== - -The first stage should involve building up the first version of your script. As mentioned before, your script could do _anything_. We only ask that your script meets the following criteria: - -. The script must have at least 1 optional flag, that, when present, indicates that the script should be run in a different way. -. The script must have `help` text for _every_ argument or flag in the script. -. The script must have at least 1 positional, required argument. -. The script must have at least 1 optional, not required argument. (Note: This is _different_ than the _flag_. This argument should accept a value and not just use defaults, like a flag does). -. The script must accept both a long or short version of _all_ optional arguments. For example `-v` and `--verbose`. - -The second stage is to _enhance_ your first version of your script. This could be via arguments and options that are included in `argparse`, or by using another package like this https://github.com/Textualize/rich[rich package]. This enhancement could be _anything_, the only requirement is that you explain _what_ the enhancement is and _why_ it is an enhancement. - -[TIP] -==== -Remember, you should use our Python environment's shebang so that all of our packages are ready to use. - -`#!/scratch/brown/kamstut/tdm/apps/jupyter/kernels/f2021-s2022/.venv/bin/python` -==== - -[TIP] -==== -Remember, you will probably need to give your script execute permissions before you can run it. If your script is called `myscript.py`, and it lives in your `$HOME` directory, you could give execute permissions as follows. 
- -[source,bash] ----- -chmod +x $HOME/myscript.py ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 (optional, 0 pts) - -Swap scripts with a friend and run your friends script with a variety of flags and options. Suggest 1 or more improvements your friend could make to the script. Trade back and implement the suggestion(s). - -[IMPORTANT] -==== -We'd love for you to do this! Please make sure to put your friends name at the top of your solution to this question, so we can know you collaborated on this problem. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project10.adoc b/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project10.adoc deleted file mode 100644 index a4bcf7ede..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project10.adoc +++ /dev/null @@ -1,264 +0,0 @@ -= STAT 29000: Project 10 -- Spring 2022 - -**Motivation:** The use of a suite of packages referred to as the `tidyverse` is popular with many R users. It is apparent just by looking at `tidyverse` R code, that it varies greatly in style from typical R code. It is useful to gain some familiarity with this collection of packages, in case you run into a situation where these packages are needed -- you may even find that you enjoy using them! - -**Context:** We've covered a lot of ground so far this semester, and almost completely using Python. In this next series of projects we are going to switch back to R with a strong focus on the `tidyverse` (including `ggplot`) and data wrangling tasks. - -**Scope:** R, tidyverse, ggplot - -.Learning Objectives -**** -- Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems. -- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows. -- Group data and calculate aggregated statistics using group_by, mutate, summarize, and transform functions. -- Demonstrate the ability to create basic graphs with default settings, in `ggplot`. -- Demonstrate the ability to modify axes labels and titles. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -The "tidyverse" consists of a variety of packages, including, but not limited to: `ggplot2`, `dplyr`, `tidyr`, `readr`, `magrittr`, `purrr`, `tibble`, `stringr`, and `lubridate`. - -One of the underlying premises of the tidyverse is getting the data to be tidy. You can read a lot more about this in Hadley Wickham's book, https://r4ds.had.co.nz/[R for Data Science]. - -There is an excellent graphic https://r4ds.had.co.nz/introduction.html[here] that illustrates a general workflow for data science projects: - -. Import -. Tidy -. Iterate on, to gain understanding: -.. Transform -.. Visualize -.. Model -. 
Communicate - -This is a good general outline of how a project could be organized, but depending on the project or company, this could vary greatly and change as the goals of a project change. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/beer/beers.csv` -- `/depot/datamine/data/beer/reviews_sample.csv` - -== Questions - -=== Question 1 - -++++ - -++++ - -++++ - -++++ - -The first step in our workflow is to read the data. - -Read the datasets `beers.csv` and `reviews_sample.csv` using the `read_csv` function from `tidyverse` into tibbles called `beers` and `reviews`, respectively. - -[NOTE] -==== -"Tibble" are essentially the `tidyverse` equivalent to data.frames. They function slightly differently, but are so similar (today) that we won't go into detail until we need to. -==== - -In projects 10 and 11, we want to analyze and compare different beers. Note, that in `reviews` each row corresponds to a review by a certain user on a certain date. As reviews likely vary by individuals, we may want to summarize our `reviews` tibble. - -To do that, let's start by deciding how we are going to summarize the reviews. Start by picking one of the variables (columns) from the `reviews` dataset to be our "beer goodness indicator". For example, maybe you believe that the `taste` is important in beverages (seems reasonable). - -Now, determine a summary statistic that we will use to compare beers based on your beer goodnees indicator variable. Examples include `mean`, `median`, `std`, `max`, `min`, etc. Write 1-2 sentences describing why you chose the statistic you chose for your variable(s). You can use annectodal evidence (some reasoning why you think that summary statistics would be appropriate/useful here), or look at the distribution based on plots, or summary statistics to pick your preferred summary statistics for this case. - -[NOTE] -==== -If you are making a plot, please be sure to use the `ggplot` package. -==== - -[NOTE] -==== -If you wanted to have some fun, you could decide to combine different variables into a single one. For instance, maybe you want to take into consideration both `taste` and `smell`, but you want a smaller weight for `smell`. Then, you create a plot of `taste + .5*smell`, and you notice the data is skewed, so you decide to go with the `median`, namely, with `median(taste+.5*smell)`. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences describing what is your `beer_goodness_indicator` (variable and summary statistics), and why. -==== - -=== Question 2 - -++++ - -++++ - -Now that we have decided how to compare beers, let's create a new variable called `beer_goodness_indicator` in the reviews dataset. For each `beer_id`, https://dplyr.tidyverse.org/reference/summarise.html?q=summarize#ref-usage[`summarize`] the `reviews` data to get a single `beer_goodness_indicator` based on your answer from question 1. Call this summarized dataset `reviews_summary`. - -[TIP] -==== -`reviews_summary` should be 41822x2 (rows x columns). -==== - -[TIP] -==== -`summarize` is good when you want to keep your data grouped -- it will result in a data.frame with a different number of rows and columns. `mutate` is very similar except it will maintain the original columns, add a new column where the grouped/summarized values are repeated based on the variable the data was grouped by. This may be confusing, but run the following two examples and this will be made clear. 
- -[source,r] ----- -mtcars %>% - group_by(cyl) %>% - summarize(mpg_mean = mean(mpg)) ----- - -[source,r] ----- -mtcars %>% - group_by(cyl) %>% - mutate(mpg_mean = mean(mpg)) ----- -==== - -[TIP] -==== -You may be wondering what the heck the `%>%` part of the code from the previous tip is. These are pipes from the `magrittr` package. This is used to together functions. For example, `group_by` and `summarize` are two functions that can be chained together. You are passing the output from the previous function as the input to the next function. You'll find this is a very clean and convenient way to express a lot of very common data wrangling tasks! - -It could be as simple as getting the `head` of a dataframe. - -[source,r] ----- -head(mtcars) ----- - -You could instead use pipes: - -[source,r] ----- -mtcars %>% - head() ----- - -Why? This second version is arguably easier to read, and it is easier to edit. You could easily want to add a column to the dataframe first. - -[source,r] ----- -mtcars %>% - mutate(my_new_column = mean(cyl)) %>% - head() ----- - -Now, if we had the non-piped version it would be something like: - -[source,r] ----- -mtcars <- mtcars %>% - mutate(my_new_column = mean(cyl)) - -head(mtcars) ----- - -Or an even better example would be: - -[source,r] ----- -mtcars %>% - round() %>% - head() ----- - -Versus: - -[source,r] ----- -head(round(mtcars)) ----- -==== - -[TIP] -==== -`mutate` in particular is extremely useful. Try to perform the same operation using `pandas` and you will quickly realize how _nice_ some of the `tidyverse` functionality is. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Head of `reviews_summary` dataset. -==== - -=== Question 3 - -++++ - -++++ - -Let's combine our `beers` dataset with `reviews_summary` into a new dataset called `beers_reviews` that contains only beers that appears in *both* datasets. Use the appropriate https://dplyr.tidyverse.org/articles/two-table.html?q=left_join#types-of-join[`join`] function from `tidyverse` (`inner_join`, `left_join`, `right_join`, or `full_join`) to solve this problem. Since you saw some examples using pipes in the previous question (`%>%`) -- use pipes from here on out. - -What are the dimensions of the resulting `beers_reviews` dataset? How many beers did _not_ appear in both datasets? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Result of running `dim(beers_reviews)` -==== - -=== Question 4 - -++++ - -++++ - -Ok, now we have the dataset ready to analyze! For beers that are available during the entire year (see `availability`), is there a difference between `retired` and not retired beers in terms of `beer_goodness_indicator`? - -1. Start by subsetting the dataset using https://dplyr.tidyverse.org/reference/filter.html[`filter`]. -2. Create some data-driven method to answer this question. You can make a plot, get summary statistics (average `beer_goodness_indicator`, table comparing # of beers with `beer_goodness_indicator` > 4 for each category, etc). You can use multiple methods to answer this question! Have fun! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences answering the comparing `retired` and not retired beers in terms of `beer_goodness_indicator` based on your chosen method(s). Did the results surprise you? -- 1-2 sentences explaining what data-driven method(s) you decided to use and why. 
-==== - -=== Question 5 - -++++ - -++++ - -Let's compare different styles of beer based on our `beer_goodness_indicator` average. Create a Cleveland dotplot (using `ggplot`) comparing the average `beer_goodness_indicator` for each style in `beers_reviews`. Make sure to use the `tidyverse` functions to answer this question and to use `ggplot`. - -[TIP] -==== -The code below creates a Cleveland dotplot comparing `Sepal.Length` variation per `Species` using the `iris` dataset. - -[source,r] ----- -iris %>% - group_by(Species) %>% - summarize(petal_length_var = sd(Petal.Length)) %>% - arrange(desc(petal_length_var)) %>% -ggplot() + - geom_point(aes(x = Species, y = petal_length_var)) + - coord_flip() + - theme_classic() + - labs(x = "Petal length variation") ----- -==== - -[TIP] -==== -You can use the function https://dplyr.tidyverse.org/reference/top_n.html?q=top_n#null[`top_n(x)`] in combination with https://dplyr.tidyverse.org/articles/grouping.html?q=arrange#arrange[`arrange`] to subset to show only the top x styles. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project11.adoc b/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project11.adoc deleted file mode 100644 index a009cfbf8..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project11.adoc +++ /dev/null @@ -1,190 +0,0 @@ -= STAT 29000: Project 11 -- Spring 2022 - -**Motivation:** Data wrangling is the process of gathering, cleaning, structuring, and transforming data. Data wrangling is a big part in any data driven project, and sometimes can take a great deal of time. `tidyverse` is a great, but opinionated, suite of integrated packages to wrangle, tidy and visualize data. It is useful to gain some familiarity with this collection of packages, in case you run into a situation where these packages are needed -- you may even find that you enjoy using them! - -**Context:** We have covered a few topics on the `tidyverse` packages, but there is a lot more to learn! We will continue our strong focus on the tidyverse (including ggplot) and data wrangling tasks. This is the second in a series of 5 projects focused around using `tidyverse` packages to solve data-driven problems. - -**Scope:** R, tidyverse, ggplot - -.Learning Objectives -**** -- Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems. -- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows. -- Group data and calculate aggregated statistics using group_by, mutate, summarize, and transform functions. -- Demonstrate the ability to create basic graphs with default settings, in ggplot. -- Demonstrate the ability to modify axes labels and titles. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/beer/beers.csv` -- `/depot/datamine/data/beer/reviews_sample.csv` - -== Questions - - -=== Question 1 - -++++ - -++++ - -Let's pick up where we left in the previous project. Copy and paste your commands from questions 1 to 3 that result in our `beers_reviews` dataset. - -Using the pipelines (remember, the `%>%`), combine the necessary parts of questions 2 and 3, removing the need to have an intermediate `reviews_summary` dataset. This is a great way to practice and get a better understanding of `tidyverse`. - -Your code should read the datasets, summarize the reviews data similarly to what you did in question 2, and combine the summarized dataset with the `beers` dataset. This should all be accomplished from a single chunk of "piped-together" code. - -[TIP] -==== -Feel free to remove the `reviews` dataset after we have the `beers_reviews` dataset. - -[source,r] ----- -rm(reviews) ----- -==== - -[TIP] -==== -If you want to update how you calculated your `beer_goodness_indicator` from the previous project, this would be a great time to do so! -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -Are there any differences in terms of `abv` between beers that are available in specific seasons? - -[NOTE] -==== -ABV refers to the alcohol by volume of a beer. The higher the ABV, the more alcohol is in the beer. -==== - -1. Filter the `beers_reviews` dataset to contain beers available only in a specific season (`Fall`, `Winter`, `Spring`, `Summer`). -+ -[TIP] -==== -Only click below if you are stuck! - -https://dplyr.tidyverse.org/reference/filter.html[This] function will help you do this operation. -==== -+ -2. Make a side-by-side boxplot comparing `abv` for each season `availability`. -+ -[TIP] -==== -Only click below if you are stuck! - -https://ggplot2.tidyverse.org/reference/geom_boxplot.html[This] function will help you do this operation. -==== -+ -3. Make sure to use the `labs` function to have nice x-axis label and y-axis label. -+ -[TIP] -==== -https://ggplot2.tidyverse.org/reference/labs.html?q=labs#null[This] is more information on `labs`. -==== - -Use pipelines, resulting in a single chunk of "piped-together" code. - -[TIP] -==== -Use the `fill` argument to https://ggplot2.tidyverse.org/reference/geom_boxplot.html[this] function to color your boxplots differently for each season. -==== - -Write 1-2 sentences comparing the beers in terms of `abv` between the specific seasons. Are the results surprising or did you expect them? - -[TIP] -==== -The `aes` part of many `ggplot` plots may be confusing at first. In a nutshell, `aes` is used to match x-axis and y-axis values to columns of data in the given dataset. You should read https://ggplot2.tidyverse.org/reference/aes.html[this] documentation and the examples carefully to better understand. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences comparing the beers in terms of `abv` between the specific seasons. Are the results surprising or did you expect them? -==== - -=== Question 3 - -++++ - -++++ - -Modify your code from question 2 to: - -1. Create a new variable `is_good` that is 1 or TRUE if `beer_goodness_indicator` is greater than 3.5, and 0 or FALSE otherwise. -2. _Facet_ your boxplot based on `is_good`. 
The resulting graphic should make it easy to compare the "good" vs "bad" beers for each season. -+ -[TIP] -==== -https://thedatamine.github.io/the-examples-book/r.html#r-facet_grid[`facet_grid`] and https://thedatamine.github.io/the-examples-book/r.html#r-facet_wrap[`facet_wrap`] are two other functions that can be a bit confusing at first. With that being said, they are incredible powerful and make creating really impressive graphics very straightforward. -==== - -[IMPORTANT] -==== -Make sure to use piping `%>%` as well as layers (`+`) to create your final `ggplot` plot, using a single chunk of piped/layered code. -==== - -How do beers differ in terms of ABV and being considered good or not (based on our definition) for the different seasons? Write 1-2 sentences describing what you see based on the plots. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences answering the question. -==== - -=== Question 4 - -Modify your code from question 3 to answer the question based on summary statistics instead of graphical displays. - -Make sure you compare the ABV per season `availability` and `is_good` using `mean`, `median` and `sd`. Your final dataframe should have 8 rows and the following columns: `is_good`, `availability`, `mean_abv`, `median_abv`, `std_abv`. - -[TIP] -==== -The following function will be useful for this question: https://dplyr.tidyverse.org/reference/filter.html[`filter`], https://dplyr.tidyverse.org/reference/mutate.html[`mutate`], https://dplyr.tidyverse.org/reference/group_by.html[`group_by`], https://dplyr.tidyverse.org/reference/summarise.html[`summarize`] (within summarize: `mean`, `median`, `sd`). -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -In this question, we want to make comparison in terms of `ABV` and `beer_goodness_indicator` for US states. - -Feel free to use whichever data-driven method you desire to answer this question! You can take summary statistics, make a variety of plots, and even filter to compare specific US states -- you can even create new columns combining states (based on region, political affiliation, etc). - -Write a question related to US states, ABV and our "beer_goodness_indicator". Use your data-driven method(s) to answer it (if only anecdotally). - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Write 1-2 sentences explaining your question and data-driven method(s). -- Write 1-2 sentences answering your question. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project12.adoc b/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project12.adoc deleted file mode 100644 index 6f3748593..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project12.adoc +++ /dev/null @@ -1,185 +0,0 @@ -= STAT 29000: Project 12 -- Spring 2022 - -**Motivation:** As we mentioned before, data wrangling is a big part in any data driven project. 
https://www.amazon.com/Exploratory-Data-Mining-Cleaning/dp/0471268518["Data Scientists spend up to 80% of the time on data cleaning and 20 percent of their time on actual data analysis."] Therefore, it is worth spending some time mastering how to best tidy up our data.

**Context:** We are continuing to practice using various `tidyverse` packages, in order to wrangle data.

**Scope:** R, tidyverse, ggplot

.Learning Objectives
****
- Use `mutate`, `pivot`, `unite`, `filter`, and `arrange` to wrangle data and solve data-driven problems.
- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows.
- Group data and calculate aggregated statistics using `group_by`, `mutate`, `summarize`, and `transform` functions.
- Demonstrate the ability to create basic graphs with default settings, in `ggplot`.
- Demonstrate the ability to modify axes labels and titles.
****

Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].

== Dataset(s)

The following questions will use the following dataset(s):

- `/anvil/projects/tdm/data/consumer_complaints/complaints.csv`

== Questions

=== Question 1

As usual, we will start by loading the `tidyverse` package and reading the processed data `complaints.csv` into a tibble called `dat` (using the `read_csv` function).

The first step in many projects is to define the problem statement and goal. That is, to understand why the project is important and what our desired deliverables are. For this project and the next, we will assume that we want to improve consumer satisfaction, and to achieve that we will provide consumers with data-driven tips and plots to make informed decisions.

What is the type of data in the `date_sent_to_company` and `date_received` columns? Do you think that these columns are in a good format to calculate the wait time between receiving the complaint and sending it to the company? No need to overthink anything -- from your perspective, how simple or complicated would it be to calculate the number of days between when a complaint was received and when it was sent to the company, if the data remained in its current format?

[TIP]
====
The `glimpse` function is a good way to get a sample of many columns of data at once. Although it may be more difficult to read at first (than, say, `head`), since it lists a single row for each column, it is better when you have many columns in a dataset.

Also, try to keep using the pipes (`%>%`) for the entire project.
====

.Items to submit
====
- Code used to solve this problem.
- Output from running the code.
- 1 sentence stating the data type of the `date_sent_to_company` and `date_received` columns.
- 1-2 sentences commenting on whether you think it would be easy to calculate the wait time between receiving and sending the complaint to the company if the data type remained as is.
====

=== Question 2

The tidyverse has a few "data type"-specific packages. For example, the package `lubridate` is fantastic when dealing with dates, and the package `stringr` was made to help when dealing with text (`chr`). Although `lubridate` is a part of the `tidyverse` universe, it does not get loaded when we run `library(tidyverse)`.

Begin this question by loading the `lubridate` package. Use the appropriate function to convert the columns referring to dates to a date format.
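[TIP]
====
If it helps to see the general pattern before touching the real data, here is a minimal sketch using a small, made-up tibble. The column names mirror the real ones, but the values and the month/day/year ordering are only assumptions -- check `glimpse(dat)` and pick the `lubridate` parser that matches the order in which the date components actually appear.

[source,r]
----
library(tidyverse)
library(lubridate)

# toy data: dates stored as character strings in month/day/year order
toy <- tibble(
    date_received = c("01/03/2022", "01/10/2022"),
    date_sent_to_company = c("01/05/2022", "01/11/2022")
)

# convert every column whose name starts with "date" using the
# parser that matches that ordering (here, mdy)
toy %>%
    mutate_at(vars(starts_with("date")), mdy) %>%
    glimpse()
----
====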
- -[TIP] -==== -There are lots of really great (and "official") cheat sheets for `tidyverse` packages. I find these immensely useful and almost always pull the cheat sheet up when I use a `tidyverse` package. - -https://www.rstudio.com/resources/cheatsheets/[Here] are the cheat sheets. - -https://raw.githubusercontent.com/rstudio/cheatsheets/main/lubridate.pdf[Here] is the `lubridate` cheat sheet, which I think is particularly good. -==== - -Try to solve this question using the `mutate_at` function. - -[IMPORTANT] -==== -You will notice a pattern within the `tidyverse` of functions named `*_at`, `*_if`, and `*_all`. For example, for the `mutate` and `summarize` functions there are versions like `mutate_at` or `summarize_if`, etc. These variants of the functions are useful for applying the functions to relevant columns without having to specify individual columns by name. -==== - -[TIP] -==== -Take a look at the functions: `ydm`, `ymd`, `dym`, `mdy`, `myd`, `dmy`. -==== - -[TIP] -==== -If you are using `mutate_at`, run the followings command and see what happens. - -[source,r] ----- -dat %>% - select(contains('product')) %>% - head() - -dat %>% - summarize_at(vars(contains('product')), function(x) length(unique(x))) - -dat %>% - group_by(product) %>% - summarize_if(is.numeric, mean) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Result from running `glimpse(dat)`. -==== - -=== Question 3 - -Add a new column called `wait_time` that is the difference between `date_sent_to_company` and `date_received` in days. - -[TIP] -==== -You may want to use the argument `units` in the `difftime` function. -==== - -[TIP] -==== -Remember that `mutate` is the function you want to use to add a new column based on other columns. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Frequently, we need to perform many data wrangling tasks: changing data types, performing operations more easily, summarizing the data, and pivoting to be enable plotting certain types of plots. - -So far, we spent most of the project doing just that! Now that we have the `wait_time` in a format that allows us to plot it, let's start creating some tips and plots to increase costumer satisfaction. One thing we may want to do is give the user information on the `wait_time` based on how the complaint was submitted (`submitted_via`). Compare `wait_time` values by how they are submitted. - -[NOTE] -==== -Keep in mind that we want to present this information in a way that would be helpful to costumers. For example, if you summarized the data, you could present the information as a tip and include the summarized `wait_time` values for the fastest and slowest methods. If you are making a plot and the plot has tons of outliers, maybe we want to consider cutting our axis (or filtering) the data to include just the certain values. -==== - -Be sure to explain your reasoning for each step of your analysis. If you are summarizing, why did you pick this method, and why are you summarizing the way you are (for example, are you using the average time, the median time, the maximum time, the `mean(wait_time) + 3*std(wait_time)`)? You may also want to create 3 categories of `wait_time` (small, medium, high) and do a `table` between the categorical wait time and submission types. Why are you presenting the information the way you are? 
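[TIP]
====
As one purely illustrative sketch (not the required approach -- the cutoffs of 5 and 15 days are arbitrary assumptions you should replace with values that make sense for your data), bucketing the wait time and cross-tabulating it against the submission channel could look something like this:

[source,r]
----
library(tidyverse)

dat %>%
    mutate(
        wait_days = as.numeric(wait_time),  # works whether wait_time is numeric or a difftime
        wait_category = case_when(
            wait_days <= 5  ~ "small",
            wait_days <= 15 ~ "medium",
            TRUE            ~ "high"
        )
    ) %>%
    count(submitted_via, wait_category) %>%
    pivot_wider(names_from = wait_category, values_from = n)
----
====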
[NOTE]
====
Figuring out how to present the information to help someone make a decision is an important step in any project! You may very well be presenting to someone who is not as familiar with data science/statistics/computer science as you are.
====

[TIP]
====
If you are creating a categorical wait time, take a look at the https://dplyr.tidyverse.org/reference/case_when.html[`case_when`] function.
====

[TIP]
====
One example could be:

----
The plot below shows the average time it takes for the company to receive your complaint after you sent it based on _how_ you sent it. Note that, on average, it takes XX days to get a response if you submitted via YY. Alternatively, it takes, on average, twice as long to receive a response if you submit a complaint via ZZ. Be sure to keep this in mind when submitting a complaint.
----
====

.Items to submit
====
- Code used to solve this problem.
- Output from running the code.
- 1-2 sentences explaining your reasoning for how you presented the information.
- Information to help the customer make a decision (plot, tip, etc.).
====

=== Question 5

Note that we have a column called `timely_response` in our `dat`. It may or may not (in reality) _be_ related to `wait_time`; however, we would expect it to be. What would you expect to see? Compare `wait_time` to `timely_response` using any technique you'd like. You can use the same idea/technique from question 4, or you can pick something else entirely different. It is completely up to you!

Would this information be relevant to include in a tip or dashboard for a customer to make their decision? Why or why not? Would you combine this information with the `wait_time` information? If so, how?

Sometimes there are many ways to present similar pieces of information, and we must decide what we believe makes the most sense, and what will be most helpful when making a decision.

.Items to submit
====
- Code used to solve this problem.
- Output from running the code.
- 1-2 sentences comparing `wait_time` for timely and non-timely responses.
- 1-2 sentences explaining whether you would include this information for customers, and why or why not. If so, how would you include it?
====

[WARNING]
====
_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted was what you _actually_ submitted.

In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
====
diff --git a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project13.adoc b/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project13.adoc
deleted file mode 100644
index 7c885ff19..000000000
--- a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project13.adoc
+++ /dev/null
@@ -1,146 +0,0 @@
= STAT 29000: Project 13 -- Spring 2022

**Motivation:** Data wrangling tasks can vary between projects. Examples include joining multiple data sources, removing data that is irrelevant to the project, handling outliers, etc. Although we've practiced some of these skills, it is always worth spending some extra time to master tidying up our data.

**Context:** We will continue to gain familiarity with the `tidyverse` suite of packages (including `ggplot`), and data wrangling tasks.

**Scope:** R, tidyverse, ggplot

.Learning Objectives
****
- Use `mutate`, `pivot`, `unite`, `filter`, and `arrange` to wrangle data and solve data-driven problems.
- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows.
- Group data and calculate aggregated statistics using `group_by`, `mutate`, `summarize`, and `transform` functions.
- Demonstrate the ability to create basic graphs with default settings, in `ggplot`.
- Demonstrate the ability to modify axes labels and titles.
****

Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].

== Dataset(s)

The following questions will use the following dataset(s):

- `/anvil/projects/tdm/data/consumer_complaints/complaints.csv`

== Questions

=== Question 1

++++

++++

Just like in Project 12, we will start by loading the `tidyverse` package and reading (using the `read_csv` function) the processed data `complaints.csv` into a tibble called `dat`. Make sure to also load the `lubridate` package.

In project 12, we stated that we want to improve consumer satisfaction, and to achieve that we will provide consumers with data-driven tips and plots to make informed decisions.

We started by providing some information on wait time, `timely_response`, and the association between wait time and how the complaint was submitted (`submitted_via`).

Let's continue our exploration of this dataset to provide valuable information to clients. Create a new column called `day_sent_to_company` that contains the day of the week the complaint was sent to the company (`date_sent_to_company`). To create the new column, use some of your code from project 12 that converts `date_sent_to_company` to the correct format, pipes (`%>%`), and the appropriate function from `lubridate`.

[NOTE]
====
Some students asked about whether or not you _need_ to use the pipes (`%>%`). The answer is no! Of course, you are free to use them if you'd like. I _think_ that with a little practice, it will become more natural. "Tidyverse code" tends to look a lot like:

[source,r]
----
dat %>%
    filter(...) %>%
    mutate(...) %>%
    summarize(...) %>%
    ggplot() +
    geom_point(...) +
    ...
----

Some people like it, some don't. You can draw your own conclusions, but I'd give all common methods a shot to see what you prefer.
====

Also, try to keep using the pipes (`%>%`) for the entire project (again, you don't _need_ to, but we'd encourage you to try). Your code for question one should look something like this:

[source,r]
----
dat <- dat %>%
    insert_correct_function_here(
        insert_code_to_change_weekday_complaint_sent_to_proper_format_here,
        insert_code_to_get_day_of_the_week_here
    )
----

[TIP]
====
You may want to use the argument `label` from the `wday` function.
====

.Items to submit
====
- Code used to solve this problem.
- Output from running the code.
====

=== Question 2

++++

++++

Before we continue with all of the fun, let's do some sanity checks on the column we just created. Sanity checks are an important part of any data science project, and should be performed regularly.

Use your critical thinking to perform at least two sanity checks on the new column `day_sent_to_company`. The idea is to take a quick look at the data from this column and check if it seems to make sense to us.
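[TIP]
====
These are only hypothetical examples of what a quick sanity check could look like (they assume the `day_sent_to_company` column from question 1 and the date column it was derived from) -- you are free to use entirely different checks:

[source,r]
----
# check 1: the new column should contain only the 7 weekday labels,
# and the counts give a quick feel for whether anything looks off
dat %>%
    count(day_sent_to_company)

# check 2: spot-check a handful of rows by hand -- does each weekday
# label actually match the date it was derived from?
dat %>%
    select(date_sent_to_company, day_sent_to_company) %>%
    head(10)
----
====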
Sometimes we know the exact values we should get and that helps be even more certain. Sometimes those sanity checks are not as foolproof, and are just ways to get a feel for the data and make sure nothing is looking weird right away. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences explaining what you used to perform sanity checks, and what are your results. Do you feel comfortable moving forward with this new column? -==== - -=== Question 3 - -++++ - -++++ - -Using your code from questions 1 and 2, create another new column called `day_received` that is the week day the complaint was received. Use sanity checks to double check that everything seems to be in order. - -Let's use these new columns and make some additional recommendations to our consumers! Using at least one of the columns `day_received` and `day_sent_to_company` with the rest of the data to see whether the consumer disputed the result (`consumer_disputed`), create a tip or a plot to help consumer make decisions. - -[NOTE] -==== -Note that the column `consumer_disputed` is a character column, so make sure you take that into consideration. Depending on how you want to summmarize and/or present the information you may need to modify this format, or use a different function to get that information. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Recommendation for consumer in form of a chart with a legend and/or tip. -==== - -=== Question 4 - -++++ - -++++ - -Looking at the columns we have in the dataset, come up with a question whose answer can be used to help consumers make decisions. It is ok if the answer to your question doesn't provide the most insightful information -- for instance, finding out two variables are not correlated can still be valuable information! - -Use your skills to answer the question. Transform your answer to a "tip" with an accompanying plot. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences with your question. -- Answer to your question. -- Recommendation to consumer via tip and plot. -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project14.adoc b/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project14.adoc deleted file mode 100644 index 4a4ba25e3..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-project14.adoc +++ /dev/null @@ -1,161 +0,0 @@ -= STAT 29000: Project 14 -- Spring 2022 - -**Motivation:** Rearranging data to and from "long" and "wide" formats _sounds_ like a difficult task, however, `tidyverse` has a variety of function that make it easy. - -**Context:** This is the last project for the course. This project has a focus on how data can change when grouped differently, and using the `pivot` functions. - -**Scope:** R, tidyverse, ggplot - -.Learning Objectives -**** -- Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems. 
-- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows. -- Group data and calculate aggregated statistics using group_by, mutate, summarize, and transform functions. -- Demonstrate the ability to create basic graphs with default settings, in ggplot. -- Demonstrate the ability to modify axes labels and titles. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/death_records/DeathRecords.csv` - -== Questions - -=== Question 1 - -Calculate the average age of death for each of the `MaritalStatus` values and create a `barplot` using `ggplot` and `geom_col`. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Now, let's further group our data by `Sex` to see how the patterns change (if at all). Create a side-by-side bar plot where `Sex` is shown for each of the 5 `MaritalStatus` values. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -In the previous question, before you piped the data into `ggplot` functions, you likely used `group_by` and `summarize`. Take, for example, the following. - -[source,r] ----- -dat %>% - group_by(MaritalStatus, Sex) %>% - summarize(age_of_death=mean(Age)) ----- - -.output ----- -MaritalStatus Sex age_of_death - -D F 70.34766 -D M 65.60564 -M F 69.81002 -M M 73.05787 -S F 56.83075 -S M 49.12891 -U F 80.80274 -U M 80.27476 -W F 85.69817 -W M 83.98783 ----- - -Is this data "long" or "wide"? - -There are multiple ways we could make this data "wider". Let's say, for example, we want to rearrange the data so that we have the `MaritalStatus` column, a `M` column, and `F` column. The `M` column contains the average age of death for males and the `F` column the same for females. While this may sound complicated to do, `pivot_wider` makes this very easy. - -Use `pivot_wider` to rearrange the data as described. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Create a ggplot plot for each month. Each plot should be a barplot with the `as.factor(DayOfWeekOfDeath)` on the x-axis and the count on the y-axis. The code below provides some structure to help get you started. - -[source,r] ----- -g <- list() # to save plots to -for (i in 1:12) { - g[[i]] <- dat %>% - filter(...) %>% - ggplot() + - geom_bar(...) -} - -library(patchwork) -library(repr) - -# change plot size to 12 by 12 -options(repr.plot.width=12, repr.plot.height=12) - -# use patchwork to display all plots in a grid -# https://cran.r-project.org/web/packages/patchwork/vignettes/patchwork.html ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Question 4 is a bit tedious. `tidyverse` provides a _much_ more ergonomic way to create plots like this. Use https://thedatamine.github.io/the-examples-book/r.html#r-facet_wrap[`facet_wrap`] to create the same plot. - -[TIP] -==== -You do _not_ need to use a loop to solve this problem anymore. In face, you only need to add 1 more line of code to this part. - -[source,r] ----- -dat %>% - filter(....) %>% - ggplot() + - geom_bar(...) 
+ - # new stuff here ----- -==== - -Are there any patterns in the data that you find interesting? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 6 - -It has been a fun year. We hope that you learned something new! - -- Write down 3 (or more) of your least favorite topics and/or projects from this past year (for STAT 29000). -- Write down 3 (or more) of your favorite projects and/or topics you wish you were able to learn _more_ about. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-projects.adoc b/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-projects.adoc deleted file mode 100644 index de2ff0a0e..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/29000/29000-s2022-projects.adoc +++ /dev/null @@ -1,41 +0,0 @@ -= STAT 29000 - -== Project links - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$29000-s2022-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -=== Sign up - -https://piazza.com/purdue/fall2021/stat29000[https://piazza.com/purdue/fall2021/stat29000] - -=== Link - -https://piazza.com/purdue/fall2021/stat29000/home[https://piazza.com/purdue/fall2021/stat29000/home] - -== Syllabus - -See xref:spring2022/logistics/s2022-syllabus.adoc[here]. diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project01.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project01.adoc deleted file mode 100644 index dce16c1ec..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project01.adoc +++ /dev/null @@ -1,132 +0,0 @@ -= STAT 39000: Project 1 -- Spring 2022 - -**Motivation:** Welcome back! This semester _should_ be a bit more straightforward than last semester in many ways. 
In the first project back, we will do a bit of UNIX review, a bit of Python review, and I'll ask you to learn and write about some terminology. - -**Context:** This is the first project of the semester! We will be taking it easy and _slowly_ getting back to it. - -**Scope:** UNIX, Python - -.Learning Objectives -**** -- Differentiate between concurrency and parallelism at a high level. -- Differentiate between synchronous and asynchronous. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/` - -== Questions - -=== Question 1 - -++++ - -++++ - -Google the difference between synchronous and asynchronous -- there is a _lot_ of information online about this. - -Explain what the following tasks are (in day-to-day usage) and why: asynchronous, or synchronous. - -- Communicating via email. -- Watching a live lecture. -- Watching a lecture that is recorded. - -[WARNING] -==== -Please review our updated xref:book:projects:submissions.adoc[submission guidelines] before submitting your project. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Given the following scenario and rules, explain the synchronous and asynchronous ways of completing the task. - -You have 2 reports to write, and 2 wooden pencils. 1 sharpened pencil will write 1/2 of 1 report. You have a helper that is willing to sharpen 1 pencil at a time, for you, and that helper is able to sharpen a pencil in the time it takes to write 1/2 of 1 report. - -[IMPORTANT] -==== -You can assume you start with 2 sharpened pencils. Of course, if you assumed otherwise before the project was modified, you will get full credit with a different assumption. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Write Python code that simulates the scenario in question (2) that is synchronous. Make the time it takes to sharpen a pencil be 2 seconds. Make the time it takes to write .5 reports 5 seconds. - -[TIP] -==== -Use `time.sleep` to accomplish this. -==== - -How much time does it take to write the reports in theory? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -[IMPORTANT] -==== -The original text of the question is below. This is too difficult to do for this project. For this question, **you are not required to write the code yourself**. Rather, just answer the theoretical component to the question. - -This question will be addressed in a future project, with better examples, and many more hints. -==== - -Read https://stackoverflow.com/questions/50757497/simplest-async-await-example-possible-in-python[the StackOverflow post] and write Python code that simulates the scenario in question (2) that is asynchronous. The time it takes to sharpen a pencil is 2 seconds and the time it takes to write .5 reports is 5 seconds. - -[TIP] -==== -Use _async_ functions and `asyncio.sleep` to accomplish this. -==== - -How much time does it take to write the reports in theory? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -In your own words, describe the difference between concurrency and parallelism. 
Then, look at the flights datasets here: `/depot/datamine/data/flights/subset`. Describe an operation that you could do to the entire dataset as a whole. Describe how you (in theory) could parallelize that process. - -Now, assume that you had the entire frontend system at your disposal. Use a UNIX command to find out how many cores the frontend has. If processing 1 file took 10 seconds to do. How many seconds would it take to process all of the files? Now, approximately how many seconds would it take to process all the files if you had the ability to parallelize on this system? - -Don't worry about overhead or the like. Just think at a very high level. - -[TIP] -==== -Best make sure this sounds like a task you'd actually like to do -- I _may_ be asking you to do it in the not-too-distant future. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project02.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project02.adoc deleted file mode 100644 index e9f9128c1..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project02.adoc +++ /dev/null @@ -1,588 +0,0 @@ -= STAT 39000: Project 2 -- Spring 2022 - -**Motivation:** The the previous project, we very slowly started to learn about asynchronous vs synchronous programming. Mostly, you just had to describe scenarios, whether they were synchronous or asynchronous, and you had to explain things at a high level. In this project, we will dig into some asynchronous code, and learn the very basics. - -**Context:** This is the first project in a series of three projects that explore sync vs. async, parallelism, and concurrency. - -**Scope:** Python, coroutines, tasks - -.Learning Objectives -**** -- Understand the difference between synchronous and asynchronous programming. -- Identify, create, and await coroutines. -- Properly identify the order in which asynchronous code is executed. -- Utilize 1 or more synchronizing primitives to ensure that asynchronous code is executed in the correct order. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/noaa/*.csv` - -== Questions - -=== Question 1 - -In the _original_ version of the previous project, I gave you the following scenario. - -[quote, , the-examples-book.com] -____ -You have 2 reports to write, and 2 wooden pencils. 1 sharpened pencil will write 1/2 of 1 report. You have a helper that is willing to sharpen 1 pencil at a time, for you, and that helper is able to sharpen a pencil in the time it takes to write 1/2 of 1 report. You can assume that you start with 2 sharpened pencils. -____ - -I then asked, in question (4), for you to write _asynchronous_ Python code that simulates the scenario. 
In addition, I asked you to write the amount of time proper asynchronous code would take to run. While the first part of the question was unfair to ask (yet), the second part was not. - -In this _asynchronous_ example, the author could start with the first sharpened pencil and write 1/2 of the report in 5 seconds. Next, hand the first pencil off to the assistant to help sharpen it. While that is happening, use the second pencil to write the second half of the first report. Next, receive the first (now sharpened) pencil back from the assistant and hand the second pencil to the assistant to be sharpened. While the assistant was sharpening the second pencil, you would write the first half of the second report. The assistant would return the (now sharpened) second pencil back to you to finish the second report. This process would (in theory) take 20 seconds as the assistant would be sharpening pencils at the same time you are writing the report. As an effect, you could exclude the 4 seconds it takes to sharpen both pencils once, from our synchronous solution of 24 seconds. - -In this project we will examine how to write asynchronous code that simulates the scenario, in a variety of ways that will teach you how to write asynchronous code. At the end of the project, you will write your own asynchronous code that will speed up a web scraping task. Let's get started! - -First thing is first. A few extremely astute students noticed an issue when trying to run async code in Jupyter Lab. Jupyter Lab has its own event loop already running, which causes problems if you were to try to run your own event loop. To get by this, we can use a package that automatically _nests_ our event loops, so things work _mostly_ as we would expect. - -[source,python] ----- -import asyncio -import nest_asyncio -nest_asyncio.apply() - -asyncio.run(simulate_story()) ----- - -Fill in the skeleton code below to simulate the scenario. Use **only** the provided functions, `sharpen_pencil`, and `write_half_report`, and the `await` keyword. - -[source,python] ----- -async def sharpen_pencil(): - await asyncio.sleep(2) - -async def write_half_report(): - await asyncio.sleep(5) - -async def simulate_story(): - # Write first half of report with first pencil - - # Hand pencil off to assistant to sharpen - - # Write second half of report with second pencil - - # Hand second pencil back to assistant to sharpen - # take first (now sharpened) pencil back from assistant - - # Write the first half of second essay with first pencil - - # Take second (now sharpened) pencil back from assistant - # and write the second half of the second report ----- - -Run the simulation in a new cell as follows. - -[source,ipython] ----- -%%time - -import asyncio -import nest_asyncio - -nest_asyncio.apply() - -asyncio.run(simulate_story()) ----- - -How long does you code take to run? Does it take the expected 20 seconds? If you have an idea why or why not, please try to explain. Otherwise, just say "I don't know". - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -If you don't have any experience writing asynchronous code, this might be pretty confusing! That's okay, it is _much_ easier to get confused writing asynchronous code than it is to write synchronous code. - -Let's break it down. First, the `asyncio.run` function takes care of the bulk of the work. It starts the _event loop_, finalizes asynchronous generators, and closes the threadpool. 
All you need to take from it is "it takes care of a lot of ugly magic". - -Any function that starts with `async` is an asynchronous function. _Calling_ an async function produces a _coroutine_. A coroutine is a function that has the ability to have its progress be pauses and restarted at will. - -For example, if you called the following async function, it will not execute, but rather it will just create a coroutine object. - -[source,python] ----- -async def foo(): - await asyncio.sleep(5) - print("Hello") - -foo() ----- - -.Output ----- - ----- - -To actually run the coroutine, you need to call the `asyncio.run` function. - -[source,python] ----- -asyncio.run(foo()) ----- - -.Output ----- -Hello ----- - -Of course, it doesn't make sense to call `asyncio.run` for each and every coroutine you create. It makes more sense to spin up the event loop once and handle the processes while it is running. - -[source,ipython] ----- -%%time - -import asyncio -import nest_asyncio -nest_asyncio.apply() - -async def foo(): - await asyncio.sleep(5) - print("Hello") - -async def bar(): - await asyncio.sleep(2) - print("World") - -async def main(): - await foo() - await bar() - -asyncio.run(main()) ----- - -Run the code, what is the output? - -Let's take a step back. _Why_ is asynchronous code useful? What do our `asyncio.sleep` calls represent? One of the slowest parts of a program is waiting for I/O or input/output. It takes time to wait for the operating system and hardware. If you are doing a lot of IO in your program, you could take advantage and perform other operations while waiting! This is what the `asyncio.sleep` calls _represent_ -- IO! - -Any program where the IO speed limits the speed of the program is called _I/O Bound_. Any program where the program speed is limited by how fast the CPU can process the instructions is called _CPU Bound_. Async programming can drastically speed up _I/O Bound_ software! - -Okay, back to the code from above. What is the output? You may have expected `foo` to run, then, while `foo` is "doing some IO (sleeping)", `bar` will run. Then, in a total of 5 seconds, you may have expected "World Hello" to be printed. While the `foo` is sleeping, `bar` runs, gets done in 2 seconds, goes back to `foo` and finishes in another 3 seconds, right? Nope. - -What happens is that when we _await_ for `foo`, Python suspends the execution of `main` until `foo` is done. Then it resumes execution of `main` and suspends it again until `bar` is done for an approximate time of 7 seconds. We want both coroutines to run concurrently, not one at a time! How do we fix it? The easiest would be to use `asyncio.gather`. - -[source,python] ----- -%%time - -import asyncio -import nest_asyncio - -nest_asyncio.apply() - -async def foo(): - await asyncio.sleep(5) - print("Hello") - -async def bar(): - await asyncio.sleep(2) - print("World") - -async def main(): - await asyncio.gather(foo(), bar()) - -asyncio.run(main()) ----- - -`asyncio.gather` takes a list of awaitable objects and runs them concurrently by scheduling them as a _task_. Running the code above should work as expected, and run in approximately 5 seconds. We gain 2 seconds in performance since both `foo` and `bar` run concurrently. While `foo` is sleeping, `bar` is running and completes. We gain 2 seconds while those functions overlap. - -What is a _task_? You can read about tasks https://docs.python.org/3/library/asyncio-task.html#asyncio.Task[here]. A task is an object that runs a coroutine. 
The easiest way to create a task is to use the `asyncio.create_task` method. For example, if instead of awaiting both `foo` and `bar`, we scheduled `foo` as a task, you would get _mostly_ the same result as if you used `asyncio.gather`. - -[source,python] ----- -%%time - -import asyncio -import nest_asyncio - -nest_asyncio.apply() - -async def foo(): - await asyncio.sleep(5) - print("Hello") - -async def bar(): - await asyncio.sleep(2) - print("World") - -async def main(): - asyncio.create_task(foo()) - await bar() - -asyncio.run(main()) ----- - -As you can see, "World" prints in a couple of seconds, and 3 seconds later "Hello" prints, for a total execution time of 5 seconds. With that being said, something is odd with our output. - -.Output ----- -World -CPU times: user 2.57 ms, sys: 1.06 ms, total: 3.63 ms -Wall time: 2 s -Hello ----- - -It says that it executed in 2 seconds, not 5. In addition, "Hello" prints _after_ Jupyter says our execution completed. Why? Well, if you read https://docs.python.org/3/library/asyncio-task.html#creating-tasks[here], you will see that `asyncio.create_task` takes a coroutine (in our case the output from `foo()`), and schedules it as a _task_ in the event loop returned by `asyncio.get_running_loop()`. This is the critical part -- it is scheduling the coroutine created by `foo()` to run on the same event loop that Jupyter Lab is running on, so even though our event loop created by `asyncio.run` stopped execution, `foo` ran until complete instead of cancelling as soon as `bar` was awaited! To observe this, open a terminal and run the following code to launch a Python interpreter: - -[source,bash] ----- -module use /scratch/brown/kamstut/tdm/opt/modulefiles -module load python/f2021-s2022-py3.9.6 -python3 ----- - -Then, in the Python interpreter, run the following. - -[NOTE] -==== -You may need to type it out manually. -==== - -[source,python] ----- -import asyncio - -async def foo(): - await asyncio.sleep(5) - print("Hello") - -async def bar(): - - await asyncio.sleep(2) - print("World") - -async def main(): - asyncio.create_task(foo()) - await bar() - -asyncio.run(main()) ----- - -As you can see, the output is _not_ the same as when you run it from _within_ the Jupyter notebook. Instead of: - -.Output ----- -World -CPU times: user 2.57 ms, sys: 1.06 ms, total: 3.63 ms -Wall time: 2 s -Hello ----- - -You should get: - -.Output ----- -World ----- - -This is because this time, there is no confusion on which event loop to use when scheduling a task. Once we reach the end of `main`, the event loop is stopped and any tasks scheduled are terminated -- even if they haven't finished (like `foo`, in our example). If you wanted to modify `main` in order to wait for `foo` to complete, you could _await_ the task _after_ you await `bar()`. - -[IMPORTANT] -==== -Note that this will work: - -[source,python] ----- -async def main(): - task = asyncio.create_task(foo()) - await bar() - await task ----- - -But this, will not: - -[source,python] ----- -async def main(): - task = asyncio.create_task(foo()) - await task - await bar() ----- - -The reason is that as soon as you call `await task`, `main` is suspended until the task is complete, which prevents both coroutines from executing concurrently (and we miss out on our 2 second performance gain). If you wait to call `await task` _after_ `await bar()`, our task (`foo`) will continue to run concurrently as a task on our event loop along side `bar` (and we get our 2 second performance gain). 
In addition, `asyncio.run` will wait until `task` is finished before terminating execution, because we awaited it at the very end. -==== - -In the same way that `asyncio.create_task` schedules the coroutines as tasks on the event loop (immediately), so does `asyncio.gather`. In a previous example, we _awaited_ our call to `asyncio.gather`. - -[source,python] ----- -%%time - -import asyncio -import nest_asyncio - -nest_asyncio.apply() - -async def foo(): - await asyncio.sleep(5) - print("Hello") - -async def bar(): - await asyncio.sleep(2) - print("World") - -async def main(): - await asyncio.gather(foo(), bar()) - -asyncio.run(main()) ----- - -.Output ----- -World -Hello -CPU times: user 3.41 ms, sys: 1.96 ms, total: 5.37 ms -Wall time: 5.01 s ----- - -This is critical, otherwise, `main` would execute immediately and terminate before either `foo` or `bar` finished. - -[source,python] ----- -%%time - -import asyncio -import nest_asyncio - -nest_asyncio.apply() - -async def foo(): - await asyncio.sleep(5) - print("Hello") - -async def bar(): - await asyncio.sleep(2) - print("World") - -async def main(): - asyncio.gather(foo(), bar()) - -asyncio.run(main()) ----- - -.Output ----- -CPU times: user 432 µs, sys: 0 ns, total: 432 µs -Wall time: 443 µs -World -Hello ----- - -As you can see, since we did not await our `asyncio.gather` call, `main` ran and finished immediately. The only reason "World" and "Hello" printed is that they finished running on the event loop that Jupyter uses instead of the loop we created using our call to `asyncio.run`. If you were to run the code from a Python interpreter instead of from Jupyter Lab, neither "World" nor "Hello" would print. - -[CAUTION] -==== -I know this is a _lot_ to take in for a single question. If you aren't quite following at this point I'd highly encourage you to post questions in Piazza before continuing, or rereading things until it starts to make sense. -==== - -Modify your `simulate_story` function from question (1) so that `sharpen_pencil` runs concurrently with `write_quarter`, and the total execution time is about 20 seconds. - -[IMPORTANT] -==== -Some important notes to keep in mind: - -- Make sure that the "rules" are still followed. You can still only write 1 quarter of the report at a time. -- Make sure that your code awaits what needs to be awaited -- even if _technically_ those tasks would execute prior to `simulate_story` finishing. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -That last question was quite a bit to take in! It is ok if it hasn't all clicked! I'd encourage you to post questions in Piazza, and continue to mess around with simple async examples until it makes more sense. It will help us explain things better and improve things for the next group of students! - -There are a couple of straightforward ways you could solve the previous question (well technically there are even more). One way involves queuing up the `sharpen_pencil` coroutines as tasks that run concurrently, and awaiting them at the end. The other involves using `asyncio.gather` to queue up select `write_quarter` and `sharpen_pencil` tasks to run concurrently, and await them. - -While both of these methods do a great job simulating our simple story, there may be instances where a greater amount of control may be needed. In such circumstances, https://docs.python.org/3/library/asyncio-sync.html[the Python synchronization primitives] may be useful! 
- -Read about the https://docs.python.org/3/library/asyncio-sync.html#asyncio.Event[Event primitive], in particular. This primitive allows us to notify one or more async tasks that _something_ has happened. This is particularly useful if you want some async code to wait for other async code to run before continuing on. Cool, how does it work? Let's say I want to yell, but before I yell, I want the megaphone to be ready. - -First, create an event, that represents some event. - -[source,python] ----- -import asyncio - -async def yell(words, wait_for): - print(f"{words.upper()}") - -# create an event -megaphone_ready = asyncio.Event() ----- - -To wait to continue until the event has occurred, you just need to `await` the coroutine created by calling `my_event.wait()`. So in our case, we can add `my_event.wait()` before we yell in the `yell` function. - -[source,python] ----- -async def yell(words, wait_for): - await wait_for.wait() - print(f"{words.upper()}") ----- - -By default, our `Event` is set to `False` since the event has _not_ occurred. The `yell` task will continue to await our event until the event is marked as _set_. To mark our event as set, we would use the `set` method. - -[source,python] ----- -import asyncio - -async def yell(words, wait_for): - await wait_for.wait() - print(f"{words.upper()}") - -async def main(): - megaphone_ready = asyncio.Event() # by default, it is not ready - - # create our yell task. Remember, tasks are immediately scheduled - # on the event loop to run. At this point, the await wait_for.wait() - # part of our yell function will prevent the task from moving - # forward to the print statement until the event is set. - yell_task = asyncio.create_task(yell("Hello", megaphone_ready)) - - # let's say we have to dust off the megaphone for it to be ready - # and it takes 1 second to do so - await asyncio.sleep(1) - - # now, since we've dusted off the megaphone, we can mark it as ready - megaphone_ready.set() - - # at this point in time, the await wait_for.wait() part of our code - # from the yell function will be complete, and the yell function - # will move on to the print statement and actually yell - - # Finally, we want to await for our yell_task to finish - # if our yell_task wasn't a simple print statement, and tooks a few seconds - # to finish, this await would be necessary for the main function to run - # to completion. - await yell_task - -asyncio.run(main()) ----- - -Consider each of the following as a separate event: - -- Writing the first quarter of the report -- Writing the second quarter of the report -- Writing the third quarter of the report -- Writing the fourth quarter of the report -- Sharpening the first pencil -- Sharpening the second pencil - -Use the `Event` primitive to make our code run as intended, concurrently. Use the following code as a skeleton for your solution. Do **not** modify the code, just make additions. 
- -[source,python] ----- -%%time - -import asyncio -import nest_asyncio - -nest_asyncio.apply() - -async def write_quarter(current_event, events_to_wait_for = None): - # TODO: if events_to_wait_for is not None - # loop through the events and await them - - await asyncio.sleep(5) - - # TODO: at this point, the essay quarter has - # been written and we should mark the current - # event as set - - -async def sharpen_pencil(current_event, events_to_wait_for = None): - # TODO: if events_to_wait_for is not None - # loop through the events and await them - - await asyncio.sleep(2) - - # TODO: at this point, the essay quarter has - # been written and we should mark the current - # event as set - - -async def simulate_story(): - - # TODO: declare each of the 6 events in our story - - # TODO: add each function call to a list of tasks - # to be run concurrently. Should be something similar to - # tasks = [write_quarter(something, [something,]), ...] - tasks = [] - - await asyncio.gather(*tasks) - -asyncio.run(simulate_story()) ----- - -[TIP] -==== -The `current_event` is passed so we can mark it as set once the event has occurred. -==== - -[TIP] -==== -The `events_to_wait_for` is passed so we can await them before continuing. This ensures that we don't try and sharpen the first pencil until after we've written the first quarter of the essay. Or ensures that we don't write the third quarter of the essay until after the first pencil has been sharpened. -==== - -[TIP] -==== -The code you will add to `write_quarter` will be identical to the code you will add to `sharpen_pencil`. -==== - -[TIP] -==== -The `events_to_wait_for` is expected to be iterable (a list). Make sure you pass a single event in a list if you only have one event to wait for. -==== - -[TIP] -==== -It should take about 20 seconds to run. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -While it is certainly useful to have some experience with async programming in Python, the context in which most data scientists will deal with it is writing APIs using something like `fastapi`, where a deep knowledge of async programming isn't really needed. - -What _will_ be pretty common is the need to speed up code. One of the primary ways to do that is to parallelize your code. - -In the previous project, in question (5), you described an operation that you could do to the entire flights dataset (`/depot/datamine/data/flights/subset`). In this situation, where you have a collection of neatly formatted datasets, a good first step would be to write a function that accepts a two paths as arguments. The first path could be the absolute path to the dataset to be processed. The second path could be the absolute path of the intermediate output file. Then, the function could process the dataset and output the intermediate calculations. - -For example, let's say you wanted to count how many flights in the dataset as a whole. Then, you could write a function to read in the dataset, count the flights, and output a file containing the number of flights. This would be easily parallelizable because you could process each of the files individually, in parallel, and at the very end, sum up the data in the output file. - -Write a function that is "ready" to be parallelized, and that follows the operation you described in question (5) in the previous project. Test out the function on at least 2 of the datasets in `/depot/datamine/data/flights/subset`. 
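As a rough illustration (a sketch, not a prescribed solution), a function "ready" to be parallelized for the flight-counting example above might look something like the following -- the output file names here are purely illustrative, and your operation from question (5) will likely differ.

[source,python]
----
def count_flights(dataset_path: str, output_path: str) -> None:
    # count the data rows in one flights csv, skipping the header line
    with open(dataset_path) as f:
        n_flights = sum(1 for _ in f) - 1

    # write the intermediate result so a later step can sum across all files
    with open(output_path, "w") as f:
        f.write(str(n_flights))

# illustrative output paths only -- point these at your own directory
count_flights("/depot/datamine/data/flights/subset/1987.csv", "1987_count.txt")
count_flights("/depot/datamine/data/flights/subset/1988.csv", "1988_count.txt")
----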
- -[TIP] -==== -In the next project, we will parallelize and run some benchmarks. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project03.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project03.adoc deleted file mode 100644 index f09a14c0b..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project03.adoc +++ /dev/null @@ -1,220 +0,0 @@ -= STAT 39000: Project 3 -- Spring 2022 - -**Motivation:** When working with large amounts of data, it is sometimes critical to take advantage of modern hardware and _parallelize_ the computation. Depending on the problem, parallelization can massively reduce the amount of time to process something. - -**Context:** This is the second in a series of 3 projects that explore sync vs. async, parallelism, and concurrency. For some, the projects so far may have been a bit intense. This project will slow down a bit, run some fun experiments, and try to _start_ clarifying some confusion that is sometimes present with terms like threads, concurrency, parallelism, cores, etc. - -**Scope:** Python, threads, parallelism, concurrency, joblib - -.Learning Objectives -**** -- Distinguish between threads and processes at a high level. -- Demonstrate the ability to parallelize code. -- Identify and approximate the amount of time to process data after parallelization. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/flights/subset/*.csv` - -== Questions - -=== Question 1 - -`joblib` is a Python library that makes many parallelization tasks easier. Run the following code in three separate code cells. But, before you do, look at the code and write down approximately how much time you think each cell will take to run. 1 call to `run_for` will take roughly 2.25 seconds on a Brown cpu. Take note that we currently have 1 cpu for this notebook. - -[source,python] ----- -import time -import joblib -from joblib import Parallel -from joblib import delayed - -def run_for(): - var = 0 - while var < 10**7.5: - var += 1 - - print(var) ----- - -[source,ipython] ----- -%%time -test = [run_for() for i in range(4)] ----- - -[source,ipython] ----- -%%time -test = Parallel(n_jobs=4, backend="multiprocessing")(delayed(run_for)() for i in range(4)) ----- - -[source,ipython] ----- -%%time -test = Parallel(n_jobs=4, backend="threading")(delayed(run_for)() for i in range(4)) ----- - -Were you correct? Great! We only have 1 cpu, so regardless if we chose to use 2 threads or 2 processes, only 1 cpu would be used and 1 thing executed at a time. - -**threading:** This backend for `joblib` will use threads to run the tasks. 
Even though we only have a single cpu, we can still create as many threads as we want, however, due to Python's GIL (Global Interpreter Lock), only 1 thread can execute at a time. - -**multiprocessing:** This backend for `joblib` will use processes to run the tasks. In the same way we can create as many threads as we want, we can also create as many processes as we want. A _process_ is created by an os function called `fork()`. A _process_ can have 1 or more _threads_ or _threads of execution_, in fact, typically a process must have at least 1 _thread_. _Threads_ are much faster and take fewer resources to create. Instead of `fork()` a thread is created by `clone()`. A single cpu can have multiple processes or threads, but can only execute 1 task at a time. As a result, we end up with the same amount of time used to run. - -When writing a program, you could make your program create a process that spawns multiple threads. Those threads could then each run in parallel, 1 per cpu. Alternatively, you could write a program that has a single thread of execution, and choose to execute the program _n_ times creating _n_ processes that each run in parallel, 1 per cpu. The advantage of the former is that threads are lighter weight and take less resources to create, an advantage of the latter is that you could more easily distribute such a program onto many systems to run without having to worry about how many threads to spawn based on how many cpus you have available. - -Okay, let's take a look at this next example. Run the following (still with just 1 cpu). - -[source,ipython] ----- -%%time -test = [time.sleep(2) for i in range(4)] ----- - -[source,ipython] ----- -%%time -test = Parallel(n_jobs=4, backend="multiprocessing")(delayed(time.sleep)(2) for i in range(4)) ----- - -[source,ipython] ----- -%%time -test = Parallel(n_jobs=4, backend="threading")(delayed(time.sleep)(2) for i in range(4)) ----- - -Did you get it right this time? If not, it is most likely that you thought all 3 would take about 8 seconds. We only have 1 cpu, after all. Let's try to explain. - -**threading:** Like we mentioned before, due to Python's GIL, we can only execute 1 thread at a time. So why did our example only take about 2 seconds if only 1 thread can execute at a time? `time.sleep` is a function that will release Python's GIL (Global Interpreter Lock) because it is not actually utilizing the CPU while sleeping. It is _not_ the same as running an intensive loop for 2 seconds (like our previous example). Therefore the first thread can execute, the GIL is released, the second thread begins execution, rinse and repeat. The only execution that occurs is each thread consecutively starting `time.sleep`. Then, after about 2 seconds all 4 `time.sleep` calls are done, even though the cpu was not utilized much at all. - -**multiprocessing:** In this case, we are bypassing the restrictions that the GIL imposes on threads, BUT, `time.sleep` still doesn't need the cpu cycles to run, so the end result is the same as the threading backend, and all calls "run" at the same time. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Okay, let's try something! Save your notebook (and output from question 1), and completely close and delete your ondemand session. Then, launch a new notebook, but instead of choosing 1 core, choose 4. Run the following code, but before you do, guess how much time each will take to run. 
- -[source,python] ----- -import time - -def run_for(): - var = 0 - while var < 10**7.5: - var += 1 - - print(var) ----- - -[source,ipython] ----- -%%time -test = [run_for() for i in range(4)] ----- - -[source,ipython] ----- -%%time -test = Parallel(n_jobs=4, backend="multiprocessing")(delayed(run_for)() for i in range(4)) ----- - -[source,ipython] ----- -%%time -test = Parallel(n_jobs=4, backend="threading")(delayed(run_for)() for i in range(4)) ----- - -How did you do this time? You may or may not have guessed, but the threading version took the same amount of time, but the multiprocessing backend was just about 4 times faster! What gives? - -Whereas Python's GIL will prevent more than a single thread from executing at a time, when `joblib` uses processes, it is not bound by the same rules. A _process_ is something created by the operating system that has its own address space, id, variables, heap, file descriptors, etc. As such, when `joblib` uses the multiprocessing backend, it creates new Python processes to work on the tasks, bypassing the GIL because it is _n_ separate processes and Python instances, not a single Python instance with _n_ threads of execution. - -In general, Python is not a good choice for writing a program that is best written using threads. However, you _can_ write code, especially using certain package (including numpy) that release the GIL. - -For example, check out the results of the following code. - -[source,python] ----- -def no_gil(): - x = np.linalg.inv(np.random.normal(0, 1, (3000,3000))) ----- - -[source,ipython] ----- -%%time -test = [no_gil() for i in range(4)] ----- - -[source,ipython] ----- -%%time -test = Parallel(n_jobs=4, backend="multiprocessing")(delayed(no_gil)() for i in range(4)) ----- - -[source,ipython] ----- -%%time -test = Parallel(n_jobs=4, backend="threading")(delayed(no_gil)() for i in range(4)) ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Okay, great, let me parallelize something! Okay, sounds good. - -The task is to count all of the lines in all of the files in `/depot/datamine/data/flights/subset/*.csv`, from the `1987.csv` to `2008.csv`, excluding all other csvs. - -First, write a non-parallelized solution that opens each file, counts the lines, adds the count to a total, closes the file, and repeats for all files. At the end, print the total number of lines. Put the code into a code cell and time the code cell using `%%time` magic. - -Now, write a parallelized solution that does the same thing. Put the code intoa code cell and time the code cell using `%%time` magic. - -Make sure you are using a Jupyter Lab session with 4 cores. - -[TIP] -==== -Some optional tips: - -- Write a function that accepts an absolute path to a file (as a string), as well as an absolute path to a file in directory (as a string). -- The function should output the count of lines from the file represented by the first argument in the file specified in the second argument. -- Parallelize the function using `joblib`. -- After the `joblib` job is done, cycle through all of the output files, sum the counts, and print the total. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Parallelize the task and function that you have been writing about in the past 2 projects. If you are struggling or need help, be sure to ask for help in Piazza! 
If after further thinking, what you specified in the previous project is not easily parallelizable, feel free to change the task to some other, actually parallelizable task! - -Please time the task using `%%time` magic, both _before_ and _after_ parallelizing the task -- after all, its not any fun if you can't see the difference! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project04.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project04.adoc deleted file mode 100644 index 41bb79474..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project04.adoc +++ /dev/null @@ -1,141 +0,0 @@ -= STAT 39000: Project 4 -- Spring 2022 - -== snow way, that is pretty quick - -**Motivation:** We got some exposure to parallelizing code in the previous project. Let's keep practicing in this project! - -**Context:** This is the last in a series of projects focused on parallelizing code using `joblib` and Python. - -**Scope:** Python, joblib - -.Learning Objectives -**** -- Demonstrate the ability to parallelize code using `joblib`. -- Identify and approximate the amount of time to process data after parallelization. -- Demonstrate the ability to scrape and process large amounts of data. -- Utilize `$SCRATCH` to store temporary data. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -Check out the data located here: https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/ - -Normally, you are provided with clean, or semi clean sets of data, you read it in, do something, and call it a day. In this project, you are going to go get your own data, and although the questions won't be difficult, they will have less guidance than normal. Try and tap in to what you learned in previous projects, and of course, if you get stuck just shoot us a message in Piazza and we will help! - -As you can see, the yearly datasets vary greatly in size. What is the average size in MB of the datasets? How many datasets are there (excluding non year datasets)? What is the total download size (in GB)? Use the `requests` library and either `beautifulsoup4` or `lxml` to scrape and parse the webpage, so you can calculate these values. - -[CAUTION] -==== -Make sure to exclude any non-year files. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -The `1893.csv.gz` dataset appears to be about 1/1000th of our total download size -- perfect! Use the `requests` package to download the file, write the file to disk, and time the operations using https://pypi.org/project/block-timer/[this package] (which is already installed). - -If you had a single CPU, approximately how long would it take to download and write all of the files (in minutes)? 
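To make the estimate concrete, here is a minimal sketch that times a single download-and-write using the standard library's `time.perf_counter` (you can substitute the `block-timer` package linked above for the timing). It assumes the yearly files sit directly under the `by_year/` listing from question (1).

[source,python]
----
import time

import requests

# assumption: yearly files live directly under the by_year/ listing
url = "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/1893.csv.gz"

start = time.perf_counter()
resp = requests.get(url)
with open("1893.csv.gz", "wb") as f:
    f.write(resp.content)
elapsed = time.perf_counter() - start

print(f"downloaded and wrote 1893.csv.gz in {elapsed:.2f} seconds")

# single-cpu estimate: multiply by the number of yearly files counted in question (1)
# print(f"roughly {elapsed * n_files / 60:.1f} minutes total")
----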
- -[TIP] -==== -The following is an example of how to write a scraped file to disk. - -[source,python] ----- -resp = requests.get(url) -with open("my_file.csv.gz", "wb") as f: - f.write(resp.content) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -You can request up to 24 cores using OnDemand and our Jupyter Lab app. Save your work and request a new session with a minimum of 4 cores. Write parallelized code that downloads _all_ of the datasets (from 1750 to 2022) into your `$SCRATCH` directory. Before running the code, estimate the total amount of time this _should_ take. - -[WARNING] -==== -If you request more than 4 cores, **please** make sure to delete your Jupyter Lab session once your code has run and instead use a session with 4 or fewer cores. -==== - -[CAUTION] -==== -There aren't datasets for 1751-1762 -- so be sure to handle this. Perhaps you could look at the `response.status_code` and make sure it is 200? -==== - -Time how long it takes to download and write all of the files. Was your estimation close (within a minute or two)? - -[TIP] -==== -Remember, your `$SCRATCH` directory is `/scratch/brown/ALIAS` where `ALIAS` is your username. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -In a previous question, I provided you with the code to actually extract and save content from our `requests` response. This is the sort of task that may not be obvious how to do. Learning how to use search engines like Google or Kagi to figure this out is critical. - -Figure out how to extract the csv file from each of the datasets using Python. Write parallelized code that loops through and extracts all of the data. The end result should be 1 csv file per year. Like in the previous question, measure the time it takes to extract 1 csv, and attempt to estimate how long it should take to extract all of them. Time the extraction and compare your estimation to reality. Were you close? - -[WARNING] -==== -If you request more than 4 cores, **please** make sure to delete your Jupyter Lab session once your code has run and instead use a session with 4 or fewer cores. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Unzipped, your datasets total X! That is a lot of data! - -You can read https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt[here] about what the data means. - -. 11 character station ID -. 8 character date in YYYYMMDD format -. 4 character element code (you can see the element codes https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt[here] in section III) -. value of the data (varies based on the element code) -. 1 character M-flag (10 possible values, see section III https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt[here]) -. 1 character Q-flag (14 possible values, see section III https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt[here]) -. 1 character S-flag (30 possible values, see section III https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt[here]) -. 4 character observation time (HHMM) (0700 = 7:00 AM) -- may be blank - -It has been a snowy week, so use your parallelization skills to figure out _something_ about snowfall. For example, maybe you want to find the last time or the last year in which X amount of snow fell.
Or maybe you want to find the station id for the location who has had the most instances of over X amount of snow. Get creative! You may create plots to supplement your work (if you want). - -Any good effort will receive full credit. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project05.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project05.adoc deleted file mode 100644 index b3b8358f5..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project05.adoc +++ /dev/null @@ -1,410 +0,0 @@ -= STAT 39000: Project 5 -- Spring 2022 - -**Motivation:** In this project we will slowly get familiar with SLURM, the job scheduler installed on our clusters at Purdue, including Brown. - -**Context:** This is the first in a series of 3 projects focused on parallel computing using SLURM and Python. - -**Scope:** SLURM, unix, Python - -.Learning Objectives -**** -- Use basic SLURM commands to submit jobs, check job status, kill jobs, and more. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/coco/unlabeled2017/*.jpg` - -== Questions - -=== Question 1 - -[IMPORTANT] -==== -This project (and the next) will have different types of deliverables. Each question will result in an entry in a Jupyter notebook, and/or 1 or more additional Python and/or Bash scripts. - -To properly save screenshots in your Jupyter notebook, follow the guidelines xref:templates.adoc#including-an-image-in-your-notebook[here]. Images that don't appear in your notebook in Gradescope will not get credit. -==== - -[WARNING] -==== -When you start your JupyterLab session this week, BEFORE you start your session, please change "Processor cores requested" from 1 to 4. We will use 4 processing cores this week. We might change this back to 1 processing core for the Jupyter Lab session next week; please stay tuned. -==== - -Most of the supercomputers here at Purdue contain one or more frontends. Users can log in and submit jobs to run on one or more backends. To submit a job, users use SLURM. - -SLURM is a job scheduler found on about 60% of the top 500 supercomputers.footnote:[https://en.wikipedia.org/wiki/Slurm_Workload_Manager[https://en.wikipedia.org/wiki/Slurm_Workload_Manager]] In this project (and the next) we will learn about ways to schedule jobs on SLURM, and learn the tools used. - -Let's get started by using a script called `sinteractive` written by Lev Gorenstein, here at Purdue. A brief explanation is that `sinteractive` gets some resources (think memory and cpus), and logs you into that "virtual" system. - -Open a terminal and give it a try. 
- -[source,bash] ----- -sinteractive -A datamine -n 3 -c 1 -t 00:05:00 --mem=4000 ----- - -After some output, you should notice that your shell changed. Type `hostname` followed by enter to see that your host has changed from `brown-feXX.rcac.purdue.edu` to `brown-aXXX.rcac.purdue.edu`. You are in a different system! Very cool! - -To find out what the other options are read https://slurm.schedmd.com/salloc.html - -- The `-A datamine` option could have also been written `--account=datamine`. This indicates which account to use when allocating the resources (memory and cpus). You can also think of this as a "queue" or "the datamine queue". Jobs submitted using this option will use the resources we pay for. Only users with permissions can submit to our queue. -- The `-n 3` option could have also been written `--ntasks=3`. This indicates how many cpus/tasks we may need for the job. -- The `-c 1` option could have also been written `--cpus-per-task=1`. This indicates the number of processors per task. -- The `-t 00:05:00` option could have also been written `--time=00:05:00`. This indicates how long the job may run for. If the time exceeds the time limit, the job is killed. -- The `--mem=4000` option indicates how much memory (in MB) we may need for the job. If you want to specify the amount of memory per task, you could use `--mem-per-task`. - -[NOTE] -==== -Another common option is `-N` or `--nodes`. This indicates how many nodes we may need for the job. A node is a single backend computer. If `-N` is unspecified, the default behavior is to allocate enough nodes to satisfy the requirements of the `-n` and `-c` options. For this course we will break our jobs down into a certain number of tasks, so using the `-n` option makes more sense, and is more flexible as tasks can be distributed on many nodes. -==== - -To confirm, use the following script to see how much memory and cpus we have available to us. - -[source,python] ----- -#!/usr/bin/python3 - -from pathlib import Path - -def main(): - with open("/proc/self/cgroup") as file: - for line in file: - if 'cpuset' in line: - cpu_loc = "cpuset" + line.split(":")[2].strip() - - if 'memory' in line: - mem_loc = "memory" + line.split(":")[2].strip() - - base_loc = Path("/sys/fs/cgroup/") - with open(base_loc / cpu_loc / "cpuset.cpus") as file: - num_cpus = len(file.read().strip().split(",")) - print(f"CPUS: {num_cpus}") - - with open(base_loc / mem_loc / "memory.limit_in_bytes") as file: - mem_in_bytes = int(file.read().strip()) - print(f"Memory: {mem_in_bytes/1024**2} Mbs") - -if __name__ == "__main__": - main() ----- - -To use it. - -[source,bash] ----- -./get_info.py ----- - -For this question, add a screenshot of running `hostname` on the `sinteractive` session, as well as `./get_info.py` to your notebook. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -`sinteractive` can be useful, but most of the time we want to run a job. - -Before we get started, read the answer to https://stackoverflow.com/questions/46506784/how-do-the-terms-job-task-and-step-relate-to-each-other[this] stackoverflow post. In many instances, it is easiest to use 1 cpu per task, and let SLURM distribute those tasks to run. In this course, we will use this simplified model. - -So what is the difference between `srun` and `sbatch`? https://stackoverflow.com/questions/43767866/slurm-srun-vs-sbatch-and-their-parameters[This] stackoverflow post does a pretty great job explaining. 
You can think of `sbatch` as the tool for submitting a job script for execution, and `srun` as the tool to submit a job to run. We will test out both! - -In the previous question, we used `sinteractive` to get the resources, hop onto the system, and run `hostname` along with out `get_info.py` script. - -Use `srun` to run our `get_info.py` script, to better understand how the various options work. Try and guess the results of the script for each configuration. - -[TIP] -==== -Be sure to give you `get_info.py` script execution permissions if you haven't already. - -[source,bash] ----- -chmod +x get_info.py ----- -==== - -.configurations to try ----- -srun -A datamine -n 2 -c 1 -t 00:00:05 --mem=4000 $HOME/get_info.py -srun -A datamine -n 2 -c 1 -t 00:00:05 --mem-per-cpu=4000 $HOME/get_info.py -srun -A datamine -N 1 -n 2 -c 1 -t 00:00:05 --mem-per-cpu=1000 $HOME/get_info.py -srun -A datamine -N 2 -n 2 -c 1 -t 00:00:05 --mem-per-cpu=1000 $HOME/get_info.py -srun -A datamine -N 2 -n 2 -c 1 -t 00:00:05 --mem=1000 $HOME/get_info.py -srun -A datamine -N 2 -n 3 -c 1 -t 00:00:05 --mem=1000 $HOME/get_info.py -srun -A datamine -N 2 -n 3 -c 1 -t 00:00:05 --mem-per-cpu=1000 $HOME/get_info.py -srun -A datamine -N 2 -n 3 -c 1 -t 00:00:05 --mem-per-cpu=1000 $HOME/get_info.py > $CLUSTER_SCRATCH/get_info.out ----- - -[NOTE] -==== -Check out the `get_info.py` script. SLURM uses cgroups to manage resources. Some of the more typical commands used to find the number of cpus and amount of memory don't work accurately when "within" a cgroup. This script figures out which cgroups you are "in" and parses the appropriate files to get your resource limitations. -==== - -Reading the explanation from SLURM's website is not enough to understand, running the configurations will help your understanding. If you have simple, parallel processes, that doesn't need to have any shared state, you can use a single `srun` per task. Each with `--mem-per-cpu` (so memory availability is more predictable), `-n 1`, `-c 1`, followed by `&` (just a reminder that `&` at the end of a bash command puts the process in the background). - -Reading the information about cgroups may lead you to wonder if the RCAC puts you into a cgroup when you are SSH'd into a frontend. Use our `get_info.py` script, along with other unix commands, to determine if you are in a cgroup. If you are in a cgroup, how many cpus and memory do you have? - -[TIP] -==== -If `get_info.py` does not match the resources you get using `free -h` or `lscpu` (for example), you are in a cgroup. -==== - -Finally, take note of the last configuration. What is the $CLUSTER_SCRATCH environment variable? - -For the answer to this question: - -. Add a screenshot of the results of some (not all) of you running the `get_info.py` script in the `srun` commands. -. Write 1-2 sentences about any observations. -. Include what the `$CLUSTER_SCRATCH` environment variable is. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -The following is a solid template for a job script. 
- -.job script template ----- -#!/bin/bash -#SBATCH --account=datamine -#SBATCH --job-name=serial_job_test # Job name -#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL) -#SBATCH --mail-user=me@purdue.edu # Where to send mail -#SBATCH --ntasks=1 # Number of tasks (total) -#SBATCH --cpus-per-task=1 # Number of processors per task -#SBATCH -o /dev/null # Output to dev null -#SBATCH -e /dev/null # Error to dev null - -echo "srun commands and other bash below" -wait ----- - -If we put all of our `srun` commands from the previous question into the same script, we want the output for each `srun` to be put into a uniquely named file, to be able to see the result of each command. - -Replace the `echo` command in the job script with our `srun` commands from the previous question. Also, direct the output from each command into a uniquely named file. Make sure to end each `srun` line in `&`. Make sure to specify the correct total number of tasks. - -To submit the job, run the following. - -[source,bash] ----- -sbatch my_job.sh ----- - -If the output files are not what you expected, copy your batch script and add the `--exclusive` flag to each `srun` command, then run it again. Read about the `--exclusive` option https://slurm.schedmd.com/srun.html[here] and do your best to explain what is happening. - -To answer this question: (1) submit both job scripts, (2) include a markdown cell containing your explanation of what happened before `--exclusive` was added to each `srun` command, and (3) include a markdown cell describing some of the outputs from each of the batch scripts. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -The more you practice, the clearer your understanding will be. So we will be putting our new skills to use to solve a problem. - -We begin with a dataset full of images: `/depot/datamine/data/coco/unlabeled2017/*.jpg`. - -We know a picture of Dr. Ward is (naturally) included in the folder. The problem is, Dr. Ward is sneaky and he has added a duplicate image of himself in our dataset. This duplicate could cause problems and we need a clean dataset. - -It is time-consuming and not best practice to manually go through the entire dataset to find the duplicate. Thinking back to some of the past work, we remember that a hash algorithm is a good way to identify the duplicate image. - -Below is code you could use to produce a hash of an image. - -[source,python] ----- -import hashlib - -with open("/path/to/myimage.jpg", "rb") as f: - print(hashlib.sha256(f.read()).hexdigest()) ----- - -[NOTE] -==== -In general, a hash function is a function that takes an input and produces a (practically) unique "hash", or alphanumeric string. This means that if you find two identical hashes, you can usually assume that the inputs are identical. -==== - -By finding the hash of all of the images in the folder, you can then use sets to quickly find the duplicate image. You can write a Python script that outputs a file containing the hash of each image. - -An example: a file called `000000000013.jpg` with the contents `7ad591844b88ee711d1eb60c4ee6bb776c4795e9cb4616560cb26d2799493afe`. - -Parallelizing the file creation and search process will make finding the duplicate faster.
- -[source,python] ----- -#!/usr/bin/python3 - -import os -import sys -import hashlib -import argparse - - -def hash_file_and_save(files, output_directory): - """ - Given an absolute path to a file, generate a hash of the file and save it - in the output directory with the same name as the original file. - """ - - for file in files: - file_name = os.path.basename(file) - file_hash = hashlib.sha256(open(file, "rb").read()).hexdigest() - output_file_path = os.path.join(output_directory, file_name) - with open(output_file_path, "w") as output_file: - output_file.write(file_hash) - - -def main(): - - parser = argparse.ArgumentParser() - subparsers = parser.add_subparsers(help="possible commands", dest='command') - hash_parser = subparsers.add_parser("hash", help="generate and save hash") - hash_parser.add_argument("files", help="files to hash", nargs="+") - hash_parser.add_argument("-o", "--output", help="directory to output file to", required=True) - - if len(sys.argv) == 1: - parser.print_help() - sys.exit(1) - - args = parser.parse_args() - - if args.command == "hash": - hash_file_and_save(args.files, args.output) - -if __name__ == "__main__": - main() ----- - -Quickly recognizing that it is not efficient to have an `srun` command for each image, you'd have to programmatically build the job script, also the script runs quickly so there would be a rapid build up wasted time with overhead related to calling `srun`, allocating resources, etc. Instead for efficency create a job script that splits the images into groups of 12500 or less. Then, within 10 `srun` commands you will be able to use the provided Python script to process the 12500 images. - -The Python script works as follows. - -[source,bash] ----- -./hash.py hash --output /path/to/outputfiles/ /path/to/image1.jpg /path/to/image2.jpg ----- - -[TIP] -==== -https://stackoverflow.com/questions/21668471/bash-script-create-array-of-all-files-in-a-directory[This] stackoverflow post shows how to get a Bash array full of absolute paths to files in a folder. -==== - -[TIP] -==== -To pass many arguments (_n_ arguments) to our Python script, you can `./hash.py hash --output /path/to/outputfiles/ ${my_array[@]}`. -==== - -[TIP] -==== -https://stackoverflow.com/questions/23747612/how-do-you-break-an-array-in-groups-of-n[This] stackoverflow post shows how to break an array of values into groups of _x_. -==== - -Create a job script that processes all of the images in the folder, and outputs the hash of each image into a file with the same name as the original image. Output these files into a folder in `$CLUSTER_SCRATCH`, so, for example, `$CLUSTER_SCRATCH/q4output`. - -[NOTE] -==== -This job took 2 minutes 34 seconds to run. -==== - -Once the images are all hashed, in your Jupyter notebook, write Python code that processes all of the hashes and prints out the name of one of the duplicate images. Display the image in your notebook using the following code. - -[source,python] ----- -from IPython import display -display.Image("/path/to/duplicate_image.jpg") ----- - -To answer this question, submit the functioning job script AND the code in the Jupyter notebook that was used to find (and display) the duplicate image. - -[TIP] -==== -Using sets will help find the duplicate image. One set can store new hashes that haven't yet been seen. The other set can store duplicates, since there is only 1 duplicate you can immediately return the filename when found! 
- -https://stackoverflow.com/questions/9835762/how-do-i-find-the-duplicates-in-a-list-and-create-another-list-with-them[This] stackoverflow post shares some ideas to manage this. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -.Solution -==== -.myjob.sh -[source,bash] ----- -#!/bin/bash -#SBATCH --account=datamine # Queue -#SBATCH --job-name=kevinsjob # Job name -#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL) -#SBATCH --mail-user=kamstut@purdue.edu # Where to send mail -#SBATCH --time=00:30:00 -#SBATCH --ntasks=10 # Number of tasks (total) -#SBATCH -o /dev/null # Output to dev null -#SBATCH -e /dev/null # Error to dev null - -arr=(/depot/datamine/data/coco/unlabeled2017/*) - -for((i=0; i < ${#arr[@]}; i+=12500)) -do - part=( "${arr[@]:i:12500}" ) - srun -A datamine --exclusive -n 1 --mem-per-cpu=200 $HOME/hash1.py hash --output $CLUSTER_SCRATCH/p4output/ ${part[*]} & -done - -wait ----- - -[source,bash] ----- -sbatch myjob.sh ----- - -[source, python] ----- -from pathlib import Path - -def get_duplicate(path): - path = Path(path) - files = path.glob("*.jpg") - uniques = set() - duplicates = set() - for file in files: - with open(file, 'r') as f: - hsh = f.readlines()[0].strip() - if hsh in uniques: - duplicates.add(file) - return(file) - else: - uniques.add(hsh) - -file = get_duplicate("/scratch/brown/kamstut/p4output/") - -from IPython.display import Image -Image(filename=f"/depot/datamine/data/coco/unlabeled2017/{file.name}") ----- - -.Output ----- - ----- -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project06.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project06.adoc deleted file mode 100644 index 53edcd975..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project06.adoc +++ /dev/null @@ -1,328 +0,0 @@ -= STAT 39000: Project 6 -- Spring 2022 - -**Motivation:** In this project we will continue to get familiar with SLURM, the job scheduler installed on our clusters at Purdue, including Brown. - -**Context:** This is the second in a series of (now) 4 projects focused on parallel computing using SLURM and Python. - -**Scope:** SLURM, unix, Python - -.Learning Objectives -**** -- Use basic SLURM commands to submit jobs, check job status, kill jobs, and more. -- Understand the differences between `srun` and `sbatch` commands. -- Predict the resources (cpus and memory) an `srun` job will use based on the arguments and context. -- Write and use a job script to solve a problem faster than you would be able to without a high performance computing (HPC) system. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/coco/attempt02/*.jpg` - -== Questions - -=== Question 1 - -[IMPORTANT] -==== -This project, and the next, will have a variety of different types of deliverables. Ultimately, each question will result in some entry in a Jupyter notebook, and/or 1 or more additional Python and/or Bash scripts. In addition, to properly save screenshots in your Jupyter notebook, please follow the guidelines xref:templates.adoc#including-an-image-in-your-notebook[here]. Images that don't appear in your notebook in Gradescope will not get credit. -==== - -In project 5, question 2, we asked you to test out a variety of `srun` commands with variations in the options. As you are probably now well-aware -- it can be difficult to understand what combination of parameters are needed. With that being said, in _this_ course, we will focus on jobs that can be perfectly or embarassingly parallel, and single core single threaded jobs. So, the following job script is a _safe_ and _effective_ way to break your jobs up. - -[source,bash] ----- -#!/bin/bash -#SBATCH --account=datamine -#SBATCH --job-name=serial_job_test # Job name -#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL) -#SBATCH --mail-user=me@purdue.edu # Where to send mail -#SBATCH --ntasks=3 # Number of tasks (total) -#SBATCH --cpus-per-task=1 # Number of cores per task -#SBATCH -o /dev/null # Output to dev null -#SBATCH -e /dev/null # Error to dev null - -srun --exclusive -n 1 --mem-per-cpu=1000 -t 00:00:00 some_command & -srun --exclusive -n 1 --mem-per-cpu=1000 -t 00:00:00 some_command & -srun --exclusive -n 1 --mem-per-cpu=1000 -t 00:00:00 some_command & - -wait ----- - -Just be sure to modify your job script `ntasks` and the amount of time and memory you need for each job step. - - -[NOTE] -==== -Remember, you use `sbatch` to submit a _job_. Your job will have _steps_ (each `srun` line). Your steps will have _tasks_ (in our case, each `srun` will run a single task). -==== - -[NOTE] -==== -To add to the difficulty you maybe had understanding the various options available to you, if you used a terminal from with Jupyter Lab, you were technically already in a SLURM job with `-c 4`! In this project, we will learn to side-step this complication from within the Jupyter Lab environment so it is equivalent to SSH'ing into a frontend node in a fresh session. -==== - -When inside a SLURM job, a variety of environment variables are set that alters how `srun` behaves. If you open a terminal from within Jupyter Lab and run the following, you will see. - -[source,bash] ----- -env | grep -i slurm ----- - -These variables altered the behavior of `srun`. We _can_ however, _unset_ these variables, and the behavior will revert to the default behavior. In your terminal, run the following. - -[source,bash] ----- -for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done; ----- - -Confirm that the environment variables are unset by running the following. - -[source,bash] ----- -env | grep -i slurm ----- - -Great! Now, we can work in our nice Jupyter Lab environment without any concern that SLURM environment variables are changing any behaviors. Let's test it out with something _actually_ predictable. 
- -.get_info.py -[source,python] ----- -#!/usr/bin/python3 - -import time -import socket -from pathlib import Path -import datetime - -def main(): - - print(f"started: {datetime.datetime.now()}") - print(socket.gethostname()) - - with open("/proc/self/cgroup") as file: - for line in file: - if 'cpuset' in line: - cpu_loc = "cpuset" + line.split(":")[2].strip() - - if 'memory' in line: - mem_loc = "memory" + line.split(":")[2].strip() - - base_loc = Path("/sys/fs/cgroup/") - with open(base_loc / cpu_loc / "cpuset.cpus") as file: - num_cpus = len(file.read().strip().split(",")) - print(f"CPUS: {num_cpus}") - - with open(base_loc / mem_loc / "memory.limit_in_bytes") as file: - mem_in_bytes = int(file.read().strip()) - print(f"Memory: {mem_in_bytes/1024**2} Mbs") - - time.sleep(3) - print(f"ended: {datetime.datetime.now()}") - -if __name__ == "__main__": - main() ----- - -.my_job.sh -[source,bash] ----- -#!/bin/bash -#SBATCH --account=datamine -#SBATCH --job-name=serial_job_test # Job name -#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL) -#SBATCH --mail-user=me@purdue.edu # Where to send mail -#SBATCH --ntasks=3 # Number of tasks (total) -#SBATCH --cpus-per-task=1 # Number of cores per task -#SBATCH -o /dev/null # Output to dev null -#SBATCH -e /dev/null # Error to dev null - -srun --exclusive -n 1 --mem-per-cpu=1000 -t 00:00:00 $HOME/get_info.py > 1.txt & -srun --exclusive -n 1 --mem-per-cpu=1000 -t 00:00:00 $HOME/get_info.py > 2.txt & -srun --exclusive -n 1 --mem-per-cpu=1000 -t 00:00:00 $HOME/get_info.py > 3.txt & -srun --exclusive -n 1 --mem-per-cpu=1000 -t 00:00:00 $HOME/get_info.py > 4.txt & - -wait ----- - -Place `get_info.py` in your `$HOME` directory and launch the job with the following command. - -[source,bash] ----- -sbatch my_job.sh ----- - -[IMPORTANT] -==== -Make sure to give your `get_info.py` script execute permissions. - -[source,bash] ----- -chmod +x get_info.py ----- -==== - -[IMPORTANT] -==== -Note that there is no `-c` option needed for `srun` commands anymore! In the previous project, you needed to specify `-c 1` (for example) to override the behavior _inherited_ from the "surrounding" job where the setting is `-c 4`. This is no longer needed because we've unset the environment variables that tell `srun` to inherit those settings. -==== - -Check out the contents of `1.txt`, `2.txt`, `3.txt`, and `4.txt`. Explain in as much detail as possible what resources (cpus) were allocated for the _job_, what resources (cpus and memory) were allocated for each _step_, and how the _jobs_ resources (cpus) effected the results of each _step_. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -I _hope_ that the previous question was helpful, and gave you at least 1 reliable way to write job scripts for embarrassingly parallel jobs, where you can predict what will happen. - -[NOTE] -==== -If at this point in time you are wondering "why would we do this when we can just use `joblib` and get 24 cores and power through some job?". The answer is because `joblib` will be limited to the number of cpus on the given node you are running your Python script on. SLURM allows us to allocate _well_ over 24 cpus, and has much higher computing potential! In addition to that, it is (arguably) easier to write a single threaded Python job to run on SLURM, than to parallelize your code using `joblib`. 
-====
-
-In the previous project, you were able to use the sha256 hash to efficiently find the extra image that the trickster Dr. Ward added to our dataset. Dr. Ward, knowing all about hashing algorithms, thinks he has a simple way to circumvent your work. In the "new" dataset: `/depot/datamine/data/coco/attempt02`, he has modified the value of a single pixel of his duplicate image.
-
-Re-run your SLURM job from the previous project on the _new_ dataset, and process the results to try to find the duplicate image. Was Dr. Ward's modification successful? Do your best to explain why or why not.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-Unfortunately, Dr. Ward was right, and our methodology didn't work. Luckily, there is a cool technique called perceptual hashing that is _almost_ meant just for this! Perceptual hashing is a technique that can be used to know whether or not any two images appear the same, without actually _viewing_ the images. The general idea is this. Given two images that are _essentially_ the same (maybe they have a few different pixels, have been cropped, gone through a filter, etc.), a perceptual hash can give you a very good idea whether the images are the "same" (or close enough). Of course, it is not a perfect tool, but it is most likely good enough for our purposes.
-
-To be a little more specific, two images are very likely the same if their perceptual hashes are the same. If two perceptual hashes are the same, their Hamming distance is 0. For example, if your hashes were `8f373714acfcf4d0` and `8f373714acfcf4d0`, the Hamming distance would be 0, because if you convert the hexadecimal values to binary, the values are identical at each position in the string of 0s and 1s. If 1 of the 0s and 1s didn't match after converting to binary, the Hamming distance would be 1.
-
-Use the https://github.com/JohannesBuchner/imagehash[`imagehash`] library, and modify your job script from the previous project to use perceptual hashing instead of the sha256 algorithm to produce 1 file for each image, where the filename remains the same as the original image, and the contents of the file contain the hash.
-
-[WARNING]
-====
-Make sure to clear out your SLURM environment variables before submitting your job to run with `sbatch`. If you are submitting the job from a terminal, run the following.
-
-[source,bash]
----
-for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done;
-sbatch my_job.sh
----
-
-If you are in a bash cell in Jupyter Lab, do the same.
-
-[source,ipython]
----
-%%bash
-
-for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done;
-sbatch my_job.sh
----
-====
-
-[IMPORTANT]
-==== 
-In order for the `imagehash` library to work, we need to make sure the libffi dependency is loaded up. Before executing the hash script in your `srun` command, first prepend `source /etc/profile.d/modules.sh; module use /scratch/brown/kamstut/tdm/opt/modulefiles; module load libffi/3.4.2;`. So, it should look something like:
-
-[source,bash]
----
-#!/bin/bash
-#SBATCH --account=datamine
-...other SBATCH options...
-
-source /etc/profile.d/modules.sh
-module use /scratch/brown/kamstut/tdm/opt/modulefiles
-module load libffi/3.4.2
-
-srun ... &
-
-wait
----
-
-In order for your hash script to find the `imagehash` library, we need to use our course Python environment.
To do that change your shebang line to this monster `#!/scratch/brown/kamstut/tdm/apps/jupyter/kernels/f2021-s2022/.venv/bin/python`, then, just run the script via `$HOME/my_script.py hash ...` -==== - -[TIP] -==== -To help get you going using this package, let me demonstrate using the package. - -[source,python] ----- -import imagehash -from PIL import Image - -my_hash = imagehash.phash(Image.open("/depot/datamine/data/coco/attempt02/000000000008.jpg")) -print(my_hash) # d16c8e9fe1600a9f -my_hash # numpy array of True (1) and False (0) values -my_hex = "d16c8e9fe1600a9f" -imagehash.hex_to_hash(my_hex) # numpy array of True (1) and False (0) ----- -==== - -[IMPORTANT] -==== -Make sure that you pass the hash as a string to the `output_file.write` method. So something like: `output_file.write(str(file_hash))`. -==== - -[IMPORTANT] -==== -Make sure that once you've written your script, `my_script.sh`, that you submit it to SLURM using `sbatch my_script.sh`, _not_ `./my_script.sh`. -==== - -[TIP] -==== -It would be a good idea to make sure you've modified your hash script to work properly with the `imagehash` library. Test out the script by running the following (assuming your Python code is called `hash.py`, and it is in your `$HOME` directory. - -[source,bash] ----- -$HOME/hash.py hash --output $HOME /depot/datamine/data/coco/attempt02/000000000008.jpg ----- - -This should produce a file, `$HOME/000000000008.jpg`, containing the hash of the image. -==== - -[WARNING] -==== -Make sure your `hash.py` script has execute permissions! - -[source,bash] ----- -chmod +x $HOME/hash.py ----- -==== - -[TIP] -==== -We've now posted the solutions to project 5 question 4. See xref:book:projects:39000-s2022-project05.adoc#question-4[here]. -==== - -Process the results (like in the previous project). Did you find the duplicate image? Explain what you think could have happened. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -What!?! That is pretty cool! You found the "wrong" duplicate image? Well, I guess it is totally fine to find multiple duplicates. Modify the code you used to find the duplicates so it finds all of the duplicates and originals. In total there should be 50. Display 2-5 of the pairs (or triplets or more). Can you see any of the subtle differences? Hopefully you find the results to be pretty cool! If you look, you _will_ find Dr. Wards hidden picture, but you do not have to exhaustively display all 50 images. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project07.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project07.adoc deleted file mode 100644 index a1125df18..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project07.adoc +++ /dev/null @@ -1,59 +0,0 @@ -= STAT 39000: Project 7 -- Spring 2022 - -**Motivation:** In this project we will continue to get familiar with SLURM, the job scheduler installed on our clusters at Purdue, including Brown. - -**Context:** This is the third in a series of 4 projects focused on parallel computing using SLURM and Python. - -**Scope:** SLURM, UNIX, Python - -.Learning Objectives -**** -- Use basic SLURM commands to submit jobs, check job status, kill jobs, and more. -- Understand the differences between srun and sbatch commands. -- Predict the resources (cpus and memory) an srun job will use based on the arguments and context. -- Write and use a job script to solve a problem faster than you would be able to without a high performance computing (HPC) system. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -You are free to use _any_ dataset you would like for this project, even if the data is created or collected by you, with the exception of question 1 which is just a small warm up question that may be fun to play with. - -== Questions - -=== Question 1 - -You've been exposed to a lot about SLURM in a short period of time. For this last question, we are going to let you go at your own pace. - -Think of a problem that you want to solve that may benefit from parallel computing and SLURM. It could be anything: processing many images in some way (counting pixels, applying filters, analyzing for certain objects, etc.), running many simulations to plot, bootstrapping a model to get point estimates for uncertainty quantification, calculating summary information about a large dataset, trying to guess a 6 character password, calculating the Hamming distance between all of the 123k images in the `coco/hashed02` dataset, etc. - -Solve your problem, or make progress towards solving your problem. The following are the loose requirements. As long as you meet these requirements, you will receive full credit. The idea is to get some experience, and have some fun. - -**Requirements:** - -. You must have an introductory paragraph clearly explaining your problem, and how you think using a cluster and SLURM can help you solve it. -. You must submit any and all code you wrote. It could be in any language you want, just put it in a code block in a Markdown cell. -. You must write and submit a job script to be submitted using `sbatch` on SLURM. This could be copy and pasted into a code block in a markdown cell. -. You must measure the time it takes to run your code on a sample of your data, and make a prediction for how long it will take using SLURM, based on the resources you requested in your job script. Write 1-2 sentences explaining how long the sample took and the math you used to predict how long you think SLURM will take. -. You must write 1-2 sentences explaining how close or far away your prediction was from the actual run time. - -The above requirements should be all kept in a Jupyter notebook. 
The notebook should take advantage of markdown formatting, and narrate a clear story, with a clear objective, and explanations of any struggles or issues you ran into along the way. - -[IMPORTANT] -==== -To not hammer our resources _too_ much, please don't request more than 20 cores, and if you use more than 10 cores, please make sure your jobs don't take _tons_ of time. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project08.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project08.adoc deleted file mode 100644 index 6af805945..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project08.adoc +++ /dev/null @@ -1,323 +0,0 @@ -= STAT 39000: Project 8 -- Spring 2022 -:page-mathjax: true - -**Motivation:** Machine learning and AI are huge buzzwords in industry, and two of the most popular tools surrounding said topics are the `pytorch` and `tensorflow` libraries -- `JAX` is another tool by Google growing in popularity. These tools are libraries used to build and use complex models. If available, they can take advantage of GPUs to speed up parallelizable code by a hundred or even thousand fold. - -**Context:** This is the first in a series of 4-5 projects focused on `pytorch` (and perhaps `JAX`). The purpose of these projects is to give you exposure to these tools, some basic functionality, and to show _why_ they are useful, without needing any special math or statistics background. - -**Scope:** Python, pytorch - -.Learning Objectives -**** -- Demystify a "tensor". -- Utilize the `pytorch` API to create, modify, and operate on tensors. -- Use simple, simulated data to create a multiple linear regression model using closed form solutions. -- Use `pytorch` to calculate a popular uncertainty quantification, the 95% confidence interval. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/sim/train.csv` -- `/depot/datamine/data/sim/test.csv` - -== Questions - -=== Question 1 - -While data frames are a great way to work with data, they are not the only way. Many high performance parts of code are written using a library like `numpy` or `pytorch`. These libraries are optimized to be extremely efficient with computation. Multiplying, transposing, inversing matrices can take time, and these libraries can make you code run blazingly fast. - -This project is the first project in a series of projects focused on the `pytorch` library. It is difficult to understand why a library like `pytorch` is useful without introducing _some_ math. This series of projects will involve some math, however, only at a very high level. 
Some intuition will be presented as notes, but what is really needed is the ability to read some formulas, and perform the appropriate computations. Throughout this series of projects, we will do our best to ensure that math or statistics is not at all a barrier to completing these projects and getting familiar with `pytorch`. If it does end up an issue, please post in Piazza and we will do our best to address any issues as soon as possible. - -This first project will start slowly, and only focus on the `numpy` -like functionality of `pytorch`. We've provided you with a set of 100 observations. 75 of the observations are in the `train.csv` file, 25 are in the `test.csv` file. We will build a regression model using the data in the `train.csv` file. In addition, we will calculate some other statistics. Finally, we will (optionally) test our model out on new data in the `test.csv` dataset. - -Start by reading the `train.csv` file into a `pytorch` tensor. - -[TIP] -==== -[source,python] ----- -import torch -import pandas as pd - -dat = pd.read_csv('/depot/datamine/data/sim/train.csv') -x_train = torch.tensor(dat['x'].to_numpy()) -y_train = torch.tensor(dat['y'].to_numpy()) ----- -==== - -[NOTE] -==== -A tensor is just a n-dimensional array. -==== - -Use `matplotlib` or `plotly` to plot the data on a scatterplot -- `x_train` on the x-axis, and `y_train` on the y-axis. After talking to your colleague, you agreed that the data is clearly following a 2nd order polynomial. Something like: - -$y = \beta_0 + \beta_1x + \beta_2x^2$ - -Our goal will be to estimate the values of $\beta_0$, $\beta_1$, and $\beta_2$ using the data in `x_train` and `y_train`. Then, we will have a model that could look something like: - -$y = 1.2 + .4x + 2.2x^2$ - -Then, for any given value of x, we can use our model to predict the value of y. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -In order to build our model, we need to estimate our parameters: $\beta_0$, $\beta_1$, and $\beta_2$. Luckily, linear regression has a closed form solution, so we can calculate these values directly with the following equation. - -$\hat{\beta} = (X^{T} X)^{-1} X^{T} y$ - -What do these symbols mean? X is a matrix (or tensor), where each column is a term in the polynomial, and each row is an observation. So, for our polynomial, if our X data was simply: 1, 2, 3, 4, the X matrix (or design matrix) would be the following: - -.X ----- -1, 1, 1 -1, 2, 4 -1, 3, 9 -1, 4, 16 ----- - -Here, the first column is the constant term, the second column is the term of x, the third column is the term of $x^2$, and so on. - -When we raise the matrix to the "T" this means to transpose the matrix. The transpose of X, for example, would look like: - -.X^T ----- -1, 1, 1, 1 -1, 2, 3, 4 -1, 3, 9, 16 ----- - -When we raise the matrix to the "-1" this means to invert the matrix. - -Finally, placing these matrices next to each other means we need to perform matrix multiplication. - -`pytorch` has built in functions to do all of these operations: `torch.mm`, `mat.T`, and `torch.inverse`. - -Lastly, `y` is the tensor containing the observations in `y_train`. - -[IMPORTANT] -==== -Tensors must be the correct dimensions before they can be multiplied together using `torch.mm`. By default, `x_train` and `y_train` will be a single row and 75 columns. In order to change this to be a single column and 75 rows, we would need to use the `reshape` method: `x_train.reshape(75,1)`. 
- -When doing matrix multiplication, it is important that the tensors are aligned properly. A 4x1 matrix would be a matrix that has 4 rows and 1 column (the first number always represents the number of rows, the second always represents the number of columns). - -In order to multiply 2 matrices together, the number of columns in the first matrix must equal the number of rows in the second matrix. The resulting matrix would then have the number of rows as the first matrix, and the number of columns of the second matrix. So, if we multiplied a 4x3 matrix with a 3x5 matrix, the result would be a 4x5 matrix. - -These rules are important, because the tensors must be the correct shape (correct number of rows and columns) before we perform matrix multiplication, otherwise we will get an error. - -The `reshape` method allows you to specify the number of rows and columns in the tensor, for example, `x_train.reshape(75,1)`, would result in a matrix with 75 rows and a single column. You will need to be careful to make sure your tensors are the correct shape before multiplication. -==== - -Start by creating a new tensor called `x_mat` that is 75 rows and 3 columns. The first column should be filled with 1's (using `torch.ones(x_train.shape[0]).reshape(75,1)`), the second column should be the values in `x_train`, the third column should be the values in `x_train` squared. Use `torch.cat` to combine the 75x1 tensors into a single 75x3 tensor (`x_mat`). - -[IMPORTANT] -==== -Make sure you reshape all of your tensors to be 75x1 _before_ you use `torch.cat` to combine them into a 75x3 tensor. -==== - -[TIP] -==== -Operations like addition and subtraction are vectorized. For example, the following would result in a 75x1 tensor of 2's. - -[source,python] ----- -x = torch.ones(75,1) -x*2 ----- - -The following would result in a 1x75 tensor of .5's. - -[source,python] ----- -x = torch.ones(1,75) -x/2 ----- -==== - -[TIP] -==== -Remember, in Python, you can use: - -[source,python] ----- -** ----- - -to raise a number to a power. For example $2^3$ would be - -[source,python] ----- -2**3 ----- -==== - -[TIP] -==== -To get the transpose of a tensor 2 dimension tensor in `pytorch` you could use `x_mat.T`, or `torch.transpose(x_mat, 0, 1)`, where 0 is the first dimension to transpose and 1 is the second dimension to transpose. -==== - -Calculate our estimates for $\beta_0$, $\beta_1$, and $\beta_2$, and save the values in a tensor called `betas`. The following should be the successful result. - -.results ----- -tensor([[ 4.3677], - [-1.7885], - [ 0.4840]], dtype=torch.float64) ----- - -Now that you know the values for $\beta_0$, $\beta_1$, and $\beta_2$, what is our model (as an equation)? It should be: - -$y = 4.3677-1.7885x+.4840x^2$ - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -That is pretty cool, and very fast. Now, for any given value of x, we can predict a value of y. Of course, we _could_ write a predict function that accepts a value x, and returns our prediction y, and apply that function to each of the x values in our `x_train` tensor, however, this can be accomplished even faster and more flexibly using matrix multiplication -- simply use the following formula: - -$\hat{y} = X\hat{\beta}$ - -Where X is the `x_mat` tensor from earlier, and $\hat{\beta}$ is the `betas` tensor from question (2). Use `torch.mm` to multiply the two matrices together. Save the resulting tensor to a variable called `y_predictions`. 
Finally, create side by side scatterplots. In the first scatterplot, put the values in `x_train` on the x-axis and the values of `y_train` on the y-axis. In the second scatterplot put the values of `x_train` on the x-axis, and your predictions (`y_predictions`) on the y-axis. - -Very cool! Your model should be killing it (after all, we generated this data to follow a known distribution). - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -To better understand our model, let us create one of the most common forms of uncertainty quantification, confidence intervals. Confidence intervals (95% confidence intervals) show you the range of values (for each x) where we are 95% confident that the average value y for a given x is within the range. - -The formula is the following: - -$\hat{y_h} \pm t_{(\alpha/2, n-p)} * \sqrt{MSE * diag(x_h(X^{T} X)^{-1} x_h^{T})}$ - -$MSE = \frac{1}{n-p}\sum_{i=1}^{n}(Y_i - \hat{Y_i})^2$ - -Since we are calculating the 95% confidence interval for the values of x in our `x_train` tensor, we can simplify this to: - -$\hat{Y} \pm 1.993464 * \sqrt{MSE * diag(X(X^{T} X)^{-1} X^{T})}$ - -$\frac{1}{72}\sum_{i=1}^{n}(Y_i - \hat{Y_i})^2$ - -Where: - -- $\hat{Y}$ is our `y_predictions` tensor from question (3). -- $Y_i$ is the value of y for the ith value of `y_train`. -- $\hat{Y_i}$ is the value of y for the ith value of `y_predictions`. -+ -[TIP] -==== -You could simply sum the results of subtracting the `y_predictions` tensor from the `y_train` tensor, squared. You don't need any loop. -==== -+ -- p is the number of parameters in our model (3, the constant, the x, and the $x^2$). -- n is the number of observations in our data set (75). - -[TIP] -==== -The "diag" part of the formula indicates that we want the _diagonal_ of the resulting matrix. The diagonal of a given nxn matrix is the value at location (1,1), (2,2), (3,3), ..., (n,n). So, for instance, the diagonal of the following matrix is: 1, 5, 9 - -.matrix ----- -1,2,3 -4,5,6 -7,8,9 ----- - -In `pytorch`, you can get this using `torch.diag(x)`, where x is the matrix you want the diagonal of. - -[source,python] ----- -test = torch.tensor([1,2,3,4,5,6,7,8,9]).reshape(3,3) -torch.diag(test) ----- -==== - -[TIP] -==== -You can use `torch.sum` to sum up the values in a tensor. -==== - -[TIP] -==== -The value for MSE should be 135.5434. - -The first 5 values of the `upper` confidence interval are: - -.upper ----- -tensor([[171.3263], - [ 91.9131], - [ 83.3474], - [ 63.8171], - [ 63.0524]], dtype=torch.float64) ----- - -The first 5 values of the `lower` confidence interval are: - -.lower ----- -tensor([[140.6660], - [ 76.2350], - [ 69.1461], - [ 52.7601], - [ 52.1101]], dtype=torch.float64) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Create a scatterplot of `x_train` on the x-axis, and `y_predictions` on the y-axis. Add the confidence intervals to the plot. - -Great! It is unsurprising that our model is a great fit. - -[TIP] -==== -See https://matplotlib.org/3.5.1/api/_as_gen/matplotlib.pyplot.fill_between.html[here] for the documentation on `fill_between`. This function can be used to shade from the lower to upper confidence bounds. Use this function after you've https://matplotlib.org/3.5.1/api/_as_gen/matplotlib.pyplot.plot.html[plotted] your values of x (`x_mat[:, 1]`) on the x-axis and values of `y_predictions` on your y-axis. 
-==== - -[NOTE] -==== -In this project, we explored a well known model using simulated data from a known distribution. It is pretty boring, but boring can also make things a bit easier to understand. - -To give a bit of perspective, this project focused on tensor operations so you could get used to `pytorch`. The power of `pytorch` starts to really show itself when the problems do not have a closed form solution. In the _next_ project, we will use an algorithm called gradient descent to estimate our parameters (instead of using the closed form solutions). Since gradient descent, and algorithms like it are used frequently, it will give you a good sense on _why_ `pytorch` is useful. In addition, because we solved this problem using the closed form solutions, we will be able to easily verify that our work in the _next_ project is working as intended! - -Lastly, in more complex situations, you may not have formulas to calculate confidence intervals and other uncertaintly quantification measures. We will use SLURM in combination with `pytorch` to resample our data and calculate point estimates, which can then be used to understand the variability. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project09-teachingprogramming.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project09-teachingprogramming.adoc deleted file mode 100644 index b340d9a01..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project09-teachingprogramming.adoc +++ /dev/null @@ -1,364 +0,0 @@ -= STAT 39000: Project 9 -- Spring 2022 -:page-mathjax: true - -**Motivation:** Machine learning and AI are huge buzzwords in industry, and two of the most popular tools surrounding said topics are the `pytorch` and `tensorflow` libraries — `JAX` is another tool by Google growing in popularity. These tools are libraries used to build and use complex models. If available, they can take advantage of GPUs to speed up parallelizable code by a hundred or even thousand fold. - -**Context:** This is the second in a series of 4-5 projects focused on pytorch (and perhaps JAX). The purpose of these projects is to give you exposure to these tools, some basic functionality, and to show why they are useful, without needing any special math or statistics background. - -**Scope:** Python, pytorch - -.Learning Objectives -**** -- Demystify a "tensor". -- Utilize the `pytorch` API to create, modify, and operate on tensors. -- Use simple, simulated data to create a multiple linear regression model using closed form solutions. -- Use `pytorch` to calculate a popular uncertainty quantification, the 95% confidence interval. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/sim/train.csv` -- `/depot/datamine/data/sim/test.csv` - -== Questions - -=== Question 1 - -[WARNING] -==== -If you did not attempt the previous project, some of the novelty of `pytorch` may be lost. The following is a note found at the end of question 5 from the previous project. - -In this project, we explored a well known model using simulated data from a known distribution. It was pretty boring, but boring can also make things a bit easier to understand. - -To give a bit of perspective, the previous project focused on tensor operations so you could get used to `pytorch`. The power of `pytorch` really starts to show itself when the problem you are facing does not have a closed form solution. In _this_ project, we will use an algorithm called gradient descent to estimate our parameters (instead of using the closed form solutions). Since gradient descent is an algorithm and not a technique that offers a simple closed form solutions, and algorithms like gradient descent are used frequently, this project will _hopefully_ give you a good sense on _why_ `pytorch` is useful. In addition, since we fit a regression model using a closed form solution in the previous project, we will be able to easily verify that our work in _this_ project is working as intended! - -Lastly, in more complex situations, you may not have formulas to calculate confidence intervals and other uncertainty quantification measures. In the _next_ project, we will use SLURM in combination with `pytorch` to re-sample our data and calculate point estimates, which can then be used to understand the variability. -==== - -[NOTE] -==== -This project will _show_ more calculus than you need to know or understand for this course. It is included for those who are interested, and so the reader can see "oh my, that is a lot of work we are avoiding!". Don't worry _at all_, is is not necessary to understand for this course. -==== - -Start by reading in your `train.csv` data into tensors called `x_train` and `y_train`. - -[source,python] ----- -import pandas as pd -import torch - -dat = pd.read_csv("/depot/datamine/data/sim/train.csv") -x_train = torch.tensor(dat['x'].to_numpy()) -y_train = torch.tensor(dat['y'].to_numpy()) ----- - -In the previous project, we estimated the parameters of our regression model using a closed form solution. What does this do? At the heart of the regression model, we are _minimizing_ our _loss_. Typically, this _loss_ is the mean squared error (MSE). The formula for MSE is: - -$MSE = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y_i})^2$ - -[NOTE] -==== -You can think of MSE as the difference between the actual y values (from the training data) and the y values our model predicts, squared, summed, and then divided by $n$, or the number of observations. Larger differences, say a difference of 10, is given a stronger penalty (100) than say, a difference of 5 (25). In this way, MSE as the loss function, tries to make the _overall_ predictions good. -==== - -Using our closed form solution formulas, we can calculate the parameters such that the MSE is minimized over the entirety of our training data. This time, we will use gradient descent to iteratively calculate our parameter estimates! 
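-
-[NOTE]
-====
-As a purely illustrative aside (not part of the required deliverables), the MSE for any candidate parameter values can be computed directly from the tensors above. This is only a minimal sketch; the names `beta0`, `beta1`, and `beta2` and the guesses 5, 4, and 3 are assumptions chosen to match the starting values suggested later in this question.
-
-[source,python]
----
-# assumes x_train and y_train were created as shown above
-beta0, beta1, beta2 = 5.0, 4.0, 3.0  # candidate parameter values (arbitrary guesses)
-
-# predictions of the 2nd order polynomial model for every x in x_train
-y_predictions = beta0 + beta1*x_train + beta2*x_train**2
-
-# mean squared error: average of the squared differences
-mse = ((y_train - y_predictions)**2).mean()
-print(mse)
----
-====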
-
-By plotting our data, we can see that our data is parabolic and follows the general form:
-
-$y = \beta_{0} + \beta_{1} x + \beta_{2} x^{2}$
-
-If we substitute this into our formula for MSE, we get:
-
-$MSE = \frac{1}{n} \sum_{i=1}^{n} ( Y_{i} - ( \beta_{0} + \beta_{1} x_{i} + \beta_{2} x_{i}^{2} ) )^{2} = \frac{1}{n} \sum_{i=1}^{n} ( Y_{i} - \beta_{0} - \beta_{1} x_{i} - \beta_{2} x_{i}^{2} )^{2}$
-
-The first step in gradient descent is to calculate the partial derivatives with respect to each of our parameters: $\beta_0$, $\beta_1$, and $\beta_2$.
-
-These derivatives will let us know the _slope_ of the tangent line for the given parameter with the given value. We can then _use_ this slope to adjust our parameter, and eventually reach a parameter value that minimizes our _loss_ function. Here is the calculus.
-
-$\frac{\partial MSE}{\partial \beta_0} = \frac{\partial MSE}{\partial \hat{y_i}} * \frac{\partial \hat{y_i}}{\partial \beta_0}$
-
-$\frac{\partial MSE}{\partial \hat{y_i}} = 1$
-
-$\frac{\partial \hat{y_i}}{\partial \beta_0} = 2(\beta_0 + \beta_1 x + \beta_2 x^2 - y_i)$
-
-$\frac{\partial MSE}{\partial \beta_1} = \frac{\partial MSE}{\partial \hat{y_i}} * \frac{\partial \hat{y_i}}{\partial \beta_1}$
-
-$\frac{\partial \hat{y_i}}{\partial \beta_1} = 2x(\beta_0 + \beta_1 x + \beta_2 x^2 - y_i)$
-
-$\frac{\partial MSE}{\partial \beta_2} = \frac{\partial MSE}{\partial \hat{y_i}} * \frac{\partial \hat{y_i}}{\partial \beta_2}$
-
-$\frac{\partial \hat{y_{i}}}{\partial \beta_{2}} = 2x^{2} (\beta_{0} + \beta_{1} x + \beta_{2} x^{2} - y_{i})$
-
-If we clean things up a bit, we can see that the partial derivatives are:
-
-$\frac{\partial MSE}{\partial \beta_0} = \frac{-2}{n}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x - \beta_2 x^2) = \frac{-2}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})$
-
-$\frac{\partial MSE}{\partial \beta_1} = \frac{-2}{n}\sum_{i=1}^{n}x(y_i - \beta_0 - \beta_1 x - \beta_2 x^2) = \frac{-2}{n}\sum_{i=1}^{n}x(y_i - \hat{y_i})$
-
-$\frac{\partial MSE}{\partial \beta_{2}} = \frac{-2}{n}\sum_{i=1}^{n} x^{2} (y_{i} - \beta_{0} - \beta_{1} x - \beta_{2} x^{2}) = \frac{-2}{n}\sum_{i=1}^{n} x^{2} (y_{i} - \hat{y_{i}})$
-
-Pick 3 random values -- 1 for each parameter, $\beta_0$, $\beta_1$, and $\beta_2$. For consistency, let's try 5, 4, and 3, respectively. These values will be our random "guess" as to the actual values of our parameters. Using those starting values, calculate the partial derivative for each parameter.
-
-[TIP]
-====
-Start by calculating `y_predictions` using the formula: $\beta_0 + \beta_1x + \beta_2x^2$, where $x$ is your `x_train` tensor!
-====
-
-[TIP]
-====
-You should now have tensors `x_train`, `y_train`, and `y_predictions`. You can create another new tensor called `error` by subtracting `y_predictions` from `y_train`.
-====
-
-[TIP]
-====
-You can use your tensors and the `mean` method to (help) calculate each of these partial derivatives! Note that these values could vary from person to person depending on the random starting values you gave each of your parameters.
-====
-
-Okay, once you have your 3 partial derivatives, we can _update_ our 3 parameters using those values! Remember, those values are the _slope_ of the tangent line for each of the parameters at the corresponding parameter value. If by _increasing_ a parameter value we _increase_ our MSE, then we want to _decrease_ our parameter value, as this will _decrease_ our MSE. If by _increasing_ a parameter value we _decrease_ our MSE, then we want to _increase_ our parameter value, as this will _decrease_ our MSE. This can be represented, for example, by the following:
-
-$\beta_0 = \beta_0 - \frac{\partial MSE}{\partial \beta_0}$
-
-This will, however, potentially result in too big of a "jump" in our parameter value -- we may skip over the value of $\beta_0$ for which our MSE is minimized (this is no good). In order to "fix" this, we introduce a "learning rate", often shown as $\eta$. This learning rate can be tweaked to either ensure we don't make too big of a "jump", by setting it to be small, or to increase the speed at which we _converge_ to a value of $\beta_0$ for which our MSE is minimized, by making it a bit larger, at the risk of over-jumping.
-
-$\beta_0 = \beta_0 - \eta \frac{\partial MSE}{\partial \beta_0}$
-
-Update your 3 parameters using a learning rate of $\eta = 0.0003$.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Woohoo! That was a _lot_ of work for what ended up being some pretty straightforward calculations. The previous question represented a single _epoch_. You can define the number of epochs yourself; the idea is that _hopefully_, after all of your epochs, the parameters will have converged, leaving you with the parameter estimates you can use to calculate predictions!
-
-Write code that runs 10000 epochs, updating your parameters as it goes. In addition, include code in your loop that prints out the MSE every 100th epoch. Remember, we are trying to _minimize_ our MSE -- so we would expect that the MSE _decreases_ each epoch.
-
-Print the final values of your parameters -- are the values close to the values you estimated in the previous project?
-
-In addition, approximately how many epochs did it take for the MSE to stop decreasing by a significant amount? Based on that result, do you think we could have run fewer epochs?
-
-[NOTE]
-====
-Mess around with the starting values of your parameters, and the learning rate. You will quickly notice that bad starting values can result in final results that are not very good. A learning rate that is too large will diverge, resulting in `nan`. A learning rate that is too small won't learn fast enough, resulting in parameter values that aren't accurate.
-
-The learning rate is a hyperparameter -- a parameter that is chosen _before_ the training process begins. The number of epochs is also a hyperparameter. Choosing good hyperparameters can be critical, and there are a variety of methods to help "tune" hyperparameters. For this project, we know that these values work well.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 3
-
-You may be thinking at this point that `pytorch` has been pretty worthless, and that it still doesn't make any sense how this simplifies anything. There was too much math, and we still performed a bunch of vector/tensor/matrix operations -- what gives? Well, while this is all true, we haven't utilized `pytorch` quite yet, but we are going to very soon.
-
-First, let's cover some common terminology you may run across. In each epoch, when we calculate the newest predictions for our most up-to-date parameter values, we are performing the _forward pass_.
-
-There is a similarly named _backward pass_ that refers (roughly) to the step where the partial derivatives are calculated! Great.
-
-`pytorch` can perform the _backward pass_ for you, automatically, from our MSE. For example, see the following.
- -[source,python] ----- -mse = (error**2).mean() -mse.backward() ----- - -Try it yourself! - -[TIP] -==== -If you get an error: - -.error ----- -RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn ----- - -This is likely due to the fact that your starting values aren't tensors! Instead, use tensors. - -[source,python] ----- -beta0 = torch.tensor(5) -beta1 = torch.tensor(4) -beta2 = torch.tensor(3) ----- - -What? We _still_ get that error. In order for the `backward` method to work, and _automatically_ (yay!) calculate our partial derivatives, we need to make sure that our starting value tensors are set to be able to store the partial derivatives. We can do this very easily by setting the `requires_grad=True` option when creating the tensors. - -[source,python] ----- -beta0 = torch.tensor(5, requires_grad=True) -beta1 = torch.tensor(4, requires_grad=True) -beta2 = torch.tensor(3, requires_grad=True) ----- - -You probably got the following error now. - -.error ----- -RuntimeError: Only Tensors of floating point and complex dtype can require gradients ----- - -Well, let's set the dtype to be `torch.float` and see if that does the trick, then. - -[source,python] ----- -beta0 = torch.tensor(5, requires_grad=True, dtype=torch.float) -beta1 = torch.tensor(4, requires_grad=True, dtype=torch.float) -beta2 = torch.tensor(3, requires_grad=True, dtype=torch.float) ----- - -Great! Unfortunately, after you try to run your epochs, you will likely get the following error. - -.error ----- -TypeError: unsupported operand type(s) for *: 'float' and 'NoneType' ----- - -This is because your `beta0.grad`, `beta1.grad` are None -- why? The partial derivatives (or gradients) are stored in the `beta0`, `beta1`, and `beta2` tensors. If you performed a parameter update as follows. - -[source,python] ----- -beta0 = beta0 - learning_rate * beta0.grad ----- - -The _new_ `beta0` object will have _lost_ the partial derivative information, and the `beta0.grad` will be `None`, causing the error. How do we get around this? We can use a Python _inplace_ operation. An _inplace_ operation will actually _update_ our _original_ `beta0` (_with_ the gradients already saved), instead of creating a brand new `beta0` that loses the gradient. You've probably already seen examples of this in the wild. - -[source,python] ----- -# these are equivalent -a = a - b -a -= b - -# or -a = a * b -a *= b - -# or -a = a + b -a += b - -# etc... ----- - -At this point in time, you are probably _once again_ getting the following error. - -.error ----- -RuntimeError: a leaf Variable that requires grad is being used in an in-place operation. ----- - -This too is an easy fix, simply wrap your update lines in a `with torch.no_grad():` block. - -[source,python] ----- -with torch.no_grad(): - beta0 -= ... - beta1 -= ... - beta2 -= ... ----- - -Woohoo! Finally! But... you may notice (if you are printing your MSE) that the MSE is all over the place and not decreasing like we would expect. This is because the gradients are summed up each iteration unless your clear the gradient out! For example, if during the first epoch the gradient is 603, and the next epoch it is -773. If you do _not_ zero out the gradient, your new gradient after the second epoch will be -169, when we really want -773. To fix _this_, use the `zero_` method from the `grad` attribute. Zero out all of your gradients at the end of each epoch and try again. - -[source,python] ----- -beta0.grad.zero_() ----- - -Finally! It should all be looking good right now. 
Okay, so `pytorch` is quite particular, _but_ the power of the automatic differentiation can't be overstated. -==== - -[IMPORTANT] -==== -Make sure and make a post on Piazza if you'd like some extra help or think there is a question that could use more attention. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Whoa! That is crazy powerful! That _greatly_ reduces the amount of work we need to do. We didn't use our partial derivative formulas anywhere, how cool! - -But wait, there's more! You know that step where we update our parameters at the end of each epoch? Think about a scenario where, instead of simply 3 parameters, we had 1000 parameters to update. That would involve a linear increase in the number of lines of code we would need to write -- instead of just 3 lines of code to update our 3 parameters, we would need 1000! Not something most folks are interested in doing. `pytorch` to the rescue. - -We can use an _optimizer_ to perform the parameter updates, all at once! Update your code to utilize an optimizer to perform the parameter updates. - -There are https://pytorch.org/docs/stable/optim.html[a variety] of different optimizers available. For this project, let's use the `SGD` optimizer. You can see the following example, directly from the linked webpage. - -[source,python] ----- -optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9) -optimizer.zero_grad() -loss_fn(model(input), target).backward() -optimizer.step() ----- - -Here, you can just focus on the following lines. - -[source,python] ----- -optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9) -optimizer.step() ----- - -The first line is the initialization of the optimizer. Here, you really just need to pass our initialized paramters (the betas) as a list to the first argument to `optim.SGD`. The second argument, `lr`, should just be our learning rate (`0.0003`). - -Then, the second line replaces the code where the three parameters are updated. - -[NOTE] -==== -You will no longer need the `with torch.no_grad()` block at all! This completely replaces that code. -==== - -[TIP] -==== -In addition, you can use the optimizer to clear out the gradients as well! Replace the `zero_` methods with the `zero_grad` method of the optimizer. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -You are probably starting to notice how `pytorch` can _really_ simplify things. But wait, there's more! - -In each epoch, you are still calculating the loss manually. Not a huge deal, but it could be a lot of work, and MSE is not the _only_ type of loss function. Use `pytorch` to create your MSE loss function, and use it instead of your manual calculation. - -You can find `torch.nn.MSELoss` documentation https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss[here]. Use the option `reduction='mean'` to get the mean MSE loss. Once you've created your loss function, simply pass your `y_train` as the first argument and your `y_predictions` as the second argument. Very cool! This has been a lot to work on -- the main takeaways here should be that `pytorch` has the capability of greatly simplifying code (and calculus!) like the code used for the gradient descent algorithm. At the same time, `pytorch` is particular, the error messages aren't extremely clear, and it definitely involves a learning curve. 
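-
-[NOTE]
-====
-For reference, here is one possible minimal sketch of how the pieces from this project could fit together -- it is not the required solution, just an illustration. The variable names and the explicit `dtype=torch.float` conversions are assumptions made for the example (keeping every tensor the same dtype avoids dtype-mismatch errors), and since MSE is symmetric, the argument order passed to the loss function does not change the result.
-
-[source,python]
----
-import pandas as pd
-import torch
-
-# read the training data into tensors (as in Question 1)
-dat = pd.read_csv("/depot/datamine/data/sim/train.csv")
-x_train = torch.tensor(dat['x'].to_numpy(), dtype=torch.float)
-y_train = torch.tensor(dat['y'].to_numpy(), dtype=torch.float)
-
-# starting guesses for the parameters, with gradients enabled
-beta0 = torch.tensor(5, requires_grad=True, dtype=torch.float)
-beta1 = torch.tensor(4, requires_grad=True, dtype=torch.float)
-beta2 = torch.tensor(3, requires_grad=True, dtype=torch.float)
-
-loss_fn = torch.nn.MSELoss(reduction='mean')
-optimizer = torch.optim.SGD([beta0, beta1, beta2], lr=0.0003)
-
-for epoch in range(10000):
-    # forward pass: predictions for the current parameter values
-    y_predictions = beta0 + beta1*x_train + beta2*x_train**2
-
-    # loss, backward pass (automatic differentiation), and parameter update
-    loss = loss_fn(y_train, y_predictions)
-    loss.backward()
-    optimizer.step()
-    optimizer.zero_grad()  # clear gradients before the next epoch
-
-    if epoch % 100 == 0:
-        print(f"epoch {epoch}: MSE {loss.item()}")
-
-print(beta0, beta1, beta2)
----
-====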
- -We've barely scraped the surface of `pytorch` -- there is (always) a _lot_ more to learn! In the next project, we will provide you with the opportunity to utilize a GPU to speed up calculations, and SLURM to parallelize some costly calculations. - -[NOTE] -==== -In the next project we will use `pytorch` to build a model to simplify our code even more, in addition, we will incorporate SLURM and use a GPU to train our model. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project09.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project09.adoc deleted file mode 100644 index b340d9a01..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project09.adoc +++ /dev/null @@ -1,364 +0,0 @@ -= STAT 39000: Project 9 -- Spring 2022 -:page-mathjax: true - -**Motivation:** Machine learning and AI are huge buzzwords in industry, and two of the most popular tools surrounding said topics are the `pytorch` and `tensorflow` libraries — `JAX` is another tool by Google growing in popularity. These tools are libraries used to build and use complex models. If available, they can take advantage of GPUs to speed up parallelizable code by a hundred or even thousand fold. - -**Context:** This is the second in a series of 4-5 projects focused on pytorch (and perhaps JAX). The purpose of these projects is to give you exposure to these tools, some basic functionality, and to show why they are useful, without needing any special math or statistics background. - -**Scope:** Python, pytorch - -.Learning Objectives -**** -- Demystify a "tensor". -- Utilize the `pytorch` API to create, modify, and operate on tensors. -- Use simple, simulated data to create a multiple linear regression model using closed form solutions. -- Use `pytorch` to calculate a popular uncertainty quantification, the 95% confidence interval. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/sim/train.csv` -- `/depot/datamine/data/sim/test.csv` - -== Questions - -=== Question 1 - -[WARNING] -==== -If you did not attempt the previous project, some of the novelty of `pytorch` may be lost. The following is a note found at the end of question 5 from the previous project. - -In this project, we explored a well known model using simulated data from a known distribution. It was pretty boring, but boring can also make things a bit easier to understand. - -To give a bit of perspective, the previous project focused on tensor operations so you could get used to `pytorch`. The power of `pytorch` really starts to show itself when the problem you are facing does not have a closed form solution. 
In _this_ project, we will use an algorithm called gradient descent to estimate our parameters (instead of using the closed form solutions). Since gradient descent is an algorithm and not a technique that offers a simple closed form solutions, and algorithms like gradient descent are used frequently, this project will _hopefully_ give you a good sense on _why_ `pytorch` is useful. In addition, since we fit a regression model using a closed form solution in the previous project, we will be able to easily verify that our work in _this_ project is working as intended! - -Lastly, in more complex situations, you may not have formulas to calculate confidence intervals and other uncertainty quantification measures. In the _next_ project, we will use SLURM in combination with `pytorch` to re-sample our data and calculate point estimates, which can then be used to understand the variability. -==== - -[NOTE] -==== -This project will _show_ more calculus than you need to know or understand for this course. It is included for those who are interested, and so the reader can see "oh my, that is a lot of work we are avoiding!". Don't worry _at all_, is is not necessary to understand for this course. -==== - -Start by reading in your `train.csv` data into tensors called `x_train` and `y_train`. - -[source,python] ----- -import pandas as pd -import torch - -dat = pd.read_csv("/depot/datamine/data/sim/train.csv") -x_train = torch.tensor(dat['x'].to_numpy()) -y_train = torch.tensor(dat['y'].to_numpy()) ----- - -In the previous project, we estimated the parameters of our regression model using a closed form solution. What does this do? At the heart of the regression model, we are _minimizing_ our _loss_. Typically, this _loss_ is the mean squared error (MSE). The formula for MSE is: - -$MSE = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y_i})^2$ - -[NOTE] -==== -You can think of MSE as the difference between the actual y values (from the training data) and the y values our model predicts, squared, summed, and then divided by $n$, or the number of observations. Larger differences, say a difference of 10, is given a stronger penalty (100) than say, a difference of 5 (25). In this way, MSE as the loss function, tries to make the _overall_ predictions good. -==== - -Using our closed form solution formulas, we can calculate the parameters such that the MSE is minimized over the entirety of our training data. This time, we will use gradient descent to iteratively calculate our parameter estimates! - -By plotting our data, we can see that our data is parabolic and follows the general form: - -$y = \beta_{0} + \beta_{1} x + \beta_{2} x^{2}$ - -If we substitute this into our formula for MSE, we get: - -$MSE = \frac{1}{n} \sum_{i=1}^{n} ( Y_{i} - ( \beta_{0} + \beta_{1} x_{i} + \beta_{2} x_{i}^{2} ) )^{2} = \frac{1}{n} \sum_{i=1}^{n} ( Y_{i} - \beta_{0} - \beta_{1} x_{i} - \beta_{2} x_{i}^{2} )^{2}$ - -The first step in gradient descent is to calculate the partial derivatives with respect to each of our parameters: $\beta_0$, $\beta_1$, and $\beta_2$. - -These derivatives will let us know the _slope_ of the tangent line for the given parameter with the given value. We can then _use_ this slope to adjust our parameter, and eventually reach a parameter value that minimizes our _loss_ function. Here is the calculus. 
- -$\frac{\partial MSE}{\partial \beta_0} = \sum_{i=1}^{n}\frac{\partial MSE}{\partial \hat{y_i}} * \frac{\partial \hat{y_i}}{\partial \beta_0}$ - -$\frac{\partial MSE}{\partial \hat{y_i}} = \frac{-2}{n}(y_i - \hat{y_i}) = \frac{-2}{n}(y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2)$ - -$\frac{\partial \hat{y_i}}{\partial \beta_0} = 1$ - -$\frac{\partial MSE}{\partial \beta_1} = \sum_{i=1}^{n}\frac{\partial MSE}{\partial \hat{y_i}} * \frac{\partial \hat{y_i}}{\partial \beta_1}$ - -$\frac{\partial \hat{y_i}}{\partial \beta_1} = x_i$ - -$\frac{\partial MSE}{\partial \beta_2} = \sum_{i=1}^{n}\frac{\partial MSE}{\partial \hat{y_i}} * \frac{\partial \hat{y_i}}{\partial \beta_2}$ - -$\frac{\partial \hat{y_{i}}}{\partial \beta_{2}} = x_{i}^{2}$ - -If we clean things up a bit, we can see that the partial derivatives are: - -$\frac{\partial MSE}{\partial \beta_0} = \frac{-2}{n}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2) = \frac{-2}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})$ - -$\frac{\partial MSE}{\partial \beta_1} = \frac{-2}{n}\sum_{i=1}^{n}x_i(y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2) = \frac{-2}{n}\sum_{i=1}^{n}x_i(y_i - \hat{y_i})$ - -$\frac{\partial MSE}{\partial \beta_{2}} = \frac{-2}{n}\sum_{i=1}^{n} x_{i}^{2} (y_{i} - \beta_{0} - \beta_{1} x_{i} - \beta_{2} x_{i}^{2}) = \frac{-2}{n}\sum_{i=1}^{n} x_{i}^{2} (y_{i} - \hat{y_{i}})$ - -Pick 3 random values -- 1 for each parameter, $\beta_0$, $\beta_1$, and $\beta_2$. For consistency, let's try 5, 4, and 3 respectively. These values will be our random "guess" as to the actual values of our parameters. Using those starting values, calculate the partial derivative for each parameter. - -[TIP] -==== -Start by calculating `y_predictions` using the formula: $\beta_0 + \beta_1x + \beta_2x^2$, where $x$ is your `x_train` tensor! -==== - -[TIP] -==== -You should now have tensors `x_train`, `y_train`, and `y_predictions`. You can create another new tensor called `error` by subtracting `y_predictions` from `y_train`. -==== - -[TIP] -==== -You can use your tensors and the `mean` method to (help) calculate each of these partial derivatives! Note that these values could vary from person to person depending on the random starting values you gave each of your parameters. -==== - -Okay, once you have your 3 partial derivatives, we can _update_ our 3 parameters using those values! Remember, those values are the _slope_ of the tangent line for each of the parameters for the corresponding parameter value. If by _increasing_ a parameter value we _increase_ our MSE, then we want to _decrease_ our parameter value as this will _decrease_ our MSE. If by _increasing_ a parameter value we _decrease_ our MSE, then we want to _increase_ our parameter value as this will _decrease_ our MSE. This can be represented, for example, by the following: - -$\beta_0 = \beta_0 - \frac{\partial MSE}{\partial \beta_0}$ - -This will, however, potentially result in too big of a "jump" in our parameter value -- we may skip over the value of $\beta_0$ for which our MSE is minimized (this is no good). In order to "fix" this, we introduce a "learning rate", often shown as $\eta$. This learning rate can be tweaked to either ensure we don't make too big of a "jump" by setting it to be small, or by making it a bit larger, increasing the speed at which we _converge_ to a value of $\beta_0$ for which our MSE is minimized, at the risk of overshooting. - -$\beta_0 = \beta_0 - \eta \frac{\partial MSE}{\partial \beta_0}$ - -Update your 3 parameters using a learning rate of $\eta = 0.0003$. 
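To make the shape of a single update concrete, here is a minimal sketch -- it assumes the `x_train` and `y_train` tensors read in above, and your variable names (and exact code) may differ.

[source,python]
----
learning_rate = 0.0003

# starting "guesses" for the parameters
beta0, beta1, beta2 = 5.0, 4.0, 3.0

# predictions using the current parameter values
y_predictions = beta0 + beta1*x_train + beta2*x_train**2
error = y_train - y_predictions

# the three partial derivatives, using the cleaned up formulas above
d_beta0 = -2 * error.mean()
d_beta1 = -2 * (x_train * error).mean()
d_beta2 = -2 * (x_train**2 * error).mean()

# update each parameter in the direction that decreases the MSE
beta0 = beta0 - learning_rate * d_beta0
beta1 = beta1 - learning_rate * d_beta1
beta2 = beta2 - learning_rate * d_beta2
----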
- -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Woohoo! That was a _lot_ of work for what ended up being some pretty straightforward calculations. The previous question represented a single _epoch_. You can define the number of epochs yourself, the idea is that _hopefully_ after all of your epochs, the parameters will have converged, leaving your with the parameter estimates you can use to calculate predictions! - -Write code that runs 10000 epochs, updating your parameters as it goes. In addition, include code in your loops that prints out the MSE every 100th epoch. Remember, we are trying to _minimize_ our MSE -- so we would expect that the MSE _decreases_ each epoch. - -Print the final values of your parameters -- are the values close to the values you estimated in the previous project? - -In addition, approximately how many epochs did it take for the MSE to stop decreasing by a significant amount? Based on that result, do you think we could have run fewer epochs? - -[NOTE] -==== -Mess around with the starting values of your parameters, and the learning rate. You will quickly notice that bad starting values can result in final results that are not very good. A learning rate that is too large will diverge, resulting in `nan`. A learning rate that is too small won't learn fast enough resulting in parameter values that aren't accurate. - -The learning rate is a hyperparameter -- a parameter that is chosen _before_ the training process begins. The number of epochs is also a hyperparameter. Choosing good hyperparameters can be critical, and there are a variety of methods to help "tune" hyperparameters. For this project, we know that these values work well. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -You may be wondering think at this point that `pytorch` has been pretty worthless, and it still doesn't make any sense how this simplifies anything. There was too much math, and we still performed a bunch of vector/tensor/matrix operations -- what gives? Well, while this is all true, we haven't utilized `pytorch` quite yet, but we are going to here soon. - -First, let's cover some common terminology you may run across. In each epoch, when we calculate the newest predictions for our most up-to-date parameter values, we are performing the _forward pass_. - -There is a similarly named _backward pass_ that refers (roughly) to the step where the partial derivatives are calculated! Great. - -`pytorch` can perform the _backward pass_ for you, automatically, from our MSE. For example, see the following. - -[source,python] ----- -mse = (error**2).mean() -mse.backward() ----- - -Try it yourself! - -[TIP] -==== -If you get an error: - -.error ----- -RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn ----- - -This is likely due to the fact that your starting values aren't tensors! Instead, use tensors. - -[source,python] ----- -beta0 = torch.tensor(5) -beta1 = torch.tensor(4) -beta2 = torch.tensor(3) ----- - -What? We _still_ get that error. In order for the `backward` method to work, and _automatically_ (yay!) calculate our partial derivatives, we need to make sure that our starting value tensors are set to be able to store the partial derivatives. We can do this very easily by setting the `requires_grad=True` option when creating the tensors. 
- -[source,python] ----- -beta0 = torch.tensor(5, requires_grad=True) -beta1 = torch.tensor(4, requires_grad=True) -beta2 = torch.tensor(3, requires_grad=True) ----- - -You probably got the following error now. - -.error ----- -RuntimeError: Only Tensors of floating point and complex dtype can require gradients ----- - -Well, let's set the dtype to be `torch.float` and see if that does the trick, then. - -[source,python] ----- -beta0 = torch.tensor(5, requires_grad=True, dtype=torch.float) -beta1 = torch.tensor(4, requires_grad=True, dtype=torch.float) -beta2 = torch.tensor(3, requires_grad=True, dtype=torch.float) ----- - -Great! Unfortunately, after you try to run your epochs, you will likely get the following error. - -.error ----- -TypeError: unsupported operand type(s) for *: 'float' and 'NoneType' ----- - -This is because your `beta0.grad`, `beta1.grad` are None -- why? The partial derivatives (or gradients) are stored in the `beta0`, `beta1`, and `beta2` tensors. If you performed a parameter update as follows. - -[source,python] ----- -beta0 = beta0 - learning_rate * beta0.grad ----- - -The _new_ `beta0` object will have _lost_ the partial derivative information, and the `beta0.grad` will be `None`, causing the error. How do we get around this? We can use a Python _inplace_ operation. An _inplace_ operation will actually _update_ our _original_ `beta0` (_with_ the gradients already saved), instead of creating a brand new `beta0` that loses the gradient. You've probably already seen examples of this in the wild. - -[source,python] ----- -# these are equivalent -a = a - b -a -= b - -# or -a = a * b -a *= b - -# or -a = a + b -a += b - -# etc... ----- - -At this point in time, you are probably _once again_ getting the following error. - -.error ----- -RuntimeError: a leaf Variable that requires grad is being used in an in-place operation. ----- - -This too is an easy fix, simply wrap your update lines in a `with torch.no_grad():` block. - -[source,python] ----- -with torch.no_grad(): - beta0 -= ... - beta1 -= ... - beta2 -= ... ----- - -Woohoo! Finally! But... you may notice (if you are printing your MSE) that the MSE is all over the place and not decreasing like we would expect. This is because the gradients are summed up each iteration unless you clear the gradient out! For example, suppose during the first epoch the gradient is 603, and during the next epoch it is -773. If you do _not_ zero out the gradient, your accumulated gradient after the second epoch will be -170, when we really want -773. To fix _this_, use the `zero_` method from the `grad` attribute. Zero out all of your gradients at the end of each epoch and try again. - -[source,python] ----- -beta0.grad.zero_() ----- - -Finally! It should all be looking good right now. Okay, so `pytorch` is quite particular, _but_ the power of the automatic differentiation can't be overstated. -==== - -[IMPORTANT] -==== -Make sure to make a post on Piazza if you'd like some extra help or think there is a question that could use more attention. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Whoa! That is crazy powerful! That _greatly_ reduces the amount of work we need to do. We didn't use our partial derivative formulas anywhere, how cool! - -But wait, there's more! You know that step where we update our parameters at the end of each epoch? Think about a scenario where, instead of simply 3 parameters, we had 1000 parameters to update. 
That would involve a linear increase in the number of lines of code we would need to write -- instead of just 3 lines of code to update our 3 parameters, we would need 1000! Not something most folks are interested in doing. `pytorch` to the rescue. - -We can use an _optimizer_ to perform the parameter updates, all at once! Update your code to utilize an optimizer to perform the parameter updates. - -There are https://pytorch.org/docs/stable/optim.html[a variety] of different optimizers available. For this project, let's use the `SGD` optimizer. You can see the following example, directly from the linked webpage. - -[source,python] ----- -optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9) -optimizer.zero_grad() -loss_fn(model(input), target).backward() -optimizer.step() ----- - -Here, you can just focus on the following lines. - -[source,python] ----- -optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9) -optimizer.step() ----- - -The first line is the initialization of the optimizer. Here, you really just need to pass our initialized paramters (the betas) as a list to the first argument to `optim.SGD`. The second argument, `lr`, should just be our learning rate (`0.0003`). - -Then, the second line replaces the code where the three parameters are updated. - -[NOTE] -==== -You will no longer need the `with torch.no_grad()` block at all! This completely replaces that code. -==== - -[TIP] -==== -In addition, you can use the optimizer to clear out the gradients as well! Replace the `zero_` methods with the `zero_grad` method of the optimizer. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -You are probably starting to notice how `pytorch` can _really_ simplify things. But wait, there's more! - -In each epoch, you are still calculating the loss manually. Not a huge deal, but it could be a lot of work, and MSE is not the _only_ type of loss function. Use `pytorch` to create your MSE loss function, and use it instead of your manual calculation. - -You can find `torch.nn.MSELoss` documentation https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss[here]. Use the option `reduction='mean'` to get the mean MSE loss. Once you've created your loss function, simply pass your `y_train` as the first argument and your `y_predictions` as the second argument. Very cool! This has been a lot to work on -- the main takeaways here should be that `pytorch` has the capability of greatly simplifying code (and calculus!) like the code used for the gradient descent algorithm. At the same time, `pytorch` is particular, the error messages aren't extremely clear, and it definitely involves a learning curve. - -We've barely scraped the surface of `pytorch` -- there is (always) a _lot_ more to learn! In the next project, we will provide you with the opportunity to utilize a GPU to speed up calculations, and SLURM to parallelize some costly calculations. - -[NOTE] -==== -In the next project we will use `pytorch` to build a model to simplify our code even more, in addition, we will incorporate SLURM and use a GPU to train our model. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project10.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project10.adoc deleted file mode 100644 index ebf9417a2..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project10.adoc +++ /dev/null @@ -1,386 +0,0 @@ -= STAT 39000: Project 10 -- Spring 2022 - -**Motivation:** In this project, we will utilize SLURM for a couple of purposes. The first is to have the chance to utilize a GPU on the cluster for some `pytorch` work, and the second is to use resampling to get point estimates. We can then use those point estimates to make a confidence interval and gain a better understand of the variability of our model. - -**Context:** This is the fourth of a series of 4 projects focused on using SLURM. This project is also an interlude to a series of projects on `pytorch` and `JAX`. We will use `pytorch` for our calculations. - -**Scope:** SLURM, unix, bash, `pytorch`, Python - -.Learning Objectives -**** -- Demystify a "tensor". -- Utilize the pytorch API to create, modify, and operate on tensors. -- Use simple, simulated data to create a multiple linear regression model using closed form solutions. -- Use pytorch to calculate a popular uncertainty quantification, the 95% confidence interval. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/sim/train.csv` -- `/depot/datamine/data/sim/test.csv` -- `/depot/datamine/data/sim/train100k.csv` -- `/depot/datamine/data/sim/train10m.csv` - -== Questions - -[WARNING] -==== -You do not want to wait until the end of the week to do part 1 of this project. Part 1 is pretty straightforward, and basically just requires running code that you've already written a variety of times. There is limited GPU access, so this is the constraint and reason you should attempt to run through part 1 earlier, rather than later. -==== - -[NOTE] -==== -This project is broken into two parts. In part 1, we will use `pytorch` and build our model using cpus and gpus, and draw comparisons. Models will be built using datasets of differing sizes. The goal of part 1 is to see how a GPU _can_ make a large impact on training time. Note that these datasets are synthetic data and don't really represent a realistic scenario, but they _do_ work well to illustrate how powerful GPUs are. - -Part 2 is a continuation from the previous project. In the previous project, you used `pytorch` to perform a gradient descent and build a model for our small, simulated dataset. While it is certainly possible to use other methods to get some form of uncertainty quantification (in our case, we are specifically looking at a 95% confidence interval for our predictions), it is not always easy to do so, or possible. One of the most common methods to calculate these things, in these difficult situations is bootstrapping. In fact, Dr. 
Andrew Gelman, a world-class statistician, had this as his second item in his https://arxiv.org/pdf/2012.00174.pdf[list of the top 50 influential statistical ideas in the past 50 years]. We will use SLURM to perform this computationally intensive, but relatively simple method. -==== - -=== Part 1 - -[IMPORTANT] -==== -This question should be completed on Scholar, as Scholar has a GPU that you can use. - -Start by navigating to https://gateway.scholar.rcac.purdue.edu, and launching a terminal. In the terminal, run the following. - -[source,bash] ----- -mkdir -p ~/.local/share/jupyter/kernels/tdm-f2021-s2022 -cp /class/datamine/apps/jupyter/kernels/f2021-s2022/kernel.json ~/.local/share/jupyter/kernels/tdm-f2021-s2022/kernel.json ----- - -This will give you access to a `f2021-s2022` kernel that is different than our normal kernel on Brown, but has the necessary packages to run the code we need to run. - -Next, delete your Jupyter instance and re-launch a fresh Jupyter Lab instance, and confirm you have access to the GPU. - -To launch the Jupyter Lab instance, click on "Jupyter Notebook" under the GUI section (do **not** use the "Jupyter Lab" in the "Datamine" section) and use the following options: - -- Queue: gpu (Max 4.0 hours) -- Number of hours: 0.5 -- Use Jupyter Lab instead of Jupyter Notebook (checked) - -To confirm you have access to the GPU you can use the following code. Note that you only really need one of these, but I am showing them all because they may be interesting to you. - -[source,python] ----- -import torch - -# see if cuda is available -torch.cuda.is_available() - -# see the current device -torch.cuda.current_device() - -# see the number of devices available -torch.cuda.device_count() - -# get the name of a device -torch.cuda.get_device_name(0) ----- -==== - -For this question you will use `pytorch` with cpus (like in the previous project) to build a model for `train.csv`, `train100k.csv`, and `train10m.csv`. Use the `%%time` Jupyter magic to time the calculation for each dataset. - -[TIP] -==== -The following is the code from the previous project that you can use to get started. - -[source,python] ----- -import torch -import pandas as pd - -dat = pd.read_csv("/depot/datamine/data/sim/train.csv") -x_train = torch.tensor(dat['x'].to_numpy()) -y_train = torch.tensor(dat['y'].to_numpy()) - -beta0 = torch.tensor(5, requires_grad=True, dtype=torch.float) -beta1 = torch.tensor(4, requires_grad=True, dtype=torch.float) -beta2 = torch.tensor(3, requires_grad=True, dtype=torch.float) -learning_rate = .0003 - -num_epochs = 10000 -optimizer = torch.optim.SGD([beta0, beta1, beta2], lr=learning_rate) -mseloss = torch.nn.MSELoss(reduction='mean') - -for idx in range(num_epochs): - # calculate the predictions / forward pass - y_predictions = beta0 + beta1*x_train + beta2*x_train**2 - - # calculate the MSE - mse = mseloss(y_train, y_predictions) - - if idx % 100 == 0: - print(f"MSE: {mse}") - - # calculate the partial derivatives / backwards step - mse.backward() - - # update our parameters - optimizer.step() - - # zero out the gradients - optimizer.zero_grad() - -print(f"beta0: {beta0}") -print(f"beta1: {beta1}") -print(f"beta2: {beta2}") ----- -==== - -[IMPORTANT] -==== -For `train10m.csv`, instead of running the entire 10k epochs, just perform 100 epochs, and estimate the amount of time it would take to complete 10k epochs. We _try_ not to be _that_ mean, although, if you _do_ want to wait and see, that is perfectly fine. 
-==== - -Modify your code to use a gpu instead of cpus, and time the time it takes to train the model using `train.csv`, `train100k.csv`, and `train10m.csv`. What percentage faster is the GPU calculations for each dataset? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Time it took to build the model for the `train.csv` and `train100k.csv` using cpus. In addition, the estimated time it would take to build the model for `train10m.csv`, again, using cpus. -- Time it took to build the model for the `train.csv`, `train100k.csv`, and `train10m.csv`, using gpus. -- What percentage faster (or slower) the GPU version is vs the CPU version for each dataset. -==== - -=== Part 2 - -[IMPORTANT] -==== -You can now save your notebook, and switch back to using Brown. Navigate to https://ondemand.brown.rcac.purdue.edu/ and launch a Jupyter Lab instance the way you normally would, and fill in your notebook with you solutions to part 2. **Be careful not to overwrite your output from part 1.** - -You will want to copy your notebook to Brown, first. To do so from Scholar, open a terminal and copy the notebook as follows. - -[source,bash] ----- -scp /home/purduealias/my_notebook.ipynb brown.rcac.purdue.edu:/home/purduealias/ ----- - -Or to copy from Brown. - -[source,bash] ----- -scp scholar.rcac.purdue.edu:/home/purduealias/my_notebook.ipynb /home/purduealias/ ----- -==== - -We've provided you with a Python script called `bootstrap_samples.py` that accepts a single value, for example 10, and runs the code you wrote in the previous project 10 times. This code should have a few modifications. One major, but simple modification is that rather than using our training data to build the model, instead, sample the same number of values in our `x_train` tensor _from_ our `x_train` tensor, _with_ replacement. What this means is if our `x_train` contained 1,2,3, we could produce any of the following samples 1,2,3 or 1,1,2 or 1,2,2 or 3,3,3 etc. We called these resampled values `xr_train`. Then proceed as normal, building your model using `xr_train` instead of `x_train`. - -In addition at the end of the script, we used your model to get predictions for all of the values in `x_test`. Save these predictions to a parquet file, for example, `0cd68e5e-134d-4575-a31d-2060644f4caa.parquet`, in a safe location, for example `$CLUSTER_SCRATCH/p10output/`. Each file will each contain a single set of point estimates for our predictions. 
- -.bootstrap_samples.py -[source,python] ----- -#!/scratch/brown/kamstut/tdm/apps/jupyter/kernels/f2021-s2022/.venv/bin/python - -import sys -import argparse -import pandas as pd -import random -import torch -from pathlib import Path -import uuid - - -class Regression(torch.nn.Module): - def __init__(self): - super().__init__() - self.beta0 = torch.nn.Parameter(torch.tensor(5, requires_grad=True, dtype=torch.float)) - self.beta1 = torch.nn.Parameter(torch.tensor(4, requires_grad=True, dtype=torch.float)) - self.beta2 = torch.nn.Parameter(torch.tensor(3, requires_grad=True, dtype=torch.float)) - - def forward(self, x): - return self.beta0 + self.beta1*x + self.beta2*x**2 - - -def get_point_estimates(x_train, y_train, x_test): - - model = Regression() - learning_rate = .0003 - - num_epochs = 10000 - optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) - mseloss = torch.nn.MSELoss(reduction='mean') - - # resample data - resampled_idxs = random.choices(range(75), k=75) - xr_train = torch.tensor(x_train[resampled_idxs], requires_grad=True, dtype=torch.float).reshape(75) - - for _ in range(num_epochs): - # set to training mode -- note this does not _train_ anything - model.train() - - # calculate the predictions / forward pass - y_predictions = model(xr_train) - - # calculate the MSE - mse = mseloss(y_train[resampled_idxs], y_predictions) - - # calculate the partial derivatives / backwards step - mse.backward() - - # update our parameters - optimizer.step() - - # zero out the gradients - optimizer.zero_grad() - - # get predictions - predictions = pd.DataFrame(data={"predictions": model(x_test).detach().numpy()}) - - return(predictions) - - -def main(): - parser = argparse.ArgumentParser() - subparsers = parser.add_subparsers(help="possible commands", dest="command") - bootstrap_parser = subparsers.add_parser("bootstrap", help="") - bootstrap_parser.add_argument("n", type=int, help="number of set of point estimates for predictions to output") - bootstrap_parser.add_argument("-o", "--output", help="directory to output file(s) to") - - if len(sys.argv) == 1: - parser.print_help() - sys.exit(1) - - args = parser.parse_args() - - if args.command == "bootstrap": - - dat = pd.read_csv("/depot/datamine/data/sim/train.csv") - x_train = torch.tensor(dat['x'].to_numpy(), dtype=torch.float) - y_train = torch.tensor(dat['y'].to_numpy(), dtype=torch.float) - - dat = pd.read_csv("/depot/datamine/data/sim/test.csv") - x_test = torch.tensor(dat['x'].to_numpy(), dtype=torch.float) - - for _ in range(args.n): - estimates = get_point_estimates(x_train, y_train, x_test) - estimates.to_parquet(f"{Path(args.output) / str(uuid.uuid4())}.parquet") - -if __name__ == "__main__": - main() ----- - -[IMPORTANT] -==== -Make sure your `p10output` directory exists! -==== - -[TIP] -==== -You can use the script like `./my_script.py bootstrap 10 --output /scratch/brown/purduealias/p10output/` to create 10 sets of point estimates. Make sure the `p10output` directory exists first! -==== - -Okay, there are a couple of other different modifications in the script. Carefully read through the code, and give you best explaination of the changes in 2-3 sentences. Add another 1-2 sentences with your opinion of the changes. - -Next, create your job script. Let's call this `p10_job.sh`. You can use the following code as a starting point for your script (from a previous project). We would highly recommend using 10 cores to generate a total of 2000 sets of point estimates. 
The total runtime will vary but should be anywhere from 5 to 15 minutes. - -.p10_job.sh -[source,bash] ----- -#!/bin/bash -#SBATCH --account=datamine # Queue -#SBATCH --job-name=kevinsjob # Job name -#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL) -#SBATCH --mail-user=kamstut@purdue.edu # Where to send mail -#SBATCH --time=00:30:00 -#SBATCH --ntasks=10 # Number of tasks (total) -#SBATCH -o /dev/null # Output to dev null -#SBATCH -e /dev/null # Error to dev null - -arr=(/depot/datamine/data/coco/unlabeled2017/*) - -for((i=0; i < ${#arr[@]}; i+=12500)) -do - part=( "${arr[@]:i:12500}" ) - srun -A datamine --exclusive -n 1 --mem-per-cpu=200 module use /scratch/brown/kamstut/tdm/opt/modulefiles; module load libffi/3.4.2; $HOME/hash1.py hash --output $CLUSTER_SCRATCH/p4output/ ${part[*]} & -done - -wait ----- - -[TIP] -==== -You won't need any of that array stuff anymore since we don't have to keep track of the files we're working with. -==== - -[IMPORTANT] -==== -Make sure both `bootstrap_samples.py` and `p10_job.sh` have execute permissions. - -[source,bash] ----- -chmod +x /path/to/bootstrap_samples.py -chmod +x /path/to/p10_job.sh ----- -==== - -[IMPORTANT] -==== -Make sure you keep the `module use` and `module load` lines in your job script -- libffi is required for your code to run. -==== - -Submit your job using `sbatch p10_job.sh`. - -[WARNING] -==== -Make sure to clear out the SLURM environment variables if you choose to run the `sbatch` command from within a bash cell in your notebook. - -[source,bash] ----- -for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done; ----- -==== - -Great! Now you have a directory `$CLUSTER_SCRATCH/p10output/` that contains 2000 sets of point estimates. Your job is now to process this data to create a graphic showign: - -. The _actual_ `y_test` values (in blue) as a set of points (using `plt.scatter`). -. The predictions as a line. -. The confidence intervals as a shaded region. (You can use `plt.fill_between`). - -The 95% confidence interval is simply the 97.5th percentile of each prediction's point estimates (upper) and the 2.5th percentile of each prediction's point estimates (lower). - -[IMPORTANT] -==== -You will need to run the algorithm to get your predictions using the non-resampled training data -- otherwise you won't have the predictions to plot! -==== - -[TIP] -==== -You will notice that some of your point estimates will be NaN. Resampling can cause your model to no longer converge unless you change the learning rate. Remove the NaN values, you should be left with around 1500 sets of point estimates that you can use. -==== - -[TIP] -==== -You can loop through the output files by doing something like: - -[source,python] ----- -from pathlib import Path - -for file in Path("/scratch/brown/purduealias/p10output/").glob("*.parquet"): - pass ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 2-3 sentences explaining the "other" changes in the provided script. -- 1-2 sentences describing your opinion of the changes. -- `p10_job.sh`. -- Your resulting graphic -- make sure it renders properly when viewed in Gradescope. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. 
- -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project11.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project11.adoc deleted file mode 100644 index d3624d4d5..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project11.adoc +++ /dev/null @@ -1,409 +0,0 @@ -= STAT 39000: Project 11 -- Spring 2022 - -**Motivation:** Machine learning and AI are huge buzzwords in industry, and two of the most popular tools surrounding said topics are the `pytorch` and `tensorflow` libraries — `JAX` is another tool by Google growing in popularity. These tools are libraries used to build and use complex models. If available, they can take advantage of GPUs to speed up parallelizable code by a hundred or even thousand fold. - -**Context:** This is the third of a series of 4 projects focused on using `pytorch` and `JAX` to solve numeric problems. - -**Scope:** Python, JAX - -.Learning Objectives -**** -- Compare and contrast `pytorch` and `JAX`. -- Differentiate functions using `JAX`. -- Understand what "JIT" is and why it is useful. -- Understand when a value or operation should be static vs. traced. -- Vectorize functions using the `vmap` function from `JAX`. -- How do random number generators work in `JAX`? -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/sim/train.csv` - -== Questions - -=== Question 1 - -`JAX` is a library for high performance computing. It falls into the same category as other popular packages like: `numpy`, `pytorch`, and `tensorflow`. `JAX` is a product of Google / Deepmind that takes a completely different approach than their other product, `tensorflow`. - -Like the the other popular libraries, `JAX` can utilize GPUs/TPUs to greatly speed up computation. Let's take a look. - -Here is a snippet of code from previous projects that uses `pytorch` and calculates predictions 10000 times. - -[NOTE] -==== -Of course, this is the same calculation since our betas aren't being updated yet, but just bear with me. -==== - -[source,python] ----- -import pandas as pd -import torch -import jax -import jax.numpy as jnp - -dat = pd.read_csv("/depot/datamine/data/sim/train.csv") ----- - -[source,python] ----- -%%time - -x_train = torch.tensor(dat['x'].to_numpy()) -y_train = torch.tensor(dat['y'].to_numpy()) - -beta0 = torch.tensor(5, requires_grad=True, dtype=torch.float) -beta1 = torch.tensor(4, requires_grad=True, dtype=torch.float) -beta2 = torch.tensor(3, requires_grad=True, dtype=torch.float) - -num_epochs = 10000 - -for idx in range(num_epochs): - - y_predictions = beta0 + beta1*x_train + beta2*x_train**2 ----- - -Approximately how much time does it take to run this second chunk of code (after we have already read in our data)? - -Here is the equivalent `JAX` code: - -[source,python] ----- -%%time - -x_train = jnp.array(dat['x'].to_numpy()) -y_train = jnp.array(dat['y'].to_numpy()) - -beta0 = 5 -beta1 = 4 -beta2 = 3 - -num_epochs = 10000 - -for idx in range(num_epochs): - - y_predictions = beta0 + beta1*x_train + beta2*x_train**2 ----- - -How much time does this take? - -At this point in time you may be questioning how `JAX` could possibly be worth it. 
At first glance, the new code _does_ look a bit cleaner, but not clean enough to use code that is around 3 times slower. - -This is where `JAX` first trick, or _transformation_ comes in to play. When we refer to _transformation_, think of it as an operation on some function that produces another function as an output. - -The first _transformation_ we will talk about is `jax.jit`. "JIT" stands for "Just In Time" and refers to a "Just in time" compiler. Essentially, just in time compilation is a trick that can be used to _greatly_ speed up the execution of _some_ code by compiling the code. In a nutshell, the compiled version of the code has a wide variety of optimizations that speed your code up. - -Lots of our computation time is spent inside our loop, specifically when we are calculating our `y_predictions`. Let's see if we can use the jit transformation to speed up our `JAX` code with little to no extra effort. - -Write a function called `model` that accepts two arguments. The first argument is a tuple containing our parameters: `beta0`, `beta1`, and `beta2`. The second is our _input_ to our function (our x values) called `x`. `model` should then _unpack_ our tuple of parameters into `beta0`, `beta1`, and `beta2`, and then return predictions (the same formula shown above, twice). Replace the code as follows. - -[source,python] ----- -# replace this line -y_predictions = beta0 + beta1*x_train + beta2*x_train**2 - -# with -y_predictions = model((beta0, beta1, beta2), x_train) ----- - -Run and time the code again. No difference? Well, we didn't use our jit transformation yet! Using the transformation is easy. `JAX` provides two equivalent ways. You can either _decorate_ your `model` function with the `@jax.jit` https://realpython.com/primer-on-python-decorators/[decorator], or simply apply the transformation to your function and save the new, jit compiled function and use _it_ instead. - -[source,python] ----- -def my_func(x): - return x**2 - -@jax.jit -def my_func_jit1(x): - return x**2 - -my_func_jit2 = jax.jit(my_func) ----- - -Re-run your code using the JIT transformation. Is it faster now? - -[NOTE] -==== -It is important to note that `pytorch` _does_ have some `jit` functionality, and there is also a package called `numba` which can help with this as well, however, it is not as straightforward to perform the same operation using either as it is using `JAX`. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -At this point in time you may be considering slapping `@jax.jit` on all your functions -- unfortunately it is not quite so simple! First of all, the previous comparison was actually not fair at all. Why? `JAX` has asynchronous dispatch by default. What this means is that, by default, `JAX` will return control to Python as soon as possible, even if it is _before_ the function has been fully evaluated. - -What does this mean? It means that our finished example from question 1 may be returning a not-yet-complete result, greatly throwing off our performance measurements. So how can we _synchronously_ wait for execution to finish? This is easy, simply use the `block_until_ready` method built in to your jit compiled `model` function. 
- -[source,python] ----- -def my_func(x): - return x**2 - -@jax.jit -def my_func_jit1(x): - return x**2 - -my_func_jit2 = jax.jit(my_func) - -my_func_jit1.block_until_ready() - -# or - -my_func_jit2.block_until_ready() ----- - -Re-run your code from before -- you should find that the results are unchanged, it turns out that really _was_ a serious speedup from before. Great. Let's move on from this part of things. Back to our question. Why can't we just slap `@jax.jit` on any function and expect a speedup? - -Take the following function. - -[source,python] ----- -def train(params, x, y, epochs): - def _model(params, x): - beta0, beta1, beta2 = params - return beta0 + beta1*x + beta2*x**2 - - mses = [] - for _ in range(epochs): - y_predictions = _model(params, x_train) - mse = jnp.sum((y_predictions - y)**2) - -fast_train = jax.jit(train) - -fast_train((beta0, beta1, beta2), x_train, y_train, 10000) ----- - -If you try running it you will get an error saying something along the lines of "TracerIntegerConversionError". The problem with this function, and why it cannot be jit compiled, is the `epochs` argument. By default, `JAX` tries to "trace" the parameters to determine its effect on inputs of a specific shape and type. Control flow cannot depend on traced values -- in this case, `epochs` is relied on in order to determine how many times to loop. In addition, the _shapes_ of all input and output values of a function must be able to be determined ahead of time. - -How do we fix this? Well, it is not always possible, however, we _can_ choose to select parameters to be _static_ or not traced. If a parameter is marked as static, or not traced, it can be JIT compiled. The catch is that any time a call to the function is made and the value of the static parameter is changed, the function will have to be recompiled with that new static value. So, this is only useful if you will only occasionally change the parameter. This sounds like our case! We only want to occasionally change the number of epochs, so perfect. - -You can mark a parameter as static by specifying the argument position using the `static_argnums` argument to `jax.jit`, or by specifying the argument _name_ using the `static_argnames` argument to `jax.jit`. - -Force the `epochs` argument to be static, and use the `jax.jit` decorator to compile the function. Test out the function, in order using the following code cells. - -[source,ipython] ----- -%%time - -fast_train((beta0, beta1, beta2), x_train, y_train, 10000) ----- - -[source,ipython] ----- -%%time - -fast_train((beta0, beta1, beta2), x_train, y_train, 10000) ----- - -[source,ipython] ----- -%%time - -fast_train((beta0, beta1, beta2), x_train, y_train, 9999) ----- - -Do your best to explain why the last code cell was once again slower. - -[TIP] -==== -If you aren't sure why, reread the question text -- we hint at the "catch" in the text. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -We learned that one of the coolest parts of the `pytorch` package was the automatic differentiation feature. It saves a _lot_ of time doing some calculus and coding up resulting equations. Recall that in `pytorch` this differentiation was baked into the `backward` method of our MSE. This is quite different from the way we think about the equations when looking at the math, and is quite confusing. - -`JAX` has the same functionality, but it is _much_ cleaner and easier to use. 
We will provide you with a simple example, and explain the math as we go along. - -Let's say our function is $f(x) = 2x^2$. We can start by writing a function. - -[source,python] ----- -def squared(x): - return 2*x**2 ----- - -Fantastic, so far pretty easy. - -The derivative w.r.t. `x` is $4x$. Doing this in `JAX` is as easy as applying the `jax.grad` _transformation_ to the function. - -[source,python] ----- -squared_deriv = jax.grad(squared) ----- - -Okay, test out both functions as follows. - -[source,python] ----- -my_array = jnp.array([1.0, 2.0, 3.0]) - -squared(4.0) # 32.0 -squared(my_array) # [2.0, 8.0, 18.0] -squared_deriv(4.0) # 16.0 -squared_deriv(my_array) # uh oh! Something went wrong! ----- - -[IMPORTANT] -==== -A very perceptive student pointed out that we originally passed array values that were ints to `jax.grad`. This will fail. You can read more about why https://jax.readthedocs.io/en/latest/notebooks/Common_Gotchas_in_JAX.html#non-array-inputs-numpy-vs-jax[here]. -==== - -On the last line, you probably received a message or error saying something along the lines of "Gradient only defined for scalar-output functions." What this means is that the resulting derivative function is not _vectorized_. As you may have guessed, this is easily fixed. Another key _transformation_ that `JAX` provides is called `vmap`. `vmap` takes a function and creates a vectorized version of the function. See the following. - -[source,python] ----- -vectorized_deriv_squared = jax.vmap(squared_deriv) -vectorized_deriv_squared(my_array) # [4.0, 8.0, 12.0] ----- - -Heck yes! That is pretty cool, and very powerful. It is _so_ much more understandable than the magic happening in the `pytorch` world too! - -Dig back into your memory about any equation you may have had in the past where you needed to find a derivative. Create a Python function, find the derivative, and test it out on both a single value, like `4.0` as well as an array, like `jnp.array([1.0,2.0,3.0])`. Don't hesitate to make it extra fun and include some functions like `jnp.cos`, `jnp.sin`, etc. Did everything work as expected? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Okay, great, but that was a straightforward example. What if we have multiple parameters we'd like to take partial derivatives with respect to? `jax.grad` can handle that too! - -Read https://jax.readthedocs.io/en/latest/jax-101/01-jax-basics.html#jax-first-transformation-grad[this] excellent example in the official JAX documentation. - -[NOTE] -==== -The JAX documentation is pretty excellent! If you are interested, I would recommend reading through it, it is very well written. -==== - -Given the following (should be familiar) model, create a function called `get_partials` that accepts an argument `params` (a tuple containing beta0, beta1, and beta2, in order) and an argument `x`, that can be either a single value (a scalar), or a `jnp.array` with multiple values. This function should return a single value for each of the 3 partial derivatives, where `x` is plugged into each of the 3 partial derivatives to calculate each value, OR, 3 arrays of results where there are 3 values for each value in the input array. 
- -[source,python] ----- -@jax.jit -def model(params, x): - beta0, beta1, beta2 = params - return beta0 + beta1*x + beta2*x**2 ----- - -.example using it -[source,python] ----- -model((1.0, 2.0, 3.0), 4.0) # 57 -model((1.0, 2.0, 3.0), jnp.array((4.0, 5.0, 6.0))) # [57, 86, 121] ----- - -Since we have 3 parameters, we will have 3 partial derivatives, and our new function should output a value for each of our 3 partial derivatives, for each value passed as `x`. To be explicit and allow you to check your work, the results should be the same as the following. - -[source,python] ----- -params = (5.0, 4.0, 3.0) -get_partials(params, x_train) ----- - -.output ----- -((DeviceArray([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., - 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., - 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., - 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., - 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], dtype=float32, weak_type=True), - DeviceArray([-15.94824 , -11.117526 , -10.4780855 , -8.867778 , - -8.799367 , -8.140428 , -7.8744955 , -7.72306 , - -6.9281745 , -6.2731333 , -6.2275624 , -5.7271757 , - -5.1857414 , -5.150156 , -4.8792663 , -4.663747 , - -4.58701 , -4.1310377 , -4.0215836 , -4.019455 , - -3.5578184 , -3.4748363 , -3.4004524 , -3.1221437 , - -3.0421085 , -2.941131 , -2.8603644 , -2.8294718 , - -2.7050996 , -1.9493109 , -1.7873074 , -1.2773769 , - -1.1804487 , -1.1161369 , -1.1154363 , -0.8590109 , - -0.81457555, -0.7386795 , -0.57577926, -0.5536533 , - -0.51964295, -0.12334588, 0.11549235, 0.14650635, - 0.24305418, 0.2876291 , 0.3942046 , 0.6342466 , - 0.8256681 , 1.2047065 , 1.9168468 , 1.9493027 , - 1.9587051 , 2.3490443 , 2.7015095 , 2.8161156 , - 2.8648841 , 2.946292 , 3.1312609 , 3.1810293 , - 4.503682 , 5.114829 , 5.1591663 , 5.205859 , - 5.622392 , 5.852435 , 6.21313 , 6.4066596 , - 6.655888 , 6.781989 , 7.1651325 , 7.957219 , - 8.349893 , 11.266327 , 13.733376 ], dtype=float32, weak_type=True), - DeviceArray([2.54346375e+02, 1.23599388e+02, 1.09790276e+02, - 7.86374817e+01, 7.74288559e+01, 6.62665634e+01, - 6.20076790e+01, 5.96456566e+01, 4.79996033e+01, - 3.93521996e+01, 3.87825356e+01, 3.28005409e+01, - 2.68919144e+01, 2.65241070e+01, 2.38072395e+01, - 2.17505341e+01, 2.10406590e+01, 1.70654716e+01, - 1.61731339e+01, 1.61560173e+01, 1.26580715e+01, - 1.20744877e+01, 1.15630760e+01, 9.74778175e+00, - 9.25442410e+00, 8.65025234e+00, 8.18168449e+00, - 8.00591087e+00, 7.31756353e+00, 3.79981303e+00, - 3.19446778e+00, 1.63169169e+00, 1.39345896e+00, - 1.24576163e+00, 1.24419820e+00, 7.37899661e-01, - 6.63533330e-01, 5.45647442e-01, 3.31521749e-01, - 3.06531966e-01, 2.70028800e-01, 1.52142067e-02, - 1.33384829e-02, 2.14641113e-02, 5.90753369e-02, - 8.27304944e-02, 1.55397251e-01, 4.02268738e-01, - 6.81727827e-01, 1.45131791e+00, 3.67430139e+00, - 3.79978085e+00, 3.83652544e+00, 5.51800919e+00, - 7.29815340e+00, 7.93050718e+00, 8.20756149e+00, - 8.68063641e+00, 9.80479431e+00, 1.01189480e+01, - 2.02831535e+01, 2.61614761e+01, 2.66169968e+01, - 2.71009693e+01, 3.16112938e+01, 3.42509956e+01, - 3.86029854e+01, 4.10452881e+01, 4.43008461e+01, - 4.59953766e+01, 5.13391228e+01, 6.33173370e+01, - 6.97207031e+01, 1.26930122e+02, 1.88605606e+02], dtype=float32, weak_type=True)),) ----- - -[source,python] ----- -get_partials((1.0,2.0,3.0), jnp.array((4.0,))) ----- - -.output ----- -((DeviceArray([1.], dtype=float32, weak_type=True), - DeviceArray([4.], dtype=float32, weak_type=True), - DeviceArray([16.], 
dtype=float32, weak_type=True)),) ----- - -[TIP] -==== -To specify which arguments to take the partial derivative with respect to, use the `argnums` argument to `jax.grad`. In our case, our first argument is really 3 parameters all at once, so if you did `argnums=(0,)` it would take 3 partial derivatives. If you specified `argnums=(0,1)` it would take 4 -- that last one being with respect to x. -==== - -[TIP] -==== -To vectorize your resulting function, use `jax.vmap`. This time, since we have many possible arguments, we will need to specify the `in_axes` argument to `jax.vmap`. `in_axes` will accept a tuple of values -- one value per parameter to our function. Since our function has 2 arguments: `params` and `x`, this tuple should have 2 values. We should put `None` for arguments that we don't want to vectorize over (in this case, `params` stays the same for each call, so the associated `in_axes` value for `params` should be `None`). Our second argument, `x`, should be able to be a vector, so you should put `0` for the associated `in_axes` value for `x`. - -This is confusing! However, considering how powerful and all that is baked into the `get_partials` function, it is probably acceptable to have to sit an think a bit to figure this out. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project12-teachingprogramming.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project12-teachingprogramming.adoc deleted file mode 100644 index d20736e77..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project12-teachingprogramming.adoc +++ /dev/null @@ -1,231 +0,0 @@ -= STAT 39000: Project 12 -- Spring 2022 - -**Motivation:** Machine learning and AI are huge buzzwords in industry, and two of the most popular tools surrounding said topics are the pytorch and tensorflow libraries — JAX is another tool by Google growing in popularity. These tools are libraries used to build and use complex models. If available, they can take advantage of GPUs to speed up parallelizable code by a hundred or even thousand fold. - -**Context:** This is the last of a series of 4 projects focused on using pytorch and JAX to solve numeric problems. - -**Scope:** Python, JAX - -.Learning Objectives -**** -- Compare and contrast pytorch and JAX. -- Differentiate functions using JAX. -- Understand what "JIT" is and why it is useful. -- Understand when a value or operation should be static vs. traced. -- Vectorize functions using the vmap function from JAX. -- How do random number generators work in JAX? -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/` - -== Questions - -=== Question 1 - -Last weeks project was a bit fast paced, so we will slow things down considerablyto try and compensate, and give you a chance to digest and explore more. We will: - -- Learn how `JAX` handles generating random numbers differently than most other packages. -- Write a function in `numpy` to calculate the Hamming distance between a given image hash and the remaining (around 123k) image hashes. -- Play around with the hash data and do some sanity checks. - -Let's start by taking a look at the documentation for https://jax.readthedocs.io/en/latest/jax-101/05-random-numbers.html[random number generation]. Carefully read the page -- it's not that long. - -The documentation gives the following example. - -[source,python] ----- -import numpy as np - -np.random.seed(0) - -def bar(): return np.random.uniform() -def baz(): return np.random.uniform() - -def foo(): return bar() + 2 * baz() - -print(foo()) ----- - -It then goes on to say that `JAX` may try to parallelize the `bar` and `baz` functions. As a result, we would not know which would run first, `bar` or `baz`. This would change the results of `foo`. Below, we've modified the code to emulate this. - -[source,python] ----- -import numpy as np -import random - -def bar(): return np.random.uniform() -def baz(): return np.random.uniform() - -def foo1(): return bar() + 2 * baz() - -def foo2(): return 2*baz() + bar() - -def foo(*funcs): - functions = list(funcs) - random.shuffle(functions) - return functions[0]() ----- - -[source,python] ----- -np.random.seed(0) -foo(foo1, foo2) ----- - -.output ----- -# sometimes this -1.9791922366721637 - -# sometimes this -1.812816374227069 ----- - -`JAX` has a much different way of dealing with this. While the solution is clean and effective, and allows such code to be parallelized, it _can_ be a bit more cumbersome managing and passing around keys. Create a modified version of this code using `JAX`, and passing around keys. Fill in the `?` parts. - -[source,python] ----- -import numpy as np - -key = jax.random.PRNGKey(0) -key, *subkeys = jax.random.split(key, num=?) - -def bar(key): - return ? - -def baz(key): - return ? - -def foo1(key1, key2): - return bar(key1) + 2 * baz(key2) - -def foo2(key1, key2): - return 2*baz(key2) + bar(key1) - -def foo(funcs, keys): - functions = list(funcs) - random.shuffle(functions) - return ? ----- - -[source,python] ----- -key = jax.random.PRNGKey(0) -key, *subkeys = jax.random.split(key, num=3) -print(foo((foo1, foo2), (subkeys[0], subkeys[1]))) ----- - -.output ----- -# always -2.3250647 ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Write a function called `get_distances_np` that accepts a filename (as a string) (`fm_hash`), and a path (as a string) (`path`). - -`get_distances_np` should return a numpy array of the distances between the hash for `fm_hash` and every other image hash in `path`. - -For this question, use the dataset of hashed images found in `/depot/datamine/data/coco/hashed02/`. An example of a call to `get_distances_np` would look like the following. - -[source,python] ----- -from pathlib import Path -import imagehash -import numpy as np ----- - -[source,python] ----- -%%time - -hshs = get_distances_np("000000000008.jpg", "/depot/datamine/data/coco/hashed02/") -hshs.shape # (123387, 1) ----- - -How long does it take to run this function? 
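If it helps to get started, one possible shape for `get_distances_np` is sketched below. It assumes each file in `hashed02` is named after the original image and stores that image's hash as a hex string that `imagehash.hex_to_hash` can read -- if the hashes are stored differently, adjust the reading step accordingly. Depending on whether you include the reference image itself, the number of rows may differ by one.

[source,python]
----
from pathlib import Path

import imagehash
import numpy as np


def get_distances_np(fm_hash, path):
    # hash of the image we are comparing everything against
    reference = imagehash.hex_to_hash((Path(path) / fm_hash).read_text().strip())

    distances = []
    for file in Path(path).glob("*"):
        if file.name == fm_hash:
            continue

        other = imagehash.hex_to_hash(file.read_text().strip())

        # imagehash defines subtraction as the Hamming distance between two hashes
        distances.append(reference - other)

    return np.array(distances).reshape(-1, 1)
----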
- -Make plots and/or summary statistics to check out the distribution of the distances. How does it look? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -What do you think about the design of the `get_distances_np` function, considering that we are interested in pairwise Hamming distances? - -At its core, we essentially have a vector of 123k values. If we were to get the pairwise distances, the resulting distances would fill the upper triangle of a 123k by 123k matrix. This would be a _very large_ amount of data, considering it is just numeric data -- more than can easily fit in memory. - -In addition, the part of the function from question 2 that takes the majority of the run time is _not_ the numeric computations, but rather the opening and reading of the 123k hashes. Approximately 55 of the 65-70 seconds. With this in mind, let's back up, and break this problem down further. - -Write a code cell containing code that will read in all of the hashes into a `numpy` array of size (123387, 64). - -This array contains the hashes for each of the 123k images. Each row is the hash of an image. Let's call the resulting (123387, 64) array `hashes`. - -Given what we know, the following is a very fast function that will find the Hamming distances between a single image and all of the other images. - -[source,python] ----- -def hamming_distance(hash1, hash2): - return np.sum(~(hash1 == hash2), axis=1) ----- - -[source,python] ----- -%%time - -hamming_distance(hashes[0], hashes) ----- - -This runs in approximately 46 ms. This would be about 94-95 minutes if we did this calculation for each pair. - -Convert your `numpy` array into a `JAX` array, and create an equivalent function. How fast does this function run? What would the approximate runtime be for the total calculation? - -[IMPORTANT] -==== -Remember to use `jax.jit` to speed up the function. Also recall that the first run of the compiled function will be _slow_ since it needs to be compiled. After that, future uses of the function will be faster. -==== - -Make sure to take into consideration the slower first run. What would the approximate total runtime be using the `JAX` function? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Don't worry, I'm not going to make you run these calculations. Instead, answer one of the following two questions. - -. Pick 2 images / image hashes and get the closest 3 images by Hamming distance for each. Note the distances and display the images. At those distances, can you perceive any sort of "closeness" in image? -. Randomly sample (using `JAX` methods) _n_ (more than 4, please) images and calculate all of the pairwise distances. Create a set of plots showing the distributions of distances. Explore the distances, and the dataset, and write 1-2 sentences about any interesting observations you are able to make, or 1-2 sentences on how you could use the information to do something cool. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. 
- -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project12.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project12.adoc deleted file mode 100644 index d20736e77..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project12.adoc +++ /dev/null @@ -1,231 +0,0 @@ -= STAT 39000: Project 12 -- Spring 2022 - -**Motivation:** Machine learning and AI are huge buzzwords in industry, and two of the most popular tools surrounding said topics are the pytorch and tensorflow libraries — JAX is another tool by Google growing in popularity. These tools are libraries used to build and use complex models. If available, they can take advantage of GPUs to speed up parallelizable code by a hundred or even thousand fold. - -**Context:** This is the last of a series of 4 projects focused on using pytorch and JAX to solve numeric problems. - -**Scope:** Python, JAX - -.Learning Objectives -**** -- Compare and contrast pytorch and JAX. -- Differentiate functions using JAX. -- Understand what "JIT" is and why it is useful. -- Understand when a value or operation should be static vs. traced. -- Vectorize functions using the vmap function from JAX. -- How do random number generators work in JAX? -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/depot/datamine/data/` - -== Questions - -=== Question 1 - -Last weeks project was a bit fast paced, so we will slow things down considerablyto try and compensate, and give you a chance to digest and explore more. We will: - -- Learn how `JAX` handles generating random numbers differently than most other packages. -- Write a function in `numpy` to calculate the Hamming distance between a given image hash and the remaining (around 123k) image hashes. -- Play around with the hash data and do some sanity checks. - -Let's start by taking a look at the documentation for https://jax.readthedocs.io/en/latest/jax-101/05-random-numbers.html[random number generation]. Carefully read the page -- it's not that long. - -The documentation gives the following example. - -[source,python] ----- -import numpy as np - -np.random.seed(0) - -def bar(): return np.random.uniform() -def baz(): return np.random.uniform() - -def foo(): return bar() + 2 * baz() - -print(foo()) ----- - -It then goes on to say that `JAX` may try to parallelize the `bar` and `baz` functions. As a result, we would not know which would run first, `bar` or `baz`. This would change the results of `foo`. Below, we've modified the code to emulate this. - -[source,python] ----- -import numpy as np -import random - -def bar(): return np.random.uniform() -def baz(): return np.random.uniform() - -def foo1(): return bar() + 2 * baz() - -def foo2(): return 2*baz() + bar() - -def foo(*funcs): - functions = list(funcs) - random.shuffle(functions) - return functions[0]() ----- - -[source,python] ----- -np.random.seed(0) -foo(foo1, foo2) ----- - -.output ----- -# sometimes this -1.9791922366721637 - -# sometimes this -1.812816374227069 ----- - -`JAX` has a much different way of dealing with this. 
While the solution is clean and effective, and allows such code to be parallelized, it _can_ be a bit more cumbersome managing and passing around keys. Create a modified version of this code using `JAX`, and passing around keys. Fill in the `?` parts. - -[source,python] ----- -import numpy as np - -key = jax.random.PRNGKey(0) -key, *subkeys = jax.random.split(key, num=?) - -def bar(key): - return ? - -def baz(key): - return ? - -def foo1(key1, key2): - return bar(key1) + 2 * baz(key2) - -def foo2(key1, key2): - return 2*baz(key2) + bar(key1) - -def foo(funcs, keys): - functions = list(funcs) - random.shuffle(functions) - return ? ----- - -[source,python] ----- -key = jax.random.PRNGKey(0) -key, *subkeys = jax.random.split(key, num=3) -print(foo((foo1, foo2), (subkeys[0], subkeys[1]))) ----- - -.output ----- -# always -2.3250647 ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Write a function called `get_distances_np` that accepts a filename (as a string) (`fm_hash`), and a path (as a string) (`path`). - -`get_distances_np` should return a numpy array of the distances between the hash for `fm_hash` and every other image hash in `path`. - -For this question, use the dataset of hashed images found in `/depot/datamine/data/coco/hashed02/`. An example of a call to `get_distances_np` would look like the following. - -[source,python] ----- -from pathlib import Path -import imagehash -import numpy as np ----- - -[source,python] ----- -%%time - -hshs = get_distances_np("000000000008.jpg", "/depot/datamine/data/coco/hashed02/") -hshs.shape # (123387, 1) ----- - -How long does it take to run this function? - -Make plots and/or summary statistics to check out the distribution of the distances. How does it look? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -What do you think about the design of the `get_distances_np` function, considering that we are interested in pairwise Hamming distances? - -At its core, we essentially have a vector of 123k values. If we were to get the pairwise distances, the resulting distances would fill the upper triangle of a 123k by 123k matrix. This would be a _very large_ amount of data, considering it is just numeric data -- more than can easily fit in memory. - -In addition, the part of the function from question 2 that takes the majority of the run time is _not_ the numeric computations, but rather the opening and reading of the 123k hashes. Approximately 55 of the 65-70 seconds. With this in mind, let's back up, and break this problem down further. - -Write a code cell containing code that will read in all of the hashes into a `numpy` array of size (123387, 64). - -This array contains the hashes for each of the 123k images. Each row is the hash of an image. Let's call the resulting (123387, 64) array `hashes`. - -Given what we know, the following is a very fast function that will find the Hamming distances between a single image and all of the other images. - -[source,python] ----- -def hamming_distance(hash1, hash2): - return np.sum(~(hash1 == hash2), axis=1) ----- - -[source,python] ----- -%%time - -hamming_distance(hashes[0], hashes) ----- - -This runs in approximately 46 ms. This would be about 94-95 minutes if we did this calculation for each pair. - -Convert your `numpy` array into a `JAX` array, and create an equivalent function. How fast does this function run? 
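As a hedged sketch (not necessarily the required solution), a JIT-compiled JAX equivalent of `hamming_distance` could look like the following, assuming `hashes` holds the (123387, 64) hash bits as booleans. The stand-in array below is fabricated only so the snippet is self-contained.

[source,python]
----
import jax
import jax.numpy as jnp
import numpy as np

# Stand-in for the real (123387, 64) boolean hash array built earlier.
hashes = np.random.rand(123387, 64) > 0.5
hashes_jax = jnp.asarray(hashes)

@jax.jit
def hamming_distance_jax(hash1, hashes):
    # jnp equivalent of np.sum(~(hash1 == hash2), axis=1): count, per row,
    # the positions where the bits disagree.
    return jnp.sum(hash1 != hashes, axis=1)

# The first call triggers compilation and is slow; later calls are fast.
distances = hamming_distance_jax(hashes_jax[0], hashes_jax).block_until_ready()
print(distances.shape)  # (123387,)
----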
What would the approximate runtime be for the total calculation? - -[IMPORTANT] -==== -Remember to use `jax.jit` to speed up the function. Also recall that the first run of the compiled function will be _slow_ since it needs to be compiled. After that, future uses of the function will be faster. -==== - -Make sure to take into consideration the slower first run. What would the approximate total runtime be using the `JAX` function? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Don't worry, I'm not going to make you run these calculations. Instead, answer one of the following two questions. - -. Pick 2 images / image hashes and get the closest 3 images by Hamming distance for each. Note the distances and display the images. At those distances, can you perceive any sort of "closeness" in image? -. Randomly sample (using `JAX` methods) _n_ (more than 4, please) images and calculate all of the pairwise distances. Create a set of plots showing the distributions of distances. Explore the distances, and the dataset, and write 1-2 sentences about any interesting observations you are able to make, or 1-2 sentences on how you could use the information to do something cool. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project13.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project13.adoc deleted file mode 100644 index 9e7595705..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project13.adoc +++ /dev/null @@ -1,145 +0,0 @@ -= STAT 39000: Project 13 -- Spring 2022 - -**Motivation:** This year you've been exposed to a _lot_ of powerful (and maybe new for you) tools and concepts. It would be really impressive if you were able to retain all of it, but realistically that probably didn't happen. It takes lots of practice for these skills to develop. One common term you may hear thrown around is ETL. It stands for Extract, Transform, Load. You may or may not ever have to work with an ETL pipeline, however, it is a worthwhile exercise to plan one out. - -**Context:** This is the first of the final two projects where you will map out an ETL pipeline, and the remaining typical tasks of a full data science project, and execute. It wouldn't be practical to make this exhaustive, but the idea is to _think_ about and _plan out_ the various steps in a project and execute it the best you can given time and resource constraints. - -**Scope:** Python - -.Learning Objectives -**** -- Describe and plan out an ETL pipeline to solve a problem of interest. -- Create a flowchart mapping out the steps in the project. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -Create a problem statement for the project. What question are you interested in answering? 
What theory do you have that you'd like to show might be true? This could be _anything_. Some examples could be: - -- Should you draft running backs before wide receivers in fantasy football? -- Are news articles more "positive" or "negative" on nytimes.com vs. washingtonpost.com? -- Is the number of stars of an Amazon review useful for telling whether the review is fake or not? -- Are flight delays more likely to happen in the summer or winter? - -The question you want to answer can be as simple or complex as you want it to be. - -[IMPORTANT] -==== -When coming up with the problem statement, please take into consideration that in this project, and the next, we will ask you to utilize skills you were exposed to this year. Things like: SLURM, `joblib`, `pytorch`, `JAX`, docker/singularity, `fastapi`, sync/async, `pdoc`, `pytest`, etc. It is likely that you will want to use other skills from previous years as well. Things like: web scraping, writing scripts, data wrangling, SQL, etc. - -Try to think of a question that _could_ be solved by utilizing some of these skills. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Read about ETL pipelines https://en.wikipedia.org/wiki/Extract,_transform,_load[here]. Summarize each part of the pipeline (extract, transform, and load) in your own words. Follow this up by looking at the image at the top of https://r4ds.had.co.nz/introduction.html[this] section of "R for Data Science". Where do you think the ETL pipeline could be added to this workflow? Read about Dr. Wickham's definition of https://r4ds.had.co.nz/tidy-data.html[tidy data]. After reading about his definition, do you think the "Tidy" step in the chart is potentially different from the "transform" step in the ETL pipeline? - -[NOTE] -==== -There is no correct answer to this question. Just think about the question and describe what you think. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Flowcharts are an incredibly useful tool that can help you visualize and plan a project from start to end. Flowcharts can help you realize what parts of the project you are not clear on, which could save a lot of work during implementation. Read about the various flowchart shapes https://www.rff.com/flowchart_shapes.php[here], and plan out your ETL pipeline and the remaining project workflow using https://www.draw.io/index.html[this] free online tool. xref:book:projects:templates.adoc#including-an-image-in-your-notebook[Include the image] of your flowchart in your notebook. - -[NOTE] -==== -You are not required to follow this flowchart exactly. You will have an opportunity to point out any changes you ended up making to your project flow later on. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -There will more or less be a few "major" steps in your project: - -- **Extract:** scrape, database queries, find and download data files, etc. -- **Transform:** data wrangling using `pandas`, `tidyverse`, `JAX`, `numpy`, etc. -- **Load:** load data into a database or a file that represents your "data warehouse". -- **Import/tidy:** Grab data from your "data warehouse" and tidy it if necessary. -- **Iterate:** Modify/visualize/model your data. -- **Communicate:** Share your deliverable(s). - -[NOTE] -==== -Of course, you don't _need_ to include all of these steps.
Any well-planned approach will receive full credit. -==== - -This can be further boiled down to just a few steps: - -- Data collection/cleaning. -- Analysis/modeling/visualization. -- Report. - -Implement your data collection/cleaning step. Be sure to submit any relevant files and code (e.g. python script(s), R script(s), simply some code cells in a Jupyter Notebook, etc.) in your submission. - -To get full credit, simply choose at least 2 of the following skills to incorporate into this step (or these steps): - -- https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html[Google style docstrings], or https://style.tidyverse.org/documentation.html[tidyverse style comments] if utilizing R. -- Singularity/docker (if, for example, you wanted to use a container image to run your code repeatably). -- sync/async code (if, for example, you wanted to speed up code that has a lot of I/O). -- `joblib` (if, for example, you wanted to speed up the scraping of many files). -- `SLURM` (if, for example, you wanted to speed up the scraping of many files). -- `requests`/`selenium` (if, for example, you need to scrape data as a part of your collection process). -- If you choose to use `sqlite` as your intermediate "data warehouse" (instead of something easier like a csv or parquet file), this will count as a skill. -- If you use `argparse` and build a functioning Python script, this will count as a skill. -- If you write `pytest` tests for your code, this will count as a skill. - -[IMPORTANT] -==== -Make sure to include a screenshot or two actually _using_ your deliverable(s) in your notebook (for example, if it was a script, show some screenshots of your terminal running the code). In addition, make sure to clearly indicate which of the "skills" you chose to use for this step. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -If you read about ETL pipelines, you are probably not exactly sure what a "data warehouse" is. Browse the internet and read about data warehouses. In your own words, summarize what a data warehouse is, and the typical components. - -Here are some common data warehouse products: - -- Snowflake -- Google BigQuery -- Amazon Redshift -- Apache Hive -- Databricks Lakehouse Platform - -Choose a product to read about and describe 2-3 things that it looks like the product can do, and explain why (or when) you think that functionality would be useful. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project14.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project14.adoc deleted file mode 100644 index 574136f56..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-project14.adoc +++ /dev/null @@ -1,112 +0,0 @@ -= STAT 39000: Project 14 -- Spring 2022 - -**Motivation:** This year you've been exposed to a _lot_ of powerful (and maybe new for you) tools and concepts. 
It would be really impressive if you were able to retain all of it, but realistically that probably didn't happen. It takes lots of practice for these skills to develop. One common term you may hear thrown around is ETL. It stands for Extract, Transform, Load. You may or may not ever have to work with an ETL pipeline, however, it is a worthwhile exercise to plan one out. - -**Context:** This is the first of the final two projects where you will map out an ETL pipeline, and the remaining typical tasks of a full data science project, and execute. It wouldn't be practical to make this exhaustive, but the idea is to _think_ about and _plan out_ the various steps in a project and execute it the best you can given time and resource constraints. - -**Scope:** Python - -.Learning Objectives -**** -- Describe and plan out an ETL pipeline to solve a problem of interest. -- Create a flowchart mapping out the steps in the project. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -[WARNING] -==== -If you skipped project 13, please go back and complete project 13 and submit it as your project 14 submission. Please make a bold note at the top of your submission "This is my project 14, but is really project 13", so graders know what to expect. Thanks! -==== - -In the previous project, you _probably_ spent most of the time reading about ETL, flowcharts, data warehouses, and planning out your project. The more you have things planned out the less amount of time it will likely take to implement. Your project probably looks something like this now. - -* [x] Data collection/cleaning. -* [ ] Analysis/modeling/visualization. -* [ ] Report. - -In this project, you will complete those last two steps. - -Import data from your "data warehouse" and perform an analysis to answer the problem statement you created in the previous project. Your analysis should contain: - -- 1 or more data visualizations. -- 1 or more sets of summary data (think `.describe()` from `pandas` or `summary`/`prop.table` from R). - -[NOTE] -==== -Feel free to utilize the `transformers` package and the wide variety of pre-built models provided at https://huggingface.co/models[huggingface]. -==== - -Alternatively, you can build an API and/or dashboard using `fastapi` (or any other framework like `django`, `flask`, `shiny`, etc.). Simply make sure to include your code and screenshots of you utilizing the API or using the dashboard. - -For _either_ of the options above (summary data/visualizations or API/dashboard), in order to get full credit, simply choose at least 2 of the following skills to incorporate inot this step (or these steps): - -- https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html[Google style docstrings], or https://style.tidyverse.org/documentation.html[tidyverse style comments] if utilizing R. -- Singularity/docker (if, for example, you wanted to use a container image to run your code repeatably, or run your API/dashboard). -- sync/async code (if, for example, you wanted to speed up code that has a lot of I/O). -- `joblib` (if, for example, you wanted to speed up a parallelizable task or computation). -- `SLURM` (if, for example, you wanted to speed up a parallelizable task or computation). -- If you use `argparse` and build a functioning Python script, this will count as a skill. 
-- If you write `pytest` tests for your code, this will count as a skill. -- Use `JAX` (for example `jax.jit`) or `pytorch` for some numeric computation. - -[IMPORTANT] -==== -Make sure to include a screenshot or two actually _using_ your deliverable(s) in your notebook (for example, if it was a script, show some screenshots of your terminal running the code). In addition, make sure to clearly indicate which of the "skills" you chose to use for this step. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -The final task, to create your deliverable to communicate your results, is probably the most important part of a typical project. It is important that people understand what you did, why it is important for answering your question, and why it provides value. Learning how to make a good slide deck is a really useful skill! - -In our case, it makes more sense to have a Jupyter Notebook, since those are easy to read in Gradescope, and get the point across. - -After your question 1 results are entered in your Jupyter Notebook, under the "Question 2" heading, create your deliverable. Use markdown cells to beautifully format the information you want to present. Include everything starting with your problem statement, leading all the way up to your conclusions (even if just anecdotal conclusions). Include code, graphics, and screenshots that are important to the story. Of course, you don't need to include code from scripts (in the notebook -- we _do_ want all scripts from question 1 (if any) included in your submission), but you can mention that you had a script called `my_script.py` that did X, Y, and Z. - -The goal of this deliverable is that an outsider could read your notebook (starting from question 2) and understand what question you had, what you did (and why), and what were the results. Any good effort will receive full credit. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -In the previous project, you were asked to create a flow chart to describe the steps in your system/project. As you began implementing things, you may or may not have changed your original plan. If you did, update your flowchart and include it in your notebook. Otherwise, include your old flow chart and explain that you didn't change anything. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -It has been a fun year. We hope that you learned something new! - -- Write 3 (or more) of your least favorite topics and/or projects from this past year (for STAT 39000). -- Write 3 (or more) of your most favorite projects/topics, and/or 3 topics you wish you were able to learn _more_ about. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-projects.adoc b/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-projects.adoc deleted file mode 100644 index 704629bdb..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/39000/39000-s2022-projects.adoc +++ /dev/null @@ -1,41 +0,0 @@ -= STAT 39000 - -== Project links - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$39000-s2022-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -=== Sign up - -https://piazza.com/purdue/fall2021/stat39000[https://piazza.com/purdue/fall2021/stat39000] - -=== Link - -https://piazza.com/purdue/fall2021/stat39000/home[https://piazza.com/purdue/fall2021/stat39000/home] - -== Syllabus - -See xref:spring2022/logistics/s2022-syllabus.adoc[here]. diff --git a/projects-appendix/modules/ROOT/pages/spring2022/logistics/19000-s2022-officehours.adoc b/projects-appendix/modules/ROOT/pages/spring2022/logistics/19000-s2022-officehours.adoc deleted file mode 100644 index 310ec24db..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/logistics/19000-s2022-officehours.adoc +++ /dev/null @@ -1,444 +0,0 @@ -= STAT 19000 Office Hours for Spring 2022 - -It might be helpful to also have the office hours for STAT 29000 and STAT 39000: - -xref:29000-s2022-officehours.adoc[STAT 29000 Office Hours for Spring 2022] - -xref:39000-s2022-officehours.adoc[STAT 39000 Office Hours for Spring 2022] - -and it might be helpful to look at the -xref:officehours.adoc[general office hours policies]. 
- -The STAT 19000 office hours and WebEx addresses are the following: - -Webex addresses for TAs, Dr Ward, and Kevin Amstutz - -[cols="2,1,4"] -|=== -|TA Name |Class |Webex chat room URL - -|Dr Ward (seminars) -|all -|https://purdue.webex.com/meet/mdw - -|Kevin Amstutz -|all -|https://purdue.webex.com/meet/kamstut - -|Melissa Cai Shi -|19000 -|https://purdue.webex.com/meet/mcaishi - -|Nihar Chintamaneni -|19000 -|https://purdue-student.webex.com/meet/chintamn - -|Sumeeth Guda -|19000 -|https://purdue-student.webex.com/meet/sguda - -|Jonah Hu -|19000 -|https://purdue-student.webex.com/meet/hu625 - -|Darren Iyer -|19000 -|https://purdue-student.webex.com/meet/iyerd - -|Pramey Kabra -|19000 -|https://purdue-student.webex.com/meet/kabrap - -|Ishika Kamchetty -|19000 -|https://purdue-student.webex.com/meet/ikamchet - -|Jackson Karshen -|19000 -|https://purdue-student.webex.com/meet/jkarshe - -|Bhargavi Katuru -|19000 -|https://purdue-student.webex.com/meet/bkaturu - -|Michael Kruse -|19000 -|https://purdue-student.webex.com/meet/kruseml - -|Ankush Maheshwari -|19000 -|https://purdue-student.webex.com/meet/mahesh20 - -|Hyeong Park -|19000 -|https://purdue-student.webex.com/meet/park1119 - -|Vandana Prabhu -|19000 -|https://purdue-student.webex.com/meet/prabhu11 - -|Meenu Ramakrishnan -|19000 -|https://purdue-student.webex.com/meet/ramakr20 - -|Rthvik Raviprakash -|19000 -|https://purdue-student.webex.com/meet/rravipra - -|Chintan Sawla -|19000 -|https://purdue-student.webex.com/meet/csawla - -|Mridhula Srinivasa -|19000 -|https://purdue-student.webex.com/meet/sriniv99 - -|Tanya Uppal -|19000 -|https://purdue-student.webex.com/meet/tuppal - -|Keerthana Vegesna -|19000 -|https://purdue-student.webex.com/meet/vvegesna - -|Maddie Woodrow -|19000 -|https://purdue-student.webex.com/meet/mwoodrow - -|Adrienne Zhang -|19000 -|https://purdue-student.webex.com/meet/zhan4000 -|=== - -[cols="1,1,1,1,1,1,1"] -|=== -|Time (ET) |Sunday |Monday |Tuesday |Wednesday |Thursday |Friday - -|8:30 AM - 9:00 AM -| -.2+|Seminar: **Dr Ward**, Melissa Cai Shi, Bhargavi Katuru -|Michael Kruse -| -| -|Hyeong Kyun Park - - -|9:00 AM - 9:30 AM -| -|Michael Kruse -| -| -|Hyeong Kyun Park - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|9:30 AM - 10:00 AM -| -.2+|Seminar: **Dr Ward**, Melissa Cai Shi, Bhargavi Katuru, Michael Kruse -|Michael Kruse -|Melissa Cai Shi -|Michael Kruse -|Hyeong Kyun Park - -|10:00 AM - 10:30 AM -| -| -|Melissa Cai Shi, Chintan Sawla -|Michael Kruse, Maddie Woodrow -|Chintan Sawla, Hyeong Kyun Park - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|10:30 AM - 11:00 AM -| -.2+|Seminar: **Dr Ward**, Melissa Cai Shi, Bhargavi Katuru, Jonah Hu, Maddie Woodrow -| -|Melissa Cai Shi, Chintan Sawla -|Michael Kruse, Maddie Woodrow -|Chintan Sawla, Hyeong Kyun Park - -|11:00 AM - 11:30 AM -| -| -|Melissa Cai Shi, Chintan Sawla -|Michael Kruse, Maddie Woodrow -|Hyeong Kyun Park, Chintan Sawla - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|11:30 AM - 12:00 PM -| -|Tanya Uppal -|Chintan Sawla -|Chintan Sawla -|Maddie Woodrow, Bhargavi Katuru -|Chintan Sawla - -|12:00 PM - 12:30 PM -| -|Tanya Uppal -|Adrienne Zhang -|Vandana Prabhu -|Ishika Kamchetty, Jackson Karshen -|Vandana Prabhu, Bhargavi Katuru - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|12:30 PM - 1:00 PM -| -|Pramey Kabra 
-|Adrienne Zhang -|Vandana Prabhu, Pramey Kabra -|Ishika Kamchetty, Jackson Karshen -|Vandana Prabhu, Nihar Chintamaneni - -|1:00 PM - 1:30 PM -| -|Pramey Kabra -|Adrienne Zhang -|Vandana Prabhu, Pramey Kabra -|Ishika Kamchetty, Jackson Karshen -|Vandana Prabhu, Nihar Chintamaneni - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|1:30 PM - 2:00 PM -| -|Pramey Kabra -| -|Vandana Prabhu, Pramey Kabra -|Ishika Kamchetty, Jackson Karshen -|Nihar Chintamaneni - -|2:00 PM - 2:30 PM -| -|Pramey Kabra -| -|Pramey Kabra, Nihar Chintamaneni -|Ishika Kamchetty, Jackson Karshen -|Nihar Chintamaneni - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|2:30 PM - 3:00 PM -| -|Pramey Kabra -|Mridhula Srinivasan -|Nihar Chintamaneni, Mridhula Srinivasan -|Ishika Kamchetty, Jackson Karshen -|Nihar Chintamaneni, Maddie Woodrow - -|3:00 PM - 3:30 PM -| -|Mridhula Srinivasan -|Mridhula Srinivasan -|Mridhula Srinivasan -|Maddie Woodrow, Tanya Uppal -|Maddie Woodrow, Bhargavi Katuru - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|3:30 PM - 4:00 PM -| -|Mridhula Srinivasan -|Adrienne Zhang -|Mridhula Srinivasan -|Maddie Woodrow, Tanya Uppal -|Maddie Woodrow, Bhargavi Katuru - -|4:00 PM - 4:30 PM -| -|Mridhula Srinivasan -|Adrienne Zhang -|Mridhula Srinivasan, Tanya Uppal -|Tanya Uppal -|Maddie Woodrow, Bhargavi Katuru - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|4:30 PM - 5:00 PM -| -.2+|Seminar: **Dr Ward**, Jackson Karshen, Ishika Kamchetty, Mridhula Srinivasan -|Adrienne Zhang -|Mridhula Srinivasan, Tanya Uppal -|Tanya Uppal -|Bhargavi Katuru - -|5:00 PM - 5:30 PM -| -|Ishika Kamchetty -|Rthvik Raviprakash, Jonah Hu -|Rthvik Raviprakash, Michael Kruse -|Pramey Kabra, Hyeong Kyun Park - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|5:30 PM - 6:00 PM -| -|Ishika Kamchetty -|Ishika Kamchetty -|Rthvik Raviprakash, Jonah Hu -|Rthvik Raviprakash, Michael Kruse -|Pramey Kabra, Hyeong Kyun Park - -|6:00 PM - 6:30 PM -| -|Keerthana Vegesna -|Ishika Kamchetty -|Rthvik Raviprakash, Jonah Hu -|Rthvik Raviprakash, Michael Kruse -|Vandana Prabhu, Pramey Kabra - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|6:30 PM - 7:00 PM -| -|Keerthana Vegesna -|Adrienne Zhang -|Rthvik Raviprakash, Jonah Hu -|Rthvik Raviprakash, Jackson Karshen -|Vandana Prabhu, Hyeong Kyun Park - -|7:00 PM - 7:30 PM -| -|Keerthana Vegesna -|Adrienne Zhang -|Rthvik Raviprakash, Jonah Hu -|Rthvik Raviprakash, Jackson Karshen -|Vandana Prabhu, Hyeong Kyun Park - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|7:30 PM - 8:00 PM -| -|Keerthana Vegesna -|Adrienne Zhang -|Rthvik Raviprakash, Keerthana Vegesna -|Rthvik Raviprakash, Jackson Karshen -|Nihar Chintamaneni, Tanya Uppal - -|8:00 PM - 8:30 PM -| -|Keerthana Vegesna -|Adrienne Zhang -|Keerthana Vegesna, Hyeong Kyun Park -|Jonah Hu, Jackson Karshen -|Nihar Chintamaneni, Tanya Uppal - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|8:30 PM - 9:00 PM -| -|Ankush Maheshwari -|Adrienne Zhang -|Ankush Maheshwari, Keerthana Vegesna -|Ankush Maheshwari, Jonah Hu -|Nihar Chintamaneni - -|9:00 PM - 9:30 PM -| -|Ankush Maheshwari -|Adrienne Zhang -|Ankush 
Maheshwari, Keerthana Vegesna -|Ankush Maheshwari, Jonah Hu -|Nihar Chintamaneni - -|**Time (ET)** -|**Sunday** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|9:30 PM - 10:00 PM -| -|Ankush Maheshwari -|Keerthana Vegesna -|Ankush Maheshwari, Keerthana Vegesna -|Jonah Hu, Ankush Maheshwari -|Nihar Chintamaneni - -|10:00 PM - 10:30 PM -| -|Ankush Maheshwari -|Keerthana Vegesna -|Ankush Maheshwari, Hyeong Kyun Park -|Jonah Hu, Ankush Maheshwari -| - -|=== - - diff --git a/projects-appendix/modules/ROOT/pages/spring2022/logistics/29000-s2022-officehours.adoc b/projects-appendix/modules/ROOT/pages/spring2022/logistics/29000-s2022-officehours.adoc deleted file mode 100644 index 48d4ff537..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/logistics/29000-s2022-officehours.adoc +++ /dev/null @@ -1,323 +0,0 @@ -= STAT 29000 and 39000 Office Hours for Fall 2021 - -[WARNING] -==== -All office hours will be held online on ##**Wednesday, February 2nd**## due to inclement weather conditions. Whether or not ##**Thursday, February 3rd**## office hours will be held online is to be determined. -==== - -It might be helpful to also have the office hours for STAT 19000: - -xref:19000-s2022-officehours.adoc[STAT 19000 Office Hours for Spring 2022] - -and it might be helpful to look at the -xref:officehours.adoc[general office hours policies]. - -The STAT 29000 and 39000 office hours and WebEx addresses are the following: - -Webex addresses for TAs, Dr Ward, and Kevin Amstutz - -[cols="2,1,4"] -|=== -|TA Name |Class |Webex chat room URL - -|Dr Ward (seminars) -|all -|https://purdue.webex.com/meet/mdw - -|Kevin Amstutz -|all -|https://purdue.webex.com/meet/kamstut - -|Sumeeth Guda -|29000 -|https://purdue-student.webex.com/meet/sguda - -|Darren Iyer -|29000 -|https://purdue-student.webex.com/meet/iyerd - -|Rishabh Rajesh -|29000 -|https://purdue-student.webex.com/meet/rajeshr - -|Nikhil D'Souza -|39000 -|https://purdue-student.webex.com/meet/dsouza13 -|=== - -[cols="1,1,1,1,1,1"] -|=== -|Time (ET) |Monday |Tuesday |Wednesday |Thursday |Friday - -|8:00 AM - 9:00 AM -| -| -| -| -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|8:30 AM - 9:00 AM -.2+|Seminar: **Dr Ward** -| -| -| -|Rishabh Rajesh - -|9:00 AM - 9:30 AM -| -| -| -|Rishabh Rajesh - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|9:30 AM - 10:00 AM -.2+|Seminar: **Dr Ward**, Sumeeth Guda -| -| -| -|Rishabh Rajesh - -|10:00 AM - 10:30 AM -| -| -| -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|10:30 AM - 11:00 AM -.2+|Seminar: **Dr Ward**, Sumeeth Guda -| -| -| -| - -|11:00 AM - 11:30 AM -| -| -| -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|11:30 AM - 12:00 PM -| -| -| -| -| - -|12:00 PM - 12:30 PM -| -| -|Sumeeth Guda -| -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|12:30 PM - 1:00 PM -| -| -|Sumeeth Guda -| -| - -|1:00 PM - 1:30 PM -| -| -|Sumeeth Guda -| -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|1:30 PM - 2:00 PM -| -| -|Sumeeth Guda -| -| - -|2:00 PM - 2:30 PM -| -|Darren Iyer -|Rishabh Rajesh -|Darren Iyer -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|2:30 PM - 3:00 PM -| -|Darren Iyer -|Rishabh Rajesh -|Darren Iyer -| - -|3:00 PM - 3:30 PM -| -| -|Rishabh Rajesh -| -| - 
-|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|3:30 PM - 4:00 PM -| -| -|Rishabh Rajesh -| -| - -|4:00 PM - 4:30 PM -| -| -|Rishabh Rajesh -|Sumeeth Guda -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|4:30 PM - 5:00 PM -.2+|Seminar: **Dr Ward** -| -|Rishabh Rajesh -|Sumeeth Guda -| - -|5:00 PM - 5:30 PM -| -| -|Sumeeth Guda -|Sumeeth Guda - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|5:30 PM - 6:00 PM -|Sumeeth Guda -| -| -| -|Sumeeth Guda - - -|6:00 PM - 6:30 PM -|Sumeeth Guda -| -|Darren Iyer -| -|Sumeeth Guda - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|6:30 PM - 7:00 PM -| -| -|Darren Iyer -| -|Darren Iyer - -|7:00 PM - 7:30 PM -| -| -|Darren Iyer -|Rishabh Rajesh -|Darren Iyer - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|7:30 PM - 8:00 PM -| -| -|Darren Iyer -|Rishabh Rajesh -|Darren Iyer - -|8:00 PM - 8:30 PM -| -| -| -|Rishabh Rajesh -|Darren Iyer - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|8:30 PM - 9:00 PM -| -| -| -|Rishabh Rajesh -| -|=== - - diff --git a/projects-appendix/modules/ROOT/pages/spring2022/logistics/39000-s2022-officehours.adoc b/projects-appendix/modules/ROOT/pages/spring2022/logistics/39000-s2022-officehours.adoc deleted file mode 100644 index eaa39d503..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/logistics/39000-s2022-officehours.adoc +++ /dev/null @@ -1,323 +0,0 @@ -= STAT 29000 and 39000 Office Hours for Spring 2022 - -[WARNING] -==== -All office hours will be held online on ##**Wednesday, February 2nd**## due to inclement weather conditions. Whether or not ##**Thursday, February 3rd**## office hours will be held online is to be determined. -==== - -It might be helpful to also have the office hours for STAT 19000: - -xref:19000-s2022-officehours.adoc[STAT 19000 Office Hours for Spring 2022] - -and it might be helpful to look at the -xref:officehours.adoc[general office hours policies]. 
- -The STAT 29000 and 39000 office hours and WebEx addresses are the following: - -Webex addresses for TAs, Dr Ward, and Kevin Amstutz - -[cols="2,1,4"] -|=== -|TA Name |Class |Webex chat room URL - -|Dr Ward (seminars) -|all -|https://purdue.webex.com/meet/mdw - -|Kevin Amstutz -|all -|https://purdue.webex.com/meet/kamstut - -|Sumeeth Guda -|29000 -|https://purdue-student.webex.com/meet/sguda - -|Darren Iyer -|29000 -|https://purdue-student.webex.com/meet/iyerd - -|Rishabh Rajesh -|29000 -|https://purdue-student.webex.com/meet/rajeshr - -|Nikhil D'Souza -|39000 -|https://purdue-student.webex.com/meet/dsouza13 -|=== - -[cols="1,1,1,1,1,1"] -|=== -|Time (ET) |Monday |Tuesday |Wednesday |Thursday |Friday - -|8:00 AM - 9:00 AM -| -| -| -| -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|8:30 AM - 9:00 AM -.2+|Seminar: **Dr Ward** -| -| -| -|Rishabh Rajesh - -|9:00 AM - 9:30 AM -| -| -| -|Rishabh Rajesh - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|9:30 AM - 10:00 AM -.2+|Seminar: **Dr Ward**, Sumeeth Guda -| -| -| -|Rishabh Rajesh - -|10:00 AM - 10:30 AM -| -| -| -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|10:30 AM - 11:00 AM -.2+|Seminar: **Dr Ward**, Sumeeth Guda -| -| -| -| - -|11:00 AM - 11:30 AM -| -| -| -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|11:30 AM - 12:00 PM -| -| -| -| -| - -|12:00 PM - 12:30 PM -| -| -|Sumeeth Guda -| -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|12:30 PM - 1:00 PM -| -| -|Sumeeth Guda -| -| - -|1:00 PM - 1:30 PM -| -| -|Sumeeth Guda -| -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|1:30 PM - 2:00 PM -| -| -|Sumeeth Guda -| -| - -|2:00 PM - 2:30 PM -| -|Darren Iyer -|Rishabh Rajesh -|Darren Iyer -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|2:30 PM - 3:00 PM -| -|Darren Iyer -|Rishabh Rajesh -|Darren Iyer -| - -|3:00 PM - 3:30 PM -| -| -|Rishabh Rajesh -| -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|3:30 PM - 4:00 PM -| -| -|Rishabh Rajesh -| -| - -|4:00 PM - 4:30 PM -| -| -|Rishabh Rajesh -|Sumeeth Guda -| - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|4:30 PM - 5:00 PM -.2+|Seminar: **Dr Ward** -| -|Rishabh Rajesh -|Sumeeth Guda -| - -|5:00 PM - 5:30 PM -| -| -|Sumeeth Guda -|Sumeeth Guda - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|5:30 PM - 6:00 PM -|Sumeeth Guda -| -| -| -|Sumeeth Guda - - -|6:00 PM - 6:30 PM -|Sumeeth Guda -| -|Darren Iyer -| -|Sumeeth Guda - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|6:30 PM - 7:00 PM -| -| -|Darren Iyer -| -|Darren Iyer - -|7:00 PM - 7:30 PM -| -| -|Darren Iyer -|Rishabh Rajesh -|Darren Iyer - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|7:30 PM - 8:00 PM -| -| -|Darren Iyer -|Rishabh Rajesh -|Darren Iyer - -|8:00 PM - 8:30 PM -| -| -| -|Rishabh Rajesh -|Darren Iyer - -|**Time (ET)** -|**Monday** -|**Tuesday** -|**Wednesday** -|**Thursday** -|**Friday** - -|8:30 PM - 9:00 PM -| -| -| -|Rishabh Rajesh -| -|=== - - diff --git a/projects-appendix/modules/ROOT/pages/spring2022/logistics/officehours.adoc 
b/projects-appendix/modules/ROOT/pages/spring2022/logistics/officehours.adoc deleted file mode 100644 index da044d98e..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/logistics/officehours.adoc +++ /dev/null @@ -1,35 +0,0 @@ -= STAT 19000/29000/39000 Office Hours - -[WARNING] -==== -All office hours will be held online on ##**Wednesday, February 2nd**## due to inclement weather conditions. Whether or not ##**Thursday, February 3rd**## office hours will be held online is to be determined. -==== - -xref:19000-s2022-officehours.adoc[STAT 19000 Office Hours for Spring 2022] - -xref:29000-s2022-officehours.adoc[STAT 29000 Office Hours for Spring 2022] - -xref:39000-s2022-officehours.adoc[STAT 39000 Office Hours for Spring 2022] - -[NOTE] -==== -**Office hours _during_ seminar:** Hillenbrand C141 -- the atrium inside the dining court + -**Office hours _outside_ of seminar, before 5:00 PM EST:** Hillenbrand Lobby C100 -- the lobby between the 2 sets of front entrances + -**Office hours _after_ 5:00 PM EST:** Online in Webex + -**Office hours on the _weekend_:** Online in Webex -==== - -== About the Office Hours in The Data Mine - -During Spring 2022, office hours will be in person in Hillenbrand Hall during popular on-campus hours, and online via Webex during later hours (starting at 5:00PM). Each TA holding an online office hour will have their own WebEx meeting setup, so students will need to click on the appropriate WebEx link to join office hours. In the meeting room, the student and the TA can share screens with each other and have vocal conversations, as well as typed chat conversations. You will need to use the computer audio feature, rather than calling in to the meeting. There is a WebEx app available for your phone, too, but it does not have as many features as the computer version. - -The priority is to have a well-staffed set of office hours that meets student traffic needs. **We aim to have office hours when students need them most.** - -Each online TA meeting will have a maximum of 7 other people able to join at one time. Students should enter the meeting room to ask their question, and when their question is answered, the student should leave the meeting room so that others can have a turn. Students are welcome to re-enter the meeting room when they have another question. If a TA meeting room is full, please wait a few minutes to try again, or try a different TA who has office hours at the same time. - -Students can also use Piazza to ask questions. The TAs will be monitoring Piazza during their office hours. TAs should try and help all students, regardless of course. If a TA is unable to help a student resolve an issue, the TA might help the student to identify an office hour with a TA that can help, or encourage the student to post in Piazza. - -The weekly projects are due on Friday evenings at 11:55 PM through Gradescope in Brightspace. All the seminar times are on Mondays. New projects are released on Thursdays, so students have 8 days to work on each project. - -All times listed are Purdue time (Eastern). - diff --git a/projects-appendix/modules/ROOT/pages/spring2022/logistics/s2022-schedule.adoc b/projects-appendix/modules/ROOT/pages/spring2022/logistics/s2022-schedule.adoc deleted file mode 100644 index e2c8b9d44..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/logistics/s2022-schedule.adoc +++ /dev/null @@ -1,124 +0,0 @@ -= Spring 2022 Course Schedule - Seminar - -Assignment due dates are listed in *BOLD*. Other dates are important notes. 
- -*Remember, only your top 10 out of 14 project scores are factored into your final grade. - -[cols="^.^1,^.^3,<.^15"] -|=== - -|*Week* |*Date* ^.|*Activity* - -|1 -|1/10 – 1/16 -|Monday, 1/10: First day of spring 2022 classes - - - -|2 -|1/17 - 1/23 -|Monday, 1/17: Dr. MLK Day - No classes - -*Project #1 due on Gradescope by 11:59 PM ET on Friday, 1/21* - -*Syllabus Quiz due on *Brightspace by 11:59 PM ET on Friday, 1/21* - -*Academic Integrity Quiz due on *Brightspace by 11:59 PM ET on Friday, 1/21* - - -|3 -|1/24 – 1/30 -| *Project #2 due on Gradescope by 11:59 PM ET on Friday, 1/28* - - - -|4 -|1/31 – 2/6 -| *Project #3 due on Gradescope by 11:59 PM ET on Friday, 2/4* - -*Outside Event #1 due on Gradescope by 11:59 PM ET on Friday, 2/4* - - -|5 -|2/7 – 2/13 -|*Project #4 due on Gradescope by 11:59 PM ET on Friday, 2/11* - - - -|6 -|2/14 – 2/20 -| *Project #5 due on Gradescope by 11:59 PM ET on Friday, 2/18* - - - - - -|7 -|2/21 – 2/27 -|*Project #6 due on Gradescope by 11:59 PM ET on Friday, 2/25* - - - -|8 -|2/28 – 3/6 -|*Project #7 due on Gradescope by 11:59 PM ET on Friday, 3/4* - -*Outside Event #2 due on Gradescope by 11:59 PM ET on Friday, 3/4* - -|9 -|3/7 – 3/13 -|*Project #8 due on Gradescope by 11:59 PM ET on Friday, 3/11* - - - -|10 -|3/14 - 3/20 -|Spring Break - No Classes - - -|11 -|3/21 – 3/27 -|*Project #9 due on Gradescope by 11:59 PM ET on Friday, 3/25* - -|12 -|3/28 – 4/3 -|*Project #10 due on Gradescope by 11:59 PM ET on Friday, 4/1* - - -|13 -|4/4 – 4/10 -|*Project #11 due on Gradescope by 11:59 PM ET on Friday, 4/8* - - -|14 -|4/11 – 4/17 -|*Project #12 due on Gradescope by 11:59 PM ET on Friday, 4/15* - -*Outside Event #3 due on Gradescope by 11:59 PM ET on Friday, 4/15* - - -|15 -|4/18 – 4/24 -|*Project #13 due on Gradescope by 11:59 PM ET on Friday, 4/22* - -|16 -|4/25 – 5/1 -|*Project #14 due on Gradescope by 11:59 PM ET on Friday, 4/29* - -Saturday, 4/30: Last day of spring 2022 classes. - - - - - -| -|5/2 – 5/7 -|Final Exam Week – There are no final exams in The Data Mine. - - -| -|5/10 -|Tuesday, 5/10: Spring 2022 grades are submitted to Registrar’s Office by 5 PM ET - - -|=== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2022/logistics/s2022-syllabus.adoc b/projects-appendix/modules/ROOT/pages/spring2022/logistics/s2022-syllabus.adoc deleted file mode 100644 index 043dc4a54..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2022/logistics/s2022-syllabus.adoc +++ /dev/null @@ -1,321 +0,0 @@ -= Spring 2022 Syllabus - The Data Mine Seminar - -== Course Information - - -[%header,format=csv,stripes=even] -|=== -Course Number and Title, CRN -STAT 19000 – The Data Mine II, CRNs vary -STAT 29000 – The Data Mine IV, CRNs vary -STAT 39000 – The Data Mine VI, CRNs vary -|=== - -*Course credit hours:* 1 credit hour, so you should expect to spend about 3 hours per week doing work -for the class - -*Prerequisites:* -None for STAT 19000. All students, regardless of background are welcome. Typically, students new to The Data Mine sign up for STAT 19000, students in their second year of The Data Mine sign up for STAT 29000, and students in their third year of The Data Mine sign up for STAT 39000. However, during the first week of the semester only, if a student new to The Data Mine has several years of coding and/or data science experience and would prefer to switch to STAT 29000, we can make adjustments. A placement exam was offered before the semester started, too. 
- -=== Course Web Pages - -- link:https://the-examples-book.com/[*The Examples Book*] - All information will be posted within -- link:https://www.gradescope.com/[*Gradescope*] - All projects and outside events will be submitted on Gradescope -- link:https://purdue.brightspace.com/[*Brightspace*] - Grades will be posted in Brightspace. Stduents will also take the quizzes at the beginning of the semester on Brightspace -- link:https://datamine.purdue.edu[*The Data Mine’s website*] - helpful resource -- link:https://ondemand.brown.rcac.purdue.edu/[*Jupyter Lab via the On Demand Gateway on Brown*] - -=== Meeting Times -There are officially 4 Monday class times: 8:30 am, 9:30 am, 10:30 am (all in the Hillenbrand Dining Court atrium—no meal swipe required), and 4:30 pm (link:https://purdue.webex.com/meet/mdw[synchronous online], recorded and posted later). All the information you need to work on the projects each week will be provided online on the Thursday of the previous week, and we encourage you to get a head start on the projects before class time. Dr. Ward does not lecture during the class meetings, but this is a good time to ask questions and get help from Dr. Ward, the T.A.s, and your classmates. Attendance is not required. The T.A.s will have many daytime and evening office hours throughout the week. - -=== Course Description - -The Data Mine is a supportive environment for students in any major from any background who want to learn some data science skills. Students will have hands-on experience with computational tools for representing, extracting, manipulating, interpreting, transforming, and visualizing data, especially big data sets, and in effectively communicating insights about data. Topics include: the R environment, Python, visualizing data, UNIX, bash, regular expressions, SQL, XML and scraping data from the internet, as well as selected advanced topics, as time permits. - -=== Learning Outcomes - -By the end of the course, you will be able to: - -1. Discover data science and professional development opportunities in order to prepare for a career. -2. Explain the difference between research computing and basic personal computing data science capabilities in order to know which system is appropriate for a data science project. -3. Design efficient search strategies in order to acquire new data science skills. -4. Devise the most appropriate data science strategy in order to answer a research question. -5. Apply data science techniques in order to answer a research question about a big data set. - - - -=== Required Materials - -• A laptop so that you can easily work with others. Having audio/video capabilities is useful. -• Brightspace course page. -• Access to Jupyter Lab at the On Demand Gateway on Brown: -https://ondemand.brown.rcac.purdue.edu/ -• “The Examples Book”: https://the-examples-book.com -• Good internet connection. - - - -=== Attendance Policy - -While everything we are doing in The Data Mine this semester can be done online, rather than in person, and no part of your seminar grade comes from attendance, we want to remind you of general campus attendance policies during COVID-19. Students should stay home and contact the Protect Purdue Health Center (496-INFO) if they feel ill, have any symptoms associated with COVID-19, or suspect they have been exposed to the virus. 
In the current context of COVID-19, in-person attendance will not be a factor in the final grades, but the student still needs to inform the instructor of any conflict that can be anticipated and will affect the submission of an assignment. Only the instructor can excuse a student from a course requirement or responsibility. When conflicts can be anticipated, such as for many University-sponsored activities and religious observations, the student should inform the instructor of the situation as far in advance as possible. For unanticipated or emergency conflict, when advance notification to an instructor is not possible, the student should contact the instructor as soon as possible by email or by phone. When the student is unable to make direct contact with the instructor and is unable to leave word with the instructor’s department because of circumstances beyond the student’s control, and in cases of bereavement, quarantine, or isolation, the student or the student’s representative should contact the Office of the Dean of Students via email or phone at 765-494-1747. Below are links on Attendance and Grief Absence policies under the University Policies menu. - - -== Information about the Instructors - -=== The Data Mine Staff - -[%header,format=csv] -|=== -Name, Title, Email -Shared email we all read, , datamine@purdue.edu -Kevin Amstutz, Senior Data Scientist and Instruction Specialist, kamstut@purdue.edu -Jamie Baker, Senior Administrative Assistant, jcbaker@purdue.edu -Maggie Betz, Managing Director of Corporate Partnerships, betz@purdue.edu -Jarai Carter, Limited Term Lecturer, -Shuennhau Chang, Corporate Partners Senior Manager, chang@purdue.edu -Nicole Finley, Operations Manager, kingman@purdue.edu -David Glass, Senior Data Scientist, dglass@purdue.edu -Heather Goodwin, Corporate Partners Senior Manager, hgoodwi@purdue.edu -Norma Grubb, Limited Term Lecturer, -Dave Kotterman, Managing Director, dkotter@purdue.edu -Rebecca Sharples, Managing Director of Academic Programs & Outreach, rebecca@purdue.edu -Dr. Mark Daniel Ward, Director, mdw@purdue.edu - -|=== - - -*For the purposes of getting help with this 1-credit seminar class, your most important people are:* - -• *T.A.s:* Visit their xref:spring2022/logistics/officehours.adoc[office hours] and use the link:https://piazza.com/[Piazza site] -• *Mr. Kevin Amstutz*, Senior Data Scientist and Instruction Specialist - Piazza is preferred method of questions -• *Dr. Mark Daniel Ward*, Director: Dr. Ward responds to questions on Piazza faster than by email - - -=== Communication Guidance - -• *For questions about how to do the homework, use Piazza or visit office hours*. You will receive the fastest email by using Piazza versus emailing us. -• For general Data Mine questions, email datamine@purdue.edu -• For regrade requests, use Gradescope’s regrade feature within Brightspace. Regrades should be -requested within 1 week of the grade being posted. - - -=== Office Hours - -The xref:spring2022/logistics/officehours.adoc[office hours schedule is posted here.] - -Office hours are held in person in Hillenbrand lobby and on WebEx. Check the schedule to see the available schedule. - -Piazza is an online discussion board where students can post questions at any time, and Data Mine staff or T.A.s will respond. Piazza is available through Brightspace. There are private and public postings. Last year we had over 11,000 interactions on Piazza, and the typical response time was around 5-10 minutes. 
- - -== Assignments and Grades - - -=== Course Schedule & Due Dates - -xref:spring2022/logistics/s2022-schedule.adoc[Click here to view the Spring 2022 Course Schedule] - -See the schedule and later parts of the syllabus for more details, but here is an overview of how the course works: - -In the first week of the semester, you will have some “housekeeping” tasks to do, which include taking the Syllabus quiz and Academic Integrity quiz. - -Generally, every week from the very beginning of the semester, you will have your new projects released on a Thursday, and they are due 8 days later on the Friday at 11:55 pm Purdue West Lafayette (Eastern) time. You will need to do 3 Outside Event reflections. - -We will have 14 weekly projects available, but we only count your best 10. This means you could miss up to 4 projects due to illness or other reasons, and it won’t hurt your grade. We suggest trying to do as many projects as possible so that you can keep up with the material. The projects are much less stressful if they aren’t done at the last minute, and it is possible that our systems will be stressed if you wait until Friday night, causing unexpected behavior and long wait times. Try to start your projects on or before Monday each week to leave yourself time to ask questions. - -[cols="4,1"] -|=== - -|Projects (best 10 out of Projects #1-14) |86% -|Outside event reflections (3 total) |12% -|Academic Integrity Quiz |1% -|Syllabus Quiz |1% -|*Total* |*100%* - -|=== - - - - -=== Grading Scale -In this class grades reflect your achievement throughout the semester in the various course components listed above. Your grades will be maintained in Brightspace. This course will follow the 90-80-70-60 grading scale for A, B, C, D cut-offs. If you earn a 90.000 in the class, for example, that is a solid A. +/- grades will be given at the instructor’s discretion below these cut-offs. If you earn an 89.11 in the class, for example, this may be an A- or a B+. - -* A: 100.000% – 90.000% -* B: 89.999% – 80.000% -* C: 79.999% – 70.000% -* D: 69.999% – 60.000% -* F: 59.999% – 0.000% - - - -=== Late Policy - -We generally do NOT accept late work. For the projects, we count only your best 10 out of 14, so that gives you a lot of flexibility. We need to be able to post answer keys for the rest of the class in a timely manner, and we can’t do this if we are waiting for other students to turn their work in. - - -=== Projects - -• The projects will help you achieve Learning Outcomes #2-5. -• Each weekly programming project is worth 10 points. -• There will be 14 projects available over the semester, and your best 10 will count. -• The 4 project grades that are dropped could be from illnesses, absences, travel, family -emergencies, or simply low scores. No excuses necessary. -• No late work will be accepted, even if you are having technical difficulties, so do not work at the -last minute. -• There are many opportunities to get help throughout the week, either through Piazza or office -hours. We’re waiting for you! Ask questions! -• Follow the instructions for how to submit your projects properly through Gradescope in -Brightspace. -• It is ok to get help from others or online, although it is important to document this help in the -comment sections of your project submission. You need to say who helped you and how they -helped you.
-• Each week, the project will be posted on the Thursday before the seminar, the project will be -the topic of the seminar and any office hours that week, and then the project will be due by -11:55 pm Eastern time on the following Friday. See the schedule for specific dates. -• If you need to request a regrade on any part of your project, use the regrade request feature -inside Gradescope. The regrade request needs to be submitted within one week of the grade being posted (we send an announcement about this). - - -=== Outside Event Reflections - -• The Outside Event reflections will help you achieve Learning Outcome #1. They are an opportunity for you to learn more about data science applications, career development, and diversity. -• Throughout the semester, The Data Mine will have many special events and speakers, typically happening in person so you can interact with the presenter, but some may be online and possibly recorded. -• These eligible opportunities will be posted on The Data Mine’s website (https://datamine.purdue.edu/events/) and updated frequently. Feel free to suggest good events that you hear about, too. -• You are required to attend 3 of these over the semester, with 1 due each month. See the schedule for specific due dates. -• You are welcome to do all 3 reflections early. For example, you could submit all 3 reflections in September. -• You must submit your outside event reflection within 1 week of attending the event or watching the recording. -• Follow the instructions on Brightspace for writing and submitting these reflections. -• At least one of these events should be on the topic of Professional Development. These -events will be designated by “PD” next to the event on the schedule. -• For each of the 3 required events, write a minimum 1-page (double-spaced, 12-pt font) reflection that includes the name of the event and speaker, the time and date of the event, what was discussed at the event, what you learned from it, what new ideas you would like to explore as a result of what you learned at the event, and what question(s) you would like to ask the presenter if you met them at an after-presentation reception. This should not be just a list of notes you took from the event—it is a reflection. The header of your reflection should not take up more than 2 lines! -• We read every single reflection! We care about what you write! We have used these connections to provide new opportunities for you, to thank our speakers, and to learn more about what interests you. - - - -== How to succeed in this course - -If you would like to be a successful Data Mine student: - -• Be excited to challenge yourself and learn impressive new skills. Don’t get discouraged if something is difficult—you’re here because you want to learn, not because you already know everything! -• Start on the weekly projects on or before Mondays so that you have plenty of time to get help from your classmates, TAs, and Data Mine staff. Don’t wait until the due date to start! -• Remember that Data Mine staff and TAs are excited to work with you! Take advantage of us as resources. -• Network! Get to know your classmates, even if you don’t see them in an actual classroom. You are all part of The Data Mine because you share interests and goals. You have over 800 potential new friends! -• Use “The Examples Book” with lots of explanations and examples to get you started. Google, Stack Overflow, etc. are all great, but “The Examples Book” has been carefully put together to be the most useful to you. 
https://the-examples-book.com -• Expect to spend approximately 3 hours per week on the projects. Some might take less time, and occasionally some might take more. -• Don’t forget about the syllabus quiz, academic integrity quiz, and outside event reflections. They all contribute to your grade and are part of the course for a reason. -• If you get behind or feel overwhelmed about this course or anything else, please talk to us! -• Stay on top of deadlines. Announcements will also be sent out every Monday morning, but you -should keep a copy of the course schedule where you see it easily. -• Read your emails! - - - -== Purdue Policies & Resources - -=== Academic Guidance in the Event a Student is Quarantined/Isolated - -If you must miss class at any point in time during the semester, please reach out to me via email so that we can communicate about how you can maintain your academic progress. If you find yourself too sick to progress in the course, notify your adviser and notify me via email or Brightspace. We will make arrangements based on your particular situation. Please note that, according to link:https://protect.purdue.edu/updates/purdue-announces-additional-details-for-students-on-normal-operations-for-fall-2021/[Details for Students on Normal Operations for Fall 2021] announced on the Protect Purdue website, “individuals who test positive for COVID-19 are not guaranteed remote access to all course activities, materials, and assignments.” - -=== Class Behavior - -You are expected to behave in a way that promotes a welcoming, inclusive, productive learning environment. You need to be prepared for your individual and group work each week, and you need to include everybody in your group in any discussions. Respond promptly to all communications and show up for any appointments that are scheduled. If your group is having trouble working well together, try hard to talk through the difficulties—this is an important skill to have for future professional experiences. If you are still having difficulties, ask The Data Mine staff to meet with your group. - -=== Academic Integrity - -Academic integrity is one of the highest values that Purdue University holds. Individuals are encouraged to alert university officials to potential breaches of this value by either link:mailto:integrity@purdue.edu[emailing] or by calling 765-494-8778. While information may be submitted anonymously, the more information that is submitted provides the greatest opportunity for the university to investigate the concern. - -In STAT 19000/29000/39000, we encourage students to work together. However, there is a difference between good collaboration and academic misconduct. We expect you to read over this list, and you will be held responsible for violating these rules. We are serious about protecting the hard-working students in this course. We want a grade for STAT 19000/29000/39000 to have value for everyone and to represent what you truly know. We may punish both the student who cheats and the student who allows or enables another student to cheat. Punishment could include receiving a 0 on a project, receiving an F for the course, and/or being reported to the Office of The Dean of Students. - -*Good Collaboration:* - -• First try the project yourself, on your own. -• After trying the project yourself, then get together with a small group of other students who -have also tried the project themselves to discuss ideas for how to do the more difficult problems. 
Document in the comments section any suggestions you took from your classmates or your TA. -• Finish the project on your own so that what you turn in truly represents your own understanding of the material. -• Look up potential solutions for how to do part of the project online, but document in the comments section where you found the information. -• If the assignment involves writing a long, worded explanation, you may proofread somebody’s completed written work and allow them to proofread your work. Do this only after you have both completed your own assignments, though. - -*Academic Misconduct:* - -• Divide up the problems among a group. (You do #1, I’ll do #2, and he’ll do #3: then we’ll share our work to get the assignment done more quickly.) -• Attend a group work session without having first worked all of the problems yourself. -• Allowing your partners to do all of the work while you copy answers down, or allowing an -unprepared partner to copy your answers. -• Letting another student copy your work or doing the work for them. -• Sharing files or typing on somebody else’s computer or in their computing account. -• Getting help from a classmate or a TA without documenting that help in the comments section. -• Looking up a potential solution online without documenting that help in the comments section. -• Reading someone else’s answers before you have completed your work. -• Have a tutor or TA work though all (or some) of your problems for you. -• Uploading, downloading, or using old course materials from Course Hero, Chegg, or similar sites. -• Using the same outside event reflection (or parts of it) more than once. Using an outside event reflection from a previous semester. -• Using somebody else’s outside event reflection rather than attending the event yourself. - -The link:https://www.purdue.edu/odos/osrr/honor-pledge/about.html[Purdue Honor Pledge] “As a boilermaker pursuing academic excellence, I pledge to be honest and true in all that I do. Accountable together - we are Purdue" - -Please refer to the link:https://www.purdue.edu/odos/osrr/academic-integrity/index.html[student guide for academic integrity] for more details. - - -*Purdue’s Copyrighted Materials Policy:* - -Among the materials that may be protected by copyright law are the lectures, notes, and other material presented in class or as part of the course. Always assume the materials presented by an instructor are protected by copyright unless the instructor has stated otherwise. Students enrolled in, and authorized visitors to, Purdue University courses are permitted to take notes, which they may use for individual/group study or for other non-commercial purposes reasonably arising from enrollment in the course or the University generally. -Notes taken in class are, however, generally considered to be “derivative works” of the instructor’s presentations and materials, and they are thus subject to the instructor’s copyright in such presentations and materials. No individual is permitted to sell or otherwise barter notes, either to other students or to any commercial concern, for a course without the express written permission of the course instructor. To obtain permission to sell or barter notes, the individual wishing to sell or barter the notes must be registered in the course or must be an approved visitor to the class. Course instructors may choose to grant or not grant such permission at their own discretion, and may require a review of the notes prior to their being sold or bartered. 
If they do grant such permission, they may revoke it at any time, if they so choose. - -=== Nondiscrimination Statement -Purdue University is committed to maintaining a community which recognizes and values the inherent worth and dignity of every person; fosters tolerance, sensitivity, understanding, and mutual respect among its members; and encourages each individual to strive to reach his or her own potential. In pursuit of its goal of academic excellence, the University seeks to develop and nurture diversity. The University believes that diversity among its many members strengthens the institution, stimulates creativity, promotes the exchange of ideas, and enriches campus life. link:https://www.purdue.edu/purdue/ea_eou_statement.php[Link to Purdue’s nondiscrimination policy statement.] - -=== Students with Disabilities -Purdue University strives to make learning experiences as accessible as possible. If you anticipate or experience physical or academic barriers based on disability, you are welcome to let me know so that we can discuss options. You are also encouraged to contact the Disability Resource Center at: link:mailto:drc@purdue.edu[drc@purdue.edu] or by phone: 765-494-1247. - -If you have been certified by the Office of the Dean of Students as someone needing a course adaptation or accommodation because of a disability OR if you need special arrangements in case the building must be evacuated, please contact The Data Mine staff during the first week of classes. We are happy to help you. - -=== Mental Health Resources - -• *If you find yourself beginning to feel some stress, anxiety and/or feeling slightly overwhelmed,* try link:https://purdue.welltrack.com/[WellTrack]. Sign in and find information and tools at your fingertips, available to you at any time. -• *If you need support and information about options and resources*, please contact or see the link:https://www.purdue.edu/odos/[Office of the Dean of Students]. Call 765-494-1747. Hours of operation are M-F, 8 am- 5 pm. -• *If you find yourself struggling to find a healthy balance between academics, social life, stress*, etc. sign up for free one-on-one virtual or in-person sessions with a link:https://www.purdue.edu/recwell/fitness-wellness/wellness/one-on-one-coaching/wellness-coaching.php[Purdue Wellness Coach at RecWell]. Student coaches can help you navigate through barriers and challenges toward your goals throughout the semester. Sign up is completely free and can be done on BoilerConnect. If you have any questions, please contact Purdue Wellness at evans240@purdue.edu. -• *If you’re struggling and need mental health services:* Purdue University is committed to advancing the mental health and well-being of its students. If you or someone you know is feeling overwhelmed, depressed, and/or in need of mental health support, services are available. For help, such individuals should contact link:https://www.purdue.edu/caps/[Counseling and Psychological Services (CAPS)] at 765-494-6995 during and after hours, on weekends and holidays, or by going to the CAPS office of the second floor of the Purdue University Student Health Center (PUSH) during business hours. - -=== Violent Behavior Policy - -Purdue University is committed to providing a safe and secure campus environment for members of the university community. Purdue strives to create an educational environment for students and a work environment for employees that promote educational and career goals. Violent Behavior impedes such goals. 
Therefore, Violent Behavior is prohibited in or on any University Facility or while participating in any university activity. See the link:https://www.purdue.edu/policies/facilities-safety/iva3.html[University’s full violent behavior policy] for more detail. - -=== Diversity and Inclusion Statement - -In our discussions, structured and unstructured, we will explore a variety of challenging issues, which can help us enhance our understanding of different experiences and perspectives. This can be challenging, but in overcoming these challenges we find the greatest rewards. While we will design guidelines as a group, everyone should remember the following points: - -• We are all in the process of learning about others and their experiences. Please speak with me, anonymously if needed, if something has made you uncomfortable. -• Intention and impact are not always aligned, and we should respect the impact something may have on someone even if it was not the speaker’s intention. -• We all come to the class with a variety of experiences and a range of expertise, we should respect these in others while critically examining them in ourselves. - -=== Basic Needs Security Resources - -Any student who faces challenges securing their food or housing and believes this may affect their performance in the course is urged to contact the Dean of Students for support. There is no appointment needed and Student Support Services is available to serve students from 8:00 – 5:00, Monday through Friday. The link:https://www.purdue.edu/vpsl/leadership/About/ACE_Campus_Pantry.html[ACE Campus Food Pantry] is open to the entire Purdue community). - -Considering the significant disruptions caused by the current global crisis as it related to COVID-19, students may submit requests for emergency assistance from the link:https://www.purdue.edu/odos/resources/critical-need-fund.html[Critical Needs Fund]. - -=== Course Evaluation - -During the last two weeks of the semester, you will be provided with an opportunity to give anonymous feedback on this course and your instructor. Purdue uses an online course evaluation system. You will receive an official email from evaluation administrators with a link to the online evaluation site. You will have up to 10 days to complete this evaluation. Your participation is an integral part of this course, and your feedback is vital to improving education at Purdue University. I strongly urge you to participate in the evaluation system. - -You may email feedback to us anytime at link:mailto:datamine@purdue.edu[datamine@purdue.edu]. We take feedback from our students seriously, as we want to create the best learning experience for you! - -=== General Classroom Guidance Regarding Protect Purdue - -Any student who has substantial reason to believe that another person is threatening the safety of others by not complying with Protect Purdue protocols is encouraged to report the behavior to and discuss the next steps with their instructor. Students also have the option of reporting the behavior to the link:purdue.edu/odos/osrr/[Office of the Student Rights and Responsibilities]. See also link:https://catalog.purdue.edu/content.php?catoid=7&navoid=2852#purdue-university-bill-of-student-rights[Purdue University Bill of Student Rights] and the Violent Behavior Policy under University Resources in Brightspace. 
- -=== Campus Emergencies - -In the event of a major campus emergency, course requirements, deadlines and grading percentages are subject to changes that may be necessitated by a revised semester calendar or other circumstances. Here are ways to get information about changes in this course: - -• Brightspace or by e-mail from Data Mine staff. -• General information about a campus emergency can be found on the Purdue website: link:www.purdue.edu[]. - - -=== Illness and other student emergencies - -Students with *extended* illnesses should contact their instructor as soon as possible so that arrangements can be made for keeping up with the course. Extended absences/illnesses/emergencies should also go through the Office of the Dean of Students. - -=== Disclaimer -This syllabus is subject to change. Changes will be made by an announcement in Brightspace and the corresponding course content will be updated. - diff --git a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project01.adoc b/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project01.adoc deleted file mode 100644 index 00c3ab29a..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project01.adoc +++ /dev/null @@ -1,252 +0,0 @@ -= TDM 10200: Project 1 -- 2023 - -**Motivation:** In this project we are going to jump head first into The Data Mine. We will load datasets into environment, and introduce some core programming concepts like variables, vectors, types, etc. As we will be "living" primarily in an IDE called Jupyter Lab, we will take some time to learn how to connect to it, configure it, and run code. - -**Context:** This is our first project this spring semester. We will get situated, configure the environment we will be using throughout our time with The Data Mine, and jump straight into working with data! - -**Scope:** Python, Jupyter Lab, Anvil - -.Learning Objectives -**** -- Learn how to run Python code in Jupyter Lab on Anvil. -- Read and write basic (csv) data using Python. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/1991.csv` -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` -- `/anvil/projects/tdm/data/disney/flight_of_passage.csv` - -== Questions - -=== How to Login to Anvil - -++++ - -++++ - -=== Setting up to start working on a project - -++++ - -++++ - -=== ONE - -++++ - -++++ - -Navigate and login to https://ondemand.anvil.rcac.purdue.edu using your ACCESS credentials (and Duo). You will be met with a screen, with lots of options. Don't worry, however, the next steps are very straightforward. - -[TIP] -==== -If you did not (yet) setup your 2-factor authentication credentials with Duo, you can go back to Step 9 and setup the credentials here: https://the-examples-book.com/starter-guides/data-engineering//rcac/access-setup -==== - -Towards the middle of the top menu, there will be an item labeled btn:[My Interactive Sessions], click on btn:[My Interactive Sessions]. On the left-hand side of the screen you will be presented with a new menu. You will see that there are a few different sections: Bioinformatics, Interactive Apps, and The Data Mine. Under "The Data Mine" section, you should see a button that says btn:[Jupyter Notebook], click on btn:[Jupyter Notebook]. 
- -If everything was successful, you should see a screen similar to the following. - -image::figure01.webp[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"] - -Make sure that your selection matches the selection in **Figure 1**. Once satisfied, click on btn:[Launch]. Behind the scenes, OnDemand launches a job to run Jupyter Lab. This job has access to 2 CPU cores and 3800 Mb. - -[NOTE] -==== -If you select 4000 Mb of memory instead of 3800 Mb, you will end up getting 3 CPU cores instead of 2. OnDemand tries to balance the memory to CPU ratio to be _about_ 1900 Mb per CPU core. -==== - -We use the Anvil cluster because it provides a consistent, powerful environment for all of our students, and it enables us to easily share massive data sets with the entire Data Mine. - -After a few seconds, your screen will update and a new button will appear labeled btn:[Connect to Jupyter]. Click on btn:[Connect to Jupyter] to launch your Jupyter Lab instance. Upon a successful launch, you will be presented with a screen with a variety of kernel options. It will look similar to the following. - -image::figure02.webp[Kernel options, width=792, height=500, loading=lazy, title="Kernel options"] - -There are 2 primary options that you will need to know about. - -f2022-s2023:: -The course kernel where Python code is run without any extra work, and you have the ability to run R code or SQL queries in the same environment. - -[TIP] -==== -To learn more about how to run R code or SQL queries using this kernel, see https://the-examples-book.com/projects/current-projects/templates[our template page]. -==== - -Let's focus on the f2022-s2023 kernel. Click on btn:[f2022-s2023], and a fresh notebook will be created for you. - -Each block in a Jupyter Notebook is called a `cell`. There are two primary types of cells: code and markdown. By default, a cell will be a code cell. A markdown cell displays text that can be formatted using https://www.ibm.com/docs/en/watson-studio-local/1.2.3?topic=notebooks-markdown-jupyter-cheatsheet[markdown language] and will not be treated as code. You can read more about markdown https://guides.github.com/features/mastering-markdown/[here]. - -.Insider Knowledge -[%collapsible] -==== -A line that begins with *#* in a code cell is a comment. A comment can document the code that follows it; documentation is important so that others can understand your code. Comments are not run as code, so they don't influence the result and are ignored when you run the cell. -==== - -In the first cell, create a markdown cell that has your name and the course number. In the second cell, add the comment "Print the sum of 7 and 10", then place the Python code on the next line and run the cell. What is the output? - -[source,python] ----- -print(7+10) ----- - -.Helpful Hint -[%collapsible] -==== -To run the code in a code cell, you can either press kbd:[Ctrl+Enter] on your keyboard or click the small "Play" button in the notebook menu. -==== - -.Items to submit -==== -- Result of code. -==== - -=== TWO - -++++ - -++++ - -In the upper right-hand corner of your notebook, you will see the current kernel for the notebook, `f2022-s2023`. If you click on this name you will have the option to swap kernels out -- no need to do this yet, but it is good to know!
- -.There are different data types in Python, some of the built in types include: -* Integer (int) -* Float (float) -* string (str) -* types can include list, tuple, range -* Mapping data type (dict) -* Boolean type (bool) - -.Insider Knowledge -[%collapsible] -==== -Numeric - -. int - holds signed integers of non-limited length. -. long- holds long integers(exists in Python 2.x, deprecated in Python 3.x). -. float- holds floating precision numbers and it is accurate up to 15 decimal places. -. complex- holds complex numbers. - -String - a sequence of characters, generally strings are represented by single or double-quotes - -Lists- ordered sequence of data written using square brackets *[]* and commas *(,)*. - -Tuple- similar to a list but immutable. Data is written using a parenthesis *()* and commas *(,)*. - -Dictionary is an unordered sequence of data of key-value pair(two pieces of data that have a set of associated values, two related data elements). -==== -We are going to create a variable, we are assigning the numbers 1,2,3 to a variable called my_list. - -[source,python] ----- -my_list = [1, 2, 3] -print(f'My list is: {my_list}') ----- - -We are going to practice assigning variable and doing some simple requests in Python - -.One -.. create a variable named `x` and assign the number 6 to it -.. create a variable named `y` and assign the number 8 to it -.. create a variable named `z` and assign `x * y` to it -.. now print `z` - -.Two -.. assign `x,z,y` the same value of "peanutbutter" all in one line - -.Three -.. assign the ingredients of a club sandwich to the variable `club_sandwich` - -.Helpful Hint -[%collapsible] -==== -To learn more about how to run various types of code using this kernel, see https://the-examples-book.com/projects/current-projects/templates[our template page]. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== THREE - -This year, the first step to starting any project should be to download and/or copy https://the-examples-book.com/projects/current-projects/_attachments/project_template.ipynb[our project template] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`). - -Open the project template and save it into your home directory, in a new notebook named `firstname-lastname-project01.ipynb`. - -Fill out the project template, replacing the default text with your own information, and transferring all work you've done up until this point into your new notebook. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A. - -.Items to submit -==== -- How many of each types of cells are there in the default template? -==== - -=== FOUR - -++++ - -++++ - -We are going to open up this ("/anvil/projects/tdm/data/disney/flight_of_passage.csv") dataset in Python. - -[source,python] ----- -import pandas as pd -disney= pd.read_csv("/anvil/projects/tdm/data/disney/flight_of_passage.csv") -disney ----- - -Using the Pandas library we are able to see how many rows and columns are in this dataset. Pandas is a data analysis library that is one of the most commonly used in Python. - - -.Items to submit -==== -- How many rows are in this data set? -- How many columns are in this dataset? -- Use the `head()` and `tail()` to look at the beginning and end of the data. -==== - -=== FIVE - -++++ - -++++ - -Let's pretend we are now done with the project. 
We've written some code, maybe added some markdown cells to explain what we did, and we are ready to submit our assignment. For this course, we will turn in a variety of files, depending on the project. - -We will always require a Jupyter Notebook file. Jupyter Notebook files end in `.ipynb`. This is our "source of truth" and what the graders will turn to first when grading. - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this. -==== - -A `.ipynb` file is generated by first running every cell in the notebook, and then clicking the "Download" button from menu:File[Download]. - -In addition to the `.ipynb`, if a project uses Python code., you will need to also submit a Python script. A Python script is just a text file with the extension `.py`. - -Let's practice. take the Python code from this project and copy and paste it into a text file with the `.py` extension. Call it `firstname-lastname-project01.py`. Download your `.ipynb` file -- making sure that the output from all of your code is present and in the notebook (the `.ipynb` file will also be referred to as "your notebook" or "Jupyter notebook"). - -Once complete, submit your notebook,and Python script. - -.Items to submit -==== -- `firstname-lastname-project01.py`. -- `firstname-lastname-project01.ipynb`. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project02.adoc b/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project02.adoc deleted file mode 100644 index 5de112d7f..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project02.adoc +++ /dev/null @@ -1,196 +0,0 @@ -= TDM 10200: Project 2 -- 2023 - -**Motivation:** Pandas will enable us to work with data in Python (in a similar way to the data frames that we learned in R in the fall semester). - -**Context:** This is our second project and we will continue to introduce some basic data types and go thru some similar control flow concepts like we did in `R`. - -**Scope:** tuples, lists, if statements, opening files - -.Learning Objectives -**** -- List the differences between lists & tuples and when to use each. -- Gain familiarity with string methods, list methods, and tuple methods. -- Demonstrate the ability to read and write data of various formats using various packages. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
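Before diving into the dataset, here is a small warm-up sketch (not part of the graded questions; the names in it are made up for illustration) showing the list/tuple distinction and a few of the string and list methods mentioned in the learning objectives above:

[source,python]
----
# Lists are mutable: they can grow and change in place.
colors = ['red', 'purple', 'red']
colors.append('white')       # list method: add an element to the end
colors[0] = 'blue'           # allowed: lists support item assignment
print(colors.count('red'))   # list method: count occurrences -> 1

# Tuples are immutable: useful for fixed records that should not change.
sale = ('red', 'blue')       # (shoe color, hat color)
# sale[0] = 'purple'         # would raise a TypeError

# A few common string methods.
phrase = 'The Data Mine'
print(phrase.lower())        # 'the data mine'
print(phrase.split())        # ['The', 'Data', 'Mine']
----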
- -== Dataset(s) - -The following questions will use the following dataset(s): - -- /anvil/projects/tdm/data/death_records/DeathRecords.csv - -== Questions - -=== ONE - -++++ - -++++ - -`pandas` is an integral tool for various data science tasks in Python. You can read a quick intro https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html[here]. - -Let's first learn how to create a simple dataframe. -Imagine we have a store that sells only hats (blue or white) and shoes (red or purple). + -It's a Saturday and we want to keep track of our customers' purchases. -Our first sale is a red shoe and a blue hat. Second sale is a purple shoe and a blue hat. Third is a red shoe and a blue hat, fourth is a purple shoe and a white hat, fifth is a red shoe and a white hat, the sixth and seventh sales are red shoes and blue hats. - -So it looks a bit like this: -[source, python] ----- -data = { - 'shoes':['red', 'purple', 'red', 'purple', 'red', 'red', 'red'], - 'hats': ['blue', 'blue', 'blue', 'white', 'white', 'blue', 'blue'] -} ----- - -[loweralpha] -.. Create a data set named 'data' -.. Take the data you created and make it into a dataframe named `store` -.. Now change the index numbers 0-6 to customers Jay, Mary, Bill, Chris, Martha, Karen, Rob - -.Helpful Hint -[%collapsible] -==== -[source, python] ----- -store = pd.DataFrame(data, index=['Jay', 'Mary', 'Bill', 'Chris', 'Martha','Karen', 'Rob']) - -store ----- -==== - - - -.Insider Knowledge -[%collapsible] -==== -`Pandas` allows you to extract data from a CSV (comma-separated values) file. `Pandas` is a great way to get acquainted with your data, including the ability to clean, transform, and analyze data. - -The two main components of pandas are the `series` and `DataFrame`. A `series` is one dimensional (you can think of it as a column of data) and a `DataFrame` is a table made up of a collection of `series`. - -Notice that the indexing for our dataframe starts at 0. In `python`, the indexing starts at 0, as compared to `R` in the fall semester, where the indexing began at 1. This is an important fact to remember. - -==== - - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - - - -=== TWO - -++++ - -++++ - -Open a new notebook in Jupyter Lab, and select the f2022-s2023 kernel. -We want to go ahead and read in the dataset /anvil/projects/tdm/data/death_records/DeathRecords.csv into a `pandas` DataFrame called `myDF`. + - -[loweralpha] -.. Find the information for the 11th row in the dataframe -.. Find the last five rows of the data frame -.. Find how many rows and columns there are in the entire dataframe -.. Print just the column names - - - -.Helpful Hints -[%collapsible] -==== -[source,python] ----- -.head() -.tail() -.shape ----- -==== - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - -=== THREE - -++++ - -++++ - -Let's look for specific information in our dataframe so we can become a bit more familiar with what it contains. - -[loweralpha] -.. How many people over the age of 52 are on this list? -.. How many males and how many females are on this list? -.. How many females over the age of 70 are on this list? - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - -=== FOUR - -++++ - -++++ - -Now that we have a bit of familiarity with the data, let's introduce another common `python` package called `matplotlib`. -Let's create a graphic using this package. - -[loweralpha] -..
Create a graphic that illustrates the number of people who are divorced, married, single, unmarried, or widowed. -.. Create another graphic that illustrates the distribution of the age of the person at the time of death. - - -.Helpful Hint -[%collapsible] -==== -[source,python] ----- -import matplotlib.pyplot as plt ----- -==== - -.Insider Knowledge -[%collapsible] -==== -*Matplotlib* is a data visualization and plotting library for `Python`. It provides easy ways to visualize data. -==== - - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - -=== FIVE - -Now that you are familiar with the data and have an introduction to plotting, create a plot of your choice to summarize something that you find interesting about the data. - -[loweralpha] -.. Use `pandas` and your investigative skills to look through the data and find an interesting fact, and then create a graphic that summarizes some of the data from our dataset! - - - - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project03.adoc b/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project03.adoc deleted file mode 100644 index a9b15151f..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project03.adoc +++ /dev/null @@ -1,214 +0,0 @@ -= TDM 10200: Project 3 -- 2023 - -**Motivation:** Learning about Big Data. When working with large data sets, it is important to know how we can use loops to find our information, a little bit at a time, without reading in all of the files at once. -We will need to *set our cores to 4* for this when we spin up Jupyter Notebook. This will give us more space to handle processing this week's datasets. If we do not adjust the cores, then our kernel will crash every time we try to run a cell. - - -**Scope:** Python, if statements, for loops - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -/anvil/projects/tdm/data/flights/subset/ - -== Questions - -=== ONE - -++++ - -++++ - -++++ - -++++ - -First let's go ahead and see what files are in this dataset: - -[source, python] ----- -ls /anvil/projects/tdm/data/flights/subset/ ----- -This allows you to see all of the files that are in this dataset. - -[loweralpha] -.. Considering the files in this directory, which years of data are available? -.. What would the path be to access only the 2003 file in this data set? -.. Go ahead and import the library `pandas` as `pd`, and import `Path` from `pathlib` - -.Helpful Hints -[%collapsible] -==== -[source, python] ----- -/anvil/projects/tdm/data/flights/subset/2003.csv - -import pandas as pd -from pathlib import Path ----- -==== - -.Items to submit -==== -- Code used to answer the question.
-==== - - - -=== TWO - -++++ - -++++ - -[source, python] ----- -files = [Path(f'/anvil/projects/tdm/data/flights/subset/{year}.csv') for year in range(1987,2009)] -files ----- -This code uses list comprehension to create a list of file paths, ranging from the years of 1987-2008 -*`files`* will now contain strings of file paths for *all* the csv files in this directory. - -Notice that, in a range in Python, the final number in the range is not included. - -Let's test out the first file (from 1987) to see if we can find the column names -[loweralpha] -.. How many columns are there? -.. What are the column names? -.. Display the data from the first five rows. - -.Helpful Hints -[%collapsible] -==== -#reads the first file into a df called `eightseven` -eightyseven = pd.read_csv(files[0]) -#looks for column names from the df -column_names = eightyseven.columns -print(column_names) -==== - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - - -=== THREE - -++++ - -++++ - -Let's look at the column `Origin` - -[loweralpha] -.. Print out the unique elements of the `Origin` column. -.. Find the number of times that `IND` occurs in the `Origin` column. - - -.Helpful Hint -[%collapsible] -==== -[source,python] ----- -eightyseven['Origin'].value_counts()['IND'] ----- -==== - -.Items to submit -==== -- Code used to answer the question. -- Result of the code -==== - -=== FOUR - -++++ - -++++ - -Let's do the same thing for the 1988, 1989, and 1990 data sets, naming them `eightyeight`, `eightynine`, `ninety`. - -[loweralpha] -.. How many times is 'IND' the `Origin` airport in `eightyeight`? -.. How many times is 'IND' the `Origin` airport in `eightynine`? -.. How many times is 'IND' the `Origin` airport in `ninety`? - - -.Items to submit -==== -- Code used to answer questions -- Result of the code -==== - -=== FIVE - -++++ - -++++ - -Knowing that we can find how many times the value `IND` shows up in `Origin`, we could do this for each of the years, and add them up, to find the total number of times that Indianapolis airport was the origin airport for flights taken during the years 1987-2008. But that is a lot of work!! We can shorten the tedious work of keeping track and adding things manually, by (instead) using a for loop, to go thru all of the subset files and keep track of the total number of times that `IND` shows up in the `Origin` column. - -We could use this code -[source, python] ----- -count = 0 -for file in files: - df = pd.read_csv(file) - count += len(df[df['Origin'] == 'IND']) - -print(count) ----- -*BUT* before you go ahead and copy and paste, this code will take ALL of the files and read them into memory, and this will crash the kernel, even if we upped the cores to 4. - -This means we have to think of another way to do it. - -We have to use `for loops` which is a way to iterate to check for certain conditions and repeatedly execute them. This is very helpful when you come across situations in which you need to use a specific code over and over again but you don't want to write the same line of code multiple times. - -We need to consider a way that will allow us to go thru the files line by line, and read them but then not commit them to memory. In this way, we can go thru all of the data files and still keep track of how many occurrences we have, for a specific value. 
- -[source,python] ----- -total_count = 0 -for file in files: - for df in pd.read_csv(file, chunksize=10000): - for index, row in df.iterrows(): - if row['Origin'] == 'IND': - total_count += 1 - -print(total_count) ----- - -You will note that doing the above code DOES produce the correct answer BUT it take a very long time to run! -Is there a shorter way to run this code? - -.Helpful Hint -[%collapsible] -==== -[source, python] ----- -origin_ind = 0 -for file in files: - with open(file,'r') as f: - for line in f: - if line.split(",")[16] == 'IND': - origin_ind += 1 -print(origin_ind) ----- -==== - - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project04.adoc b/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project04.adoc deleted file mode 100644 index 8914cc507..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project04.adoc +++ /dev/null @@ -1,299 +0,0 @@ -= TDM 10200: Project 4 -- 2023 - -**Motivation:** In the last project, we spent time using if statements and for loops, today we are going to take a step back and learn more about loops. There are three main types of loops. `for` loops, `while` loops, and nested loops. We will also talk about `tuples` and `lists`. -We will also learn about one of the most useful data structures in Python, a `dictionary` commonly referred to as *`dict`*. - - -**Context:** We will continue to introduce some basic data types and go thru some similar control flow concepts like we did in `R`. - -**Scope:** tuples, lists, loops, dict - - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -/anvil/projects/tdm/data/craigslist/vehicles.csv - -== Questions -read in the dataset and name it `cars` - -.Helpful Hint -[%collapsible] -==== -[source, python] ----- -import pandas as pd -cars = pd.read_csv("/anvil/projects/tdm/data/craigslist/vehicles.csv") ----- -==== - - -=== ONE - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -A `dict` contains a collection of key value pairs -[source,python] ----- -NFL_team = dict([ - ('Indiana', 'Colts'), - ('Kansas City', 'Chiefs'), - ('Philadelphia', 'Eagles'), - ('Minnesota', 'Vikings'), - ('New England', 'Patriots'), - ('Miami', 'Dolphins') -]) -print(NFL_team) -#output of code -{'Indiana': 'Colts', 'Kansas City': 'Chiefs', 'Philadelphia': 'Eagles', 'Minnesota': 'Vikings', 'New England': 'Patriots', 'Miami': 'Dolphins'} ----- - -There are two primary ways to retrieve information from a `dict`. - -* mydict.get() -* mydict[] - -[loweralpha] -.. Create a dictionary of MLB teams for the American League, call it `MLB_teams` -.. Now add the MLB teams from the National League to the current `dict`. -.. Delete all the teams that are South of Tennessee and North Carolina. 
(Mississippi, Alabama, Georgia, South Carolina, and Florida.) - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - -=== TWO - -++++ - -++++ - -++++ - -++++ - - -Loops are important in any programming language, because they help to execute code repeatedly. + - -A `while` loop executes a block of statements repeatedly, until a given condition is satisfied. - -It looks a bit like this: -[source, python] ----- -count = 0 -while (count < 15): - count = count + 2 - print ("Yay!") ----- - -You can also pair an `else` statement with a `while` loop. The `else` statement will ONLY be executed when the `while` condition is false. -[source, python] ----- -while condition - # executes specific statments -else: - # execute specific statments ----- - -We can add an `else` statement that will print "Boo!" when the condition `count < 15` fails to be true (at the end of the loop) -[source, python] ----- -count = 0 -while (count < 15): - count = count + 2 - print ("Yay!") -else: - print ("Boo!") ----- -.. Use a while loop to print a series of numbers from 0 to 200, counting by 10's - -.. Put the phrase "Old McDonald had a farm e-i-e-i-o" into a string and call it `words`. Print everything in the string *EXCEPT* the letter *a* - -.. Now take `words` and replace each occurrence of the symbol *-* with an asterisk *** - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - -=== THREE - -++++ - -++++ - -A `for` loops is typically used for going thru a list, array, or a string. Typically it runs a specific code over and over again, `for` a defined number of times in a sequence. A `while` loop runs until it hits a certain condition, but a `for` loop iterates over items within a sequence or list. - -[source, python] ----- -for itarator_variable in sequence_name: - statements - ... - statements ----- - -.Insider information -[%collapsible] -==== --The first word of the statement is `for` which identifies that it is the beginning of the `for` loop. + -- The `iterator variable` is a variable that changes each time the loop is executed. + -- The keyword `in` shows the iterator variable which elements to loop over in a sequence. + -- The statements allow you to preform various functions -==== -.Helpful Hint -[%collapsible] -==== -- *enumerate()* The function enumerate() allows us to iterate thru a sequence but it keeps track of the index and element. It can also be converted into a list of tuples using the `list()` function. + -[source, python] ----- -#create list of fruit -fruit = ['cherry', 'banana', 'orange', 'kiwi', 'apple'] -#enumerate fruit but start at number one since default is 0 -num_fruit = enumerate(fruit, start=1) -#print the enumerate object as a list -print (list(num_fruit)) -#output from code -[(1, 'cherry'), (2, 'banana'), (3, 'orange'), (4, 'kiwi'), (5, 'apple')] ----- -- *range()* The function is built into python that allows for iteration through a sequence of numbers. `range()` will never include the stop number in its result (aka 6) and always includes 0 + -[source,python] ----- -range(6) -for n in range(6): - print(n) -#output from code -0 -1 -2 -3 -4 -5 ----- -==== - -[loweralpha] -.. Create a `for` loop -.. Now add in the `enumerate()` function to your `for` loop. -.. Create a 'for' loop with the `range()` function - -Check out the Helpful Hint for an examples - -.Insider Knowledge -[%collapsible] -==== -Notice that the indexing for our dataframe starts at 0. 
In `Python` and other programming languages, the indexing starts at 0. In contrast, during our previous semester, working in `R`, the indexing began at 1. This is an important fact to remember. -==== - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - - -=== FOUR - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -From the dataset `cars` create a `dict` called `mydict` that contains key:value pairs. The keys should be the years and the values are single integers representing the number of vehicles from that year. - -.Helpful Hint -[%collapsible] -==== -[source, python] ----- -myyears = cars['year'].dropna().to_list() -# get a list containing each unique year -unique_years = list(set(myyears)) -# for each year (key), initialize the value (value) to 0 -mydict = {} -for year in unique_years: - mydict[year] = 0 ----- -==== - -From the new dictionary that you created, find the number of cars, during each of these years: -[loweralpha] -.. 2011 -.. 1989 -.. 1997 - - -.Items to submit -==== -- Code used to answer the question -- Result of the code -==== - - - -=== FIVE - -++++ - -++++ - -Now that we have a bit of familiarity with the data, let's revisit another common `Python` package, called 'matplotlib' -Let's create some graphics using this package. -[loweralpha] -.. Create a bar graph that has years on x-axis and number of vehicles on the y-axis -.. Create a graph of something that you find interesting about the data. - - -.Helpful Hint -[%collapsible] -==== -[source,python] ----- -import matplotlib.pyplot as plt ----- - -==== - -.Items to submit -==== -- Code used to answer the question -- Result of the code -==== - - - - -[NOTE] -==== -TA applications for The Data Mine are currently being accepted. Please visit us https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE[here] to apply! -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project05.adoc b/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project05.adoc deleted file mode 100644 index ed00c7f53..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project05.adoc +++ /dev/null @@ -1,192 +0,0 @@ -= TDM 10200: Project 5 -- 2023 - -**Motivation:** Once we have some data analysis working in Python, we often want to wrap it into a function. Dr Ward usually tests anything that he wrote (usually 5 times), to make sure it works, before wrapping it into a function. Once we are sure our analysis works, if we wrap it into a function, it can usually be easier to use. - - -**Context:** Functions also help us to put our work into bite-size pieces that are easier to understand. The basic idea is similar to functions from R or from other languages and tools. - -**Scope:** functions - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
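To give a concrete picture of the idea (this is only a sketch with made-up names, not the required solution for any question below), a small piece of analysis that has already been tested can be wrapped into a function like this:

[source,python]
----
import pandas as pd

def count_in_column(mydf, column, value):
    # Count how many rows of mydf have the given value in the given column.
    return (mydf[column] == value).sum()

# Test the analysis directly first, then call the function the same way, e.g.:
# cars = pd.read_csv("/anvil/projects/tdm/data/craigslist/vehicles.csv")
# count_in_column(cars, 'year', 2011)
----

Once the function works, it is easy to call it repeatedly, for example once per year, instead of copying and pasting the same code.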
- -== Dataset(s) - -The following questions will use the following dataset(s): - -/anvil/projects/tdm/data/craigslist/vehicles.csv - -/anvil/projects/tdm/data/flights/subset/ - -/anvil/projects/tdm/data/death_records/DeathRecords.csv - - -[TIP] -==== -We have written several examples to get you started with functions https://the-examples-book.com/programming-languages/python/writing-functions[here]. All 6 of the videos for this project are given at the top of this page. At the beginning, you probably only need to read the section on https://the-examples-book.com/programming-languages/python/writing-functions#arguments[Arguments]. -==== - -[TIP] -==== -If it helps, you also have a longer article available https://realpython.com/defining-your-own-python-function/[here]. It is a very detailed article going through many things that you can do with functions in Python. In particular, the section on https://realpython.com/defining-your-own-python-function/#argument-passing[Argument Passing] might be helpful. -==== - -== Questions - -=== ONE - -Read in the dataset from Project 4 - -/anvil/projects/tdm/data/craigslist/vehicles.csv - -and name it `cars`. - -[loweralpha] -.. Write a function called `mycarcount` that takes two parameters: `cars` as a data frame, and `year` as an integer, and outputs the number of cars from that `year`. (Alternatively, you can just use 1 argument, the `year`, as a parameter, and then read through the `cars` data frame inside the function. Either way is OK.) -.. Run the function for each of the years from Project 4, Question 4, namely, for the years 2011, 1989, 1997. Make sure that your answers agree with the results from that earlier project. - -[TIP] -==== -You can solve this question in a couple different ways. You can either read in the entire dataset into a data frame called `cars` or you can read just a few lines at a time, using the `chunksize = 10000` and the `iterrows` like in the videos. Either way is OK! -==== - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - - - -=== TWO - - -[loweralpha] -.. Run the function `mycarcount` for each year in the data set. (Of course, be sure to only run it once for each year!) -.. Now make sure that the results agree, if you compare with the `value_counts()` from the `year` column. - -[TIP] -==== -It will take a long time to run `mycarcount` on each year in the data set, so you might want to start by running `mycarcount` on just a few years, for instance, on 5 years or 10 years, to make sure that things work, before running `mycarcount` on all of the years. -==== - - - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - - - -=== THREE - -Use the csv data sets from this directory (the same as from Project 3): - -/anvil/projects/tdm/data/flights/subset/ - - -[loweralpha] -.. Write a function that takes two parameters: `myorigin` as a string with three characters, and `year` as an integer, and outputs the number of flights that depart during that year, from the `Origin` airport indicated in `myorigin`. -.. Test your function for a few years and airports of your choice. You can choose! Do your results look reasonable, i.e., do the airports in the big cities have lots of flights, compared to airports in smaller cities? -.. Run the function for each of the years from 1987 to 2008, checking how many flights depart from `IND` in each year. Make sure that you use the method from the end of Project 3, Question 5. 
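-
-The sketch below shows one possible shape for such a function. It assumes the file-per-year layout of the subset directory (e.g., `1987.csv`) and uses the chunked-reading pattern from the videos, so the full file is never loaded into memory at once; the function name and exact structure are just an illustration, not the only correct answer.
-[source, python]
-----
-import pandas as pd
-
-def count_departures(myorigin: str, year: int) -> int:
-    """Count the flights that depart from the airport myorigin during the given year."""
-    filename = f"/anvil/projects/tdm/data/flights/subset/{year}.csv"
-    total_count = 0
-    for df in pd.read_csv(filename, chunksize=10000):
-        for index, row in df.iterrows():
-            if row['Origin'] == myorigin:
-                total_count += 1
-    return total_count
-----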
- - -[TIP] -==== -For this question, you should not read the full data frame all at once, but instead, you should just a few lines at a time, using the `chunksize = 10000` and the `iterrows` like in the videos. -==== - - -[TIP] -==== -It will take a long time to run your function on each year in the data set, so you might want to start by running your function on just a few years, for instance, on 3 years or 5 years, to make sure that things work, before running your function on all of the years. -==== - - -.Helpful Hint -[%collapsible] -==== -[source,python] ----- -total_count = 0 -for df in pd.read_csv(putthefilenamehere, chunksize=10000): - for index, row in df.iterrows(): - if row['Origin'] == myorigin: - total_count += 1 ----- -==== - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - - - -=== FOUR - -Extend your function from Question 3 as follows: - -[loweralpha] -.. Modify your function so that it takes three parameters: `myorigin` and `mydest` as strings that each have three characters, and `year` as an integer, and outputs the number of flights that depart during that year, from the `Origin` airport indicated in `myorigin`, and arrive at the `Dest` airport indicated in `mydest`. -.. Test your function for a few years and pairs of airports (origin and destination airports) of your choice. Do the results look reasonable, e.g., if you compare popular flight paths, versus unpopular flight paths? -.. Run the function for each of the years from 1987 to 2008, checking how many flights depart from `IND` and arrive at `ORD` in each year. - -[TIP] -==== -Again, for this question, you should not read the full data frame all at once, but instead, you should just a few lines at a time, using the `chunksize = 10000` and the `iterrows` like in the videos. -==== - -[TIP] -==== -Again, it will take a long time to run your function on each year in the data set, so you might want to start by running your function on just a few years, for instance, on 3 years or 5 years, to make sure that things work, before running your function on all of the years. -==== - - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - - -=== FIVE - - -Use the csv data set for the DeathRecords from Project 2: - -/anvil/projects/tdm/data/death_records/DeathRecords.csv - -[loweralpha] -.. Write a function that takes two parameters: `Sex` (which will be `F` or `M`) and `MaritalStatus` (`D` or `M` or `S` or `U` or `W`), and outputs the number of people with the indicated `Sex` and `MaritalStatus` in the data set. (If you look at an earlier version of this question, in which we asked about the year of death, well, everyone in the data set died in 2014, so you do not need to worry about the year of death.) - -[TIP] -==== -You can solve this question in a couple different ways. You can either read in the entire dataset into a data frame, or you can read just a few lines at a time, using the `chunksize = 10000` and the `iterrows` like in the videos. Either way is OK! -==== - - - -.Items to submit -==== -- Code used to answer the question -- Result of the code -==== - - - - -[NOTE] -==== -TA applications for The Data Mine are currently being accepted. Please visit us https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE[here] to apply! -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project06.adoc b/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project06.adoc deleted file mode 100644 index 4d6b548f0..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project06.adoc +++ /dev/null @@ -1,158 +0,0 @@ -= TDM 10200: Project 6 -- 2023 - -**Motivation:** Once we have some data analysis working in Python, we often want to wrap it into a function. Dr Ward usually tests anything that he wrote (usually 5 times), to make sure it works, before wrapping it into a function. Once we are sure our analysis works, if we wrap it into a function, it can usually be easier to use. - - -**Context:** Functions also help us to put our work into bite-size pieces that are easier to understand. The basic idea is similar to functions from R or from other languages and tools. - -**Scope:** functions - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -/anvil/projects/tdm/data/craigslist/vehicles.csv - -/anvil/projects/tdm/data/flights/subset/ - - -[TIP] -==== -We have written several examples to get you started with functions https://the-examples-book.com/programming-languages/python/writing-functions[here]. All 6 of the videos from Project 5 are given at the top of this page. -==== - -[TIP] -==== -UPDATE: WE ADDED 3 MORE VIDEOS FOR PROJECT 6, which are given under https://the-examples-book.com/programming-languages/python/writing-functions#new-videos-for-project-6[the heading Project 6 videos in this section]. -==== - -[TIP] -==== -If it helps, you also have a longer article available https://realpython.com/defining-your-own-python-function/[here]. It is a very detailed article going through many things that you can do with functions in Python. In particular, the section on https://realpython.com/defining-your-own-python-function/#argument-passing[Argument Passing] might be helpful. -==== - -== Questions - -=== ONE - -Read in the dataset from Project 4 - -/anvil/projects/tdm/data/craigslist/vehicles.csv - -and name it `cars`. - -[loweralpha] -.. Modify your `mycarcount` function from Question 1 of Project 5, so that it takes a list of years as a parameter, and prints the number of vehicles from each year in your list. -.. Now test your `mycarcount` function on the list of years `[2011, 1989, 1997]`. The output from `mycarcount(cars, [2011, 1989, 1997])` or from `mycarcount([2011, 1989, 1997])` should be the number of vehicles from each of the years 2011, 1989, and 1997, respectively. - -[TIP] -==== -As in Project 5, you can either have `cars` and the list of years as two arguments, or you can just have (only) the list of years as an argument and import the data from `cars` inside the function itself using the `chunksize = 10000` and the `iterrows` like in the videos. Either way is OK. -==== - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - - - -=== TWO - - -[loweralpha] -.. 
Write a loop that prints the number of vehicles from `chicago` as the `region` in each of the years 2016, 2017, 2018. -(I.e., you should have 3 lines of output.) -.. Now write a double-loop that prints the number of vehicles from each `region` in the list `[chicago, indianapolis, cincinnati]` in each of the years 2016, 2017, 2018. -(I.e., you should have 9 lines of output.) - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - - -=== THREE - - -[loweralpha] -.. Write a function with two arguments: a list of regions, and a list of years. The function should print a listing that shows the number of vehicles from each of those regions in each of those years. -(I.e., it will print one line of output for each region during each year.) -.. Use your new function to re-create the answer to question 2b. -.. Test your function on some lists of regions and lists of years of your choice. - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - - - - - -=== FOUR - -Use the function from Question 3a in Project 5. - - -[loweralpha] -.. Write a loop that prints the number of flights that depart from `IND` as the `Origin` airport in each of the years 1988, 1989, 1990. -(I.e., you should have 3 lines of output.) -.. Now write a double-loop that prints the number of flights that depart from each of the airports `IND`, `ORD`, `CVG` as the `Origin` airport in each of the years 1988, 1989, 1990. -(I.e., you should have 9 lines of output.) - - -[TIP] -==== -For this question, you should not read the full data frame all at once, but instead, you should just a few lines at a time, using the `chunksize = 10000` and the `iterrows` like in the videos. -==== - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - - - -=== FIVE - -Extend your function from Question 3 as follows: - -[loweralpha] -.. Write a function with two arguments: a list of `Origin` airports, and a list of years. -The function should print a listing that shows the number of flights departing from each of those `Origin` airports in each of those years. -(I.e., it will have one line of output for each `Origin` airport during each year.) -.. Use your new function to re-create the answer to question 4b. -.. Test your function on some lists of `Origin` airports and lists of years of your choice. - -[TIP] -==== -Again, for this question, you should not read the full data frame all at once, but instead, you should just a few lines at a time, using the `chunksize = 10000` and the `iterrows` like in the videos. -==== - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - - - -[NOTE] -==== -TA applications for The Data Mine are currently being accepted. Please visit us https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE[here] to apply! -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project07.adoc b/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project07.adoc deleted file mode 100644 index eb3f4d4c6..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project07.adoc +++ /dev/null @@ -1,138 +0,0 @@ -= TDM 10200: Project 7 -- 2023 - -**Motivation:** Pandas allows us to work with data frames. The actions that we perform on data frames will sometimes remind us of similar actions that we have performed on data frames during the previous semester with R. For instance, we often want to extract information about one or more variables, sometimes grouping the data according to one variable and summarizing another variable within each of those groups. - -**Context:** Unifying our understanding of Pandas and the ability to develop functions will allow us to systematically analyze data. - -**Scope:** Pandas and functions - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -/anvil/projects/tdm/data/flights/subset/ - - -== Questions - -=== ONE - -++++ - -++++ - -Make sure to have 2 cores when you start your Jupyter Lab session. - -For convenience, we will help you quickly get all of the data for all of the flights that depart or arrive at Indianapolis airport, as follows. Make a new cell in JupyterLab that has exactly this content (please copy and paste for accuracy): - -[source,bash] ----- -%%bash -head -n1 /anvil/projects/tdm/data/flights/subset/1987.csv >~/INDflights.csv -grep -h ",IND," /anvil/projects/tdm/data/flights/subset/*.csv >>~/INDflights.csv ----- - -We want to use Pandas to read in the data frame, and it will have a lot of columns. So we set Pandas to display an unlimited number of columns. - -[source,python] ----- -import pandas as pd -pd.set_option('display.max_columns', None) ----- - -Afterwards, in a separate cell in JupyterLab, you can read in your data to a Pandas data frame like this: - -[source,python] ----- -myDF = pd.read_csv('~/INDflights.csv') ----- - -Do not worry that you get a `DtypeWarning`; this will not affect our work on this project~ - -Now your data frame called `myDF` will contain all of the data for the flights (from October 1987 to April 2008) that depart or arrive at Indianapolis airport, which has 3-letter code `IND`. - -These files correspond to the years 1987 through 2008. Your data frame should contain all of the data for all of the flights with `IND` as the `Origin` or `Dest` airport. - -[loweralpha] -.. How many flights are there altogether in `myDF`? You can check this using `myDF.shape`. -.. How many of the flights are departing from `IND`? (I.e., the `Origin` airport is `IND`.) -.. How many of the flights are arriving to `IND`? (I.e., the `Dest` airport is `IND`.) - - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - - - -=== TWO - -++++ - -++++ - -[loweralpha] -.. For flights departing from 'IND' (i.e., with `IND` as the `Origin`), what are the 20 most popular destination airports (i.e., the 20 most popular `Dest` airports)? -.. For flights departing from 'IND' (i.e., with `IND` as the `Origin`), what are the 5 most popular airlines (i.e., the 5 most popular `UniqueCarrier`s)? - - -.Items to submit -==== -- Code used to answer the question. 
-- Result of code. -==== - - -=== THREE - -++++ - -++++ - -[loweralpha] -.. Wrap your work for question 2a into a function that takes 1 data frame as an argument and the corresponding 3-letter code as an argument, and finds the 20 most popular destination airports in that data frame. -.. Wrap your work for question 2b into a function that takes 1 data frame as an argument and the corresponding 3-letter code as an argument, and finds the 5 most popular airlines in that data frame. - - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - - - -=== FOUR - -++++ - -++++ - -Test your functions from question 3a and 3b on a couple of other airports. Hint: If we use huge airports, we likely will not have enough member in Pandas and our kernel might crash. So we will consider some midsize airports for testing the functions. Test your functions from questions 3a and 3b on Jacksonville (`JAX`) and Buffalo (`BUF`). - - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - - - -[NOTE] -==== -TA applications for The Data Mine are currently being accepted. Please visit us https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE[here] to apply! -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - diff --git a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project08.adoc b/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project08.adoc deleted file mode 100644 index 2053084ce..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project08.adoc +++ /dev/null @@ -1,218 +0,0 @@ -= TDM 10200: Project 8 -- 2023 - -**Motivation:** We are going to take a step back and work towards building a beer recommendation system. As you know already, a key part of analysis is being able to write functions. - - -**Context:** We will continue to introduce functions and practice doing so! - -**Scope:** python, functions, pandas - - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - --`/anvil/projects/tdm/data/yelp/data/parquet/` - -== Questions -Let's first list all of the files in this folder -.Helpful Hint -[%collapsible] -==== -[source, python] ----- -ls /anvil/projects/tdm/data/yelp/data/parquet/ ----- -==== -We want to load *only* two `pandas` data frames to use in this project. We will not be working with the other data frames. -Don't forget to import pandas! -[source, python] ----- -users = pd.read_parquet("/anvil/projects/tdm/data/yelp/data/parquet/users.parquet") -reviews = pd.read_parquet("/anvil/projects/tdm/data/yelp/data/parquet/reviews.parquet") ----- - - -[TIP] -==== -You will likely need to use 10 cores for this project because the files for this week are very large. 
-==== - - -++++ - -++++ - - -=== ONE - -++++ - -++++ - -++++ - -++++ - - -Typically we can assume that we surround ourselves with people with similar interests and we would have aligning tastes in restaurants and businesses. -Let's check that by writing a function called `get_friends_data` with the `user_id` as an argument. -This new function should return a pandas DataFrame with the information in the `users` DataFrame for each friend of `user_id`. - -In this function we want to add `type hints`. We have added `type hints` in most of our functions so far this spring, but now we want to pay attention to them. -These `type hints` provide a formal way to indicate the type of arguments passed to a function in python. -You can learn more about python type hints https://www.pythontutorial.net/python-basics/python-type-hints/[here] or https://docs.python.org/3.8/library/typing.html[this] is another good site for information. -Go ahead and use `type hints` into your function. Use one for the `user_id` and one for the returned data. - -These are the three examples that I tested in the videos: -[source, python] ----- -get_friends_data("ntlvfPzc8eglqvk92iDIAw") -get_friends_data("AY-laIws3S7YXNl_f_D6rQ") -get_friends_data("xvu8G900tezTzbbfqmTKvA") ----- - -.Helpful Hint -[%collapsible] -==== -a `type hint` for a string appears as `str` in our function -[source, python] ----- -get_friends_data(myuserid: str) ----- -==== -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - -=== TWO - -++++ - -++++ - -++++ - -++++ - - -Next, we need to write a function called `calculate_avg_business_stars` that accepts the `business_id` and returns the average number of stars that the business has received. - -Doing this we can use the `groupby()` function. This is one of the most powerful functions for (more easily) working with data frames in python. - -These are the three examples that I tested in the videos: -[source, python] ----- -calculate_avg_business_stars('-MhfebM0QIsKt87iDN-FNw') -calculate_avg_business_stars('5JxlZaqCnk1MnbgRirs40Q') -calculate_avg_business_stars('faPVqws-x-5k2CQKDNtHxw') ----- - -.Insider Information -[%collapsible] -==== -- groupby()- allows us group data according to categories and also can help us compile and summarize data easily. -==== - -.Items to submit -==== -- Code used to answer the question. -- Result of code. -==== - -=== THREE - -++++ - -++++ - -Next up, we need to write a function called `visualize_stars_over_time`. It needs to accept a `business_id` as an argument. It should make a line plot that shows the average number of stars for each year that the business has reviews. - -These are the three examples that I tested in the videos: -[source, python] ----- -visualize_stars_over_time("-MhfebM0QIsKt87iDN-FNw") -visualize_stars_over_time("5JxlZaqCnk1MnbgRirs40Q") -visualize_stars_over_time("faPVqws-x-5k2CQKDNtHxw") ----- - -.Helpful Hint -[%collapsible] -==== -You will need to import matplotlib.pyplot as plt -==== - - - -=== FOUR - -++++ - -++++ - - -We are now going to add an argument to our function `visualize_stars_over_time` that we wrote in question 3. This argument is called `granularity`. Granularity should accept one of two strings, either `"years"` or `"months"` (don't forget the double quotes!) that will help show the average rating over time. 
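-
-Before looking at the examples, here is a hedged sketch of how the `granularity` branch might be handled inside the function. The column names `date` and `stars` in the `reviews` data frame are assumptions here; adjust them to match whatever columns you used in the earlier questions.
-[source, python]
-----
-import pandas as pd
-import matplotlib.pyplot as plt
-
-def visualize_stars_over_time(business_id: str, granularity: str = "years") -> None:
-    # keep only the reviews for this business
-    sub = reviews[reviews['business_id'] == business_id].copy()
-    sub['date'] = pd.to_datetime(sub['date'])   # 'date' is an assumed column name
-    if granularity == "months":
-        grouped = sub.groupby(sub['date'].dt.to_period('M'))['stars'].mean()
-    else:
-        grouped = sub.groupby(sub['date'].dt.year)['stars'].mean()
-    grouped.plot(kind='line')
-    plt.show()
-----
-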
- -These are the nine examples that I tested in the videos: -[source, python] ----- -visualize_stars_over_time("-MhfebM0QIsKt87iDN-FNw") -visualize_stars_over_time("-MhfebM0QIsKt87iDN-FNw", "years") -visualize_stars_over_time("-MhfebM0QIsKt87iDN-FNw", "months") - -visualize_stars_over_time("5JxlZaqCnk1MnbgRirs40Q") -visualize_stars_over_time("5JxlZaqCnk1MnbgRirs40Q", "years") -visualize_stars_over_time("5JxlZaqCnk1MnbgRirs40Q", "months") - -visualize_stars_over_time("faPVqws-x-5k2CQKDNtHxw") -visualize_stars_over_time("faPVqws-x-5k2CQKDNtHxw", "years") -visualize_stars_over_time("faPVqws-x-5k2CQKDNtHxw", "months") ----- - - -.Insider Information -[%collapsible] -==== -Granularity indicates how much data can be shown on a chart. It can expressed in units of time, it can be - "minute" - "hour" - "day" - "week" - "month" - "year". -==== - -.Items to submit -==== -- Code used to answer the question -- Result of the code -==== - - - -=== FIVE - -++++ - -++++ - -Now we continue to modify the function `visualize_stars_over_time` that we were working on, from questions 3 and 4. We want to add the ability to accept multiple business_ids, and create a line for each id. (It is OK to remove the functionality about "granularity" from Project 4; just make the plots with the yearly summaries, and do not worry about the monthly granularity. You can just remove anything about the granularity.) - -This is the example that I tested in the videos: -[source, python] ----- -visualize_stars_over_time("-MhfebM0QIsKt87iDN-FNw", "5JxlZaqCnk1MnbgRirs40Q", "faPVqws-x-5k2CQKDNtHxw") ----- - - -.Items to submit -==== -- Code used to answer the question -- Result of the code -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project09.adoc b/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project09.adoc deleted file mode 100644 index c05b1de73..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project09.adoc +++ /dev/null @@ -1,186 +0,0 @@ -= TDM 10200: Project 9 -- Spring 2023 - - -**Motivation:** Working in pandas can be fun! Learning how to wrangle data and clean up data in pandas is a helpful tool to have in your tool belt! - -**Context:** Now that we are feeling more comfortable with building functions and using pandas we want to continue to build skills and use pandas to solve data driven problems. - -**Scope:** python, pandas, numpy - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) -When launching Juypter Notebook on Anvil you will need to use 2 cores. 
- -The following questions will use the following dataset(s): - -`/anvil/projects/tdm/data/disney/total.parquet` - - -.Helpful Hints -[%collapsible] -==== -[source,python] ----- -import pandas as pd -disney = pd.read_parquet('/anvil/projects/tdm/data/disney/total.parquet') ----- -==== - - - -.Insider Knowledge -[%collapsible] -==== -It is helpful to use a `Parquet` file when we need efficient storage. If we tried to read in all the .csv files in the disney folder the kernel would crash. In short a `Parquet` file allows for high performance data compression and encoding schemes to deal with large amounts of complex data. The format is a column-oriented file format while .csv's tend to be row-oriented. + -You can read more about what row vs column oriented databases are https://dataschool.com/data-modeling-101/row-vs-column-oriented-databases/[here]. -==== - -=== ONE - -++++ - -++++ - -++++ - -++++ - - -Luckily this data is being read in as already cleaned data. It also has been recently updated and has a lot more information, i.e., it has more data from more rides. - - -[loweralpha] -.. Since there is a lot of new ride data, let's print the name of each ride. -.. How many rows of data are there for each ride? -.. What is different about the information that you receive if you use the groupby() vs value_counts()? Which one yields the information asked by question 1b? Why? -.. Go ahead and import the `numpy` package and see if you can find the frequency of JUST the ride named `hall_of_presidents` from the column `ride_name`. Under *Helpful Hint* there are two different ways to do that, but can you come up with a third? - -.Helpful Hint -[%collapsible] -==== -[source,python] ----- -import numpy as np -disney[disney.ride_name == 'hall_of_presidents'].shape[0] -#OR -import numpy as np -(disney['ride_name']=='hall_of_presidents').sum() ----- -==== - -.Insider Knowledge -[%collapsible] -==== -* Note that, before it gives you all the unique values in the column `ride_name`, it tells you that it is an array. An array is a ordered collection of elements where every value has the same data type. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answer to questions a,b,c,d -==== - -=== TWO - -++++ - -++++ - -++++ - -++++ - -Create a new function that accepts a ride name as an argument, and prints two things: (1) the first year the data for that ride was collected, and (2) the most recent year that the data for that ride was collected. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== THREE - -++++ - -++++ - -Notice that the dataset has two columns `SPOSTMIN` and `SACTMIN`. Each row has either a value for `SPOSTMIN` or `SACTMIN` but not both. - -[loweralpha] -.. How many total rows of data do we have? -.. How many non-null rows for `SPOSTMIN`? -.. How many non-null rows for `SACTMIN`? -.. Combine columns `SPOSTMIN` and `SACTMIN` to create a new variable named `newcolumn` -.. What is the length of `newcolumn`? Is that the same as the number of rows in the `disney` dataframe? - -.Helpful Hints -[%collapsible] -==== -It might be useful to use the `combine_first` function for question 3d: - -https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.combine_first.html -==== - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answer to questions a,b,c,d,e -==== - -=== FOUR - -++++ - -++++ - -[loweralpha] -.. 
Find the max and min `SACTMIN` time for each ride -.. Find the max and min `SPOSTMIN` time for each ride -.. Find the average `SPOSTMIN` time for each ride -.. Find the average `SACTMIN` time for each ride - -.Helpful Hint -[%collapsible] -==== -Note that the value `-999` indicates that the attraction was closed. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answer to questions a-d -==== - -=== FIVE - -++++ - -++++ - -++++ - -++++ - -[loweralpha] -.. Find the date that each ride was most frequently checked. -.. What was the most commonly closed ride? (Again, note that the value `-999` indicates that the attraction was closed.) - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answer to questions a and b -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project10.adoc b/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project10.adoc deleted file mode 100644 index 28862b347..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project10.adoc +++ /dev/null @@ -1,129 +0,0 @@ -= TDM 10200: Project 10 -- Spring 2023 - - -**Motivation:** Being able to create accurate and telling visualizations is a skill that we can develop. Being able to analyze and create good visualizations is an invaluable tool. - -**Context:** We are going to take a bit of a pause and work on learning some ways to create visualizations. Examine some plots, write about them, and then use your creative minds to create visualizations about the data. - - -**Scope:** python, visualizing data - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -`/anvil/projects/tdm/data/beer/beers.csv` - - - -.Insider Knowledge -[%collapsible] -==== -Python has several packages that help with creating data visualizations. Listed below are some of the most popular packages, these include (but are not limited to) -* Matplotlib: a 2-D plotting library - * Works with NumPy arrays and allows for a large number of plots to help easier understand trends and make correlations. It is *not* ideal for time series data -* Plotly: allows for the creation of easy to understand interactive plots. - * Has 40 unique chart and plot types, but is not beginner friendly -* GGplot: One of the more popular in the Python library It maps data and allows for attributes to be changed including color, shape, and even geometric objects. - * Can store data in a dataframe, you can build informative visualizations because of the different ways you can represent the data. -* Pygal: Allows the download of visualizations into different formats. Can be used to create an interactive experience. - * It can become slow if it has too large of number of data points, but it allows users to still create wonderful visualizations even in complex problems. 
-* Geoplotlib: Buildable maps and plot geographical data using this library. It is able to use large datasets. - * Has the ability to create various maps, including dot maps, heat maps, area maps, and point density maps. -* Plotnine: Based on `R's` ggplot2 package, it supports the creation of complex plots from data in a dataframe. -* Seaborn: Based on matplotlib. It can efficiently represent data that is stored in a table, array, list and other data structures. -==== - -=== ONE - -++++ - -++++ - -https://www.amazon.com/dp/0985911123/[Creating More Effective Graphs] by Dr. Naomi Robbins and https://www.amazon.com/Elements-Graphing-Data-William-Cleveland/dp/0963488414/ref=sr_1_1?dchild=1&keywords=elements+of+graphing+data&qid=1614013761&sr=8-1[The Elements of Graphing Data] by Dr. William Cleveland at Purdue University, are two excellent books about data visualization. Read the following excerpts from the books (respectively), and list 2 things you learned (or found interesting) from each book (i.e., 4 things you learned altogether). - -https://thedatamine.github.io/the-examples-book/files/CreatingMoreEffectiveGraphs.pdf[Excerpt 1] - -https://thedatamine.github.io/the-examples-book/files/ElementsOfGraphingData.pdf[Excerpt 2] - - -.Items to submit -==== -- Answers as to what two things you found interesting or learned from each book excerpt. -==== - -=== TWO - -++++ - -++++ - -Data visualizations are an important part of data analysis. Visualizations can summarize large amounts of data in a picture. There are many ways to choose to represent the data. Sometimes the hardest part of the analysis process is finding which chart is best to represent your data. - -Read in the dataset `/anvil/projects/tdm/data/beer/beers.csv`. Take a moment to look at the columns and the head and tail of the data. What is some information that you could use to create a data visualization? - -Give 3 different types of information that you could show from this dataframe. Then tell which type of chart you would use and why. - -Example: Show a country's beer lineup and the average ratings for each beer. - -.Insider Knowledge -[%collapsible] -==== -Common reasons you would use data visualizations: - * showing change over time - * showing part-to-whole composition - * showing how data is distributed - * comparing values between groups - * observing relationships between variables - * looking at geographical data -https://chartio.com/learn/charts/how-to-choose-data-visualization//["How to Choose the Right Data Visualization"] by Mike Yi and Mel Restori - -https://chartio.com/learn/charts/essential-chart-types-for-data-visualization/["Essential Chart Types for Data Visualization"] by Mike Yi and Mary Sapountzis -==== - -.Items to submit -==== -- 3 types of information you would show from this dataframe and which chart you would choose to represent it. -==== - -=== THREE - -++++ - -++++ - -Using the `pycountry` library, take the `country` abbreviations and create a new column with the full country names. Rename the country column to `country_code` and the new column should be named `country_name`. - -Find how many times each type of beer (in the `style` column) occurs in each country. Create a new dataframe with this information. - -Make a visualization that compares beers made in eastern Europe. 
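-
-A hedged sketch of the renaming and lookup steps is given below. It assumes the values in the `country` column are standard ISO country codes (check a few values first); `pycountry.countries.lookup()` is used because it accepts either two-letter or three-letter codes and raises `LookupError` for anything it does not recognize.
-[source, python]
-----
-import pandas as pd
-import pycountry
-
-beers = pd.read_csv('/anvil/projects/tdm/data/beer/beers.csv')
-beers = beers.rename(columns={'country': 'country_code'})
-
-def code_to_name(code):
-    """Translate a country code to a full country name, leaving unknown codes unchanged."""
-    try:
-        return pycountry.countries.lookup(str(code)).name
-    except LookupError:
-        return code
-
-beers['country_name'] = beers['country_code'].apply(code_to_name)
-
-# count how many times each beer style occurs in each country
-style_counts = beers.groupby(['country_name', 'style']).size().reset_index(name='count')
-----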
- -.Helpful Hint -[%collapsible] -==== -Eastern European countries include Albania, Bosnia and Herzegovina, Bulgaria, Croatia, the Czech Republic, Estonia, Hungary, Kosovo, Latvia, Lithuania, the Republic of North Macedonia, Moldova, Montenegro, Poland, Romania, Serbia, Slovakia, Slovenia and Ukraine. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== FOUR -Take the 3 things that you listed in question *TWO* and go ahead and create the visualizations for each of them. - -.Items to submit -==== -- The three visualizations -- Code used to solve this problem. -- Output from running the code. -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project11.adoc b/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project11.adoc deleted file mode 100644 index 394cf1a42..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project11.adoc +++ /dev/null @@ -1,159 +0,0 @@ -= TDM 10200: Project 11 -- Spring 2023 - - -**Motivation:** Learning how to merge/join dataframes to access more information. - -**Scope:** python, pandas, os - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -'/anvil/projects/tdm/data/fars' - - -=== ONE - -++++ - -++++ - -Go ahead and list what is in `/anvil/projects/tdm/data/fars`. You do not need to read in all of the files. The `fars` directory contains years including 1975-2017. Each year also contains at least 3 CSV files. The one that we will be looking at is the `ACCIDENTS.CSV` - -[loweralpha] -.. List what files are in the year 1985 -.. Read in the `ACCIDENTS.CSV` and then go ahead and change the values in the `YEAR` column from two digits to four digits. For example, we should change `89` to `1989`. Do this by adding a `19` to each year value. -.. Now combine the `MONTH`, `DAY`, `YEAR` columns into a new column called `DATE` - -.Helpful Hint (for b) -[%collapsible] -==== -We can append strings to every value in a column by first converting the column to `str` using `astype` then use the `+` operator: -[source,python] ----- -myDF["myCol"].astype(str) + "appending_this_string" ----- - -* append in coding takes an object and adds it to an existing list -==== - - -.Helpful Hint (for c) -[%collapsible] -==== -If you see the numbers 99 or 9 it is an indicator that the information is unknown. -If you want to learn more https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/813251[see here] -==== - - - -.Items to submit -==== -- Answers to the questions a,b,c above. -- Code used to solve this problem. -- Output from running the code. -==== - -=== TWO - -++++ - -++++ - -What we want to do now is create a Dataframe called `accidents` that joins the `ACCIDENT.CSV` files from the years 1985-1989 (inclusive) into one large Dataframe. - - -.Insider Knowledge -[%collapsible] -==== -The `Pandas` library has three main functions that combine data. 
+ -*merge()* is typically used for combining data based on common columns or indices. Merge is similar to the join function in SQL. Important to note that merge() will default to an inner join unless specified. + -*join()* is typically used for combining data based on a key column or an index. + -*concat()* is typically used for combining *Dataframes* across rows or columns. + - -There are several different forms of `joins` we will just discuss two here. - -* inner-will return only matching rows from the tables, you will lose the rows that do not have a match in the other Dataframe's key column. -* outer- will return every row from both the left and right dataset. If the left dataset does not have a value for a specific row it will be left empty rather than the entire row be removed same goes for the right dataset - - -A great visual can be found https://3.bp.blogspot.com/-JlOyxor09jk/UAJrk_wvGxI/AAAAAAAAABI/lRilqPIw82I/s1600/Visual_SQL_JOINS.jpg[here] -==== - -.Items to submit -==== -- Answers to the question above -- Code used to solve this problem -- Output from running the code. -==== - -=== THREE - -++++ - -++++ - -Using the new `accidents` Dataframe that you just created, let's take a look at some of the data. - -[loweralpha] -.. Change the values in the `YEAR` column from a 2 digit year to a 4 digit year, like we did in the last question, but using a different method. -.. How many accidents are there in which one or more drunk drivers were involved in an accident with a school bus? - -.Helpful Hint (for a) -[%collapsible] -==== -use the `to_datetime` function -[source, python] ----- -df[''] = pd.to_datetime(df[''], format='%y').dt.strftime('%Y') ----- -==== - -.Helpful Hint (for b) -[%collapsible] -==== -look at the specifically the variables `DRUNK_DR` and `SCH_BUS` -==== - -.Items to submit -==== -- Answers to the two questions -- Code used to solve this problem. -- Output from running the code. -==== - -=== FOUR - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -[loweralpha] -.. Find how many accidents happen in total per year between 1 or more drunk drivers and school bus. - ... what year had the lowest number of accidents - ... what year had the most number of accidents -.. Now we want to consider which days of the week had the most accidents occur -.. Is there a time of day where you see more accidents? Using 12am-6am/ 6am-12pm/ 12pm-6pm/ 6pm-12am as your time frames. - -.Items to submit -==== -- Answers to the 3 questions above -- Code used to solve this problem. -- Output from running the code. -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project12.adoc b/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project12.adoc deleted file mode 100644 index 75fd2e081..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project12.adoc +++ /dev/null @@ -1,245 +0,0 @@ -= TDM 10200: Project 12 -- Spring 2023 - - -**Motivation:** Learning classes in Python - -**Scope:** Object Oriented Python - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Objects and Classes in Python - -A class can be considered a outline made by the user for creating objects. An easy way to think of it is class is a blueprint or sketch of a of a car, it contains all of the details we need to know, windows, doors, engine etc. Based on all the details and descriptions we build the vehicle. The car itself is the object. Since we can have many cars created using the same description and details the idea is that we can create many objects from a class. An object is an `instance` of the class with actual values. An object is an (actual) specific car. A red compact car with black interior and tinted windows. You can now create multiple instances, using the blueprint of a class as a guide, to help you know what information is required. - -Objects consist of: - -* State: attributes/properties of an object -* Behavior: the methods of the object and also its response to other objects -* Identity: gives a unique name to the object and that allows for interaction between objects. - -.Vehicle Example -|=== -| Identity | State/Attributes | Behavior -| Miata -| Mazda, Red, 2019 -| acceleration, steering, braking -|=== - -Instantiating a Class == Declaring Objects -All instances share the same attributes and the same behavior but the values of the attributes(State) are unique. - -.Insider Knowledge -[%collapsible] -==== -* https://www.programiz.com/python-programming/class[Python Objets and Classes] - -* https://www.geeksforgeeks.org/python-classes-and-objects/[Python Classes and Objects] -==== - -=== ONE - -++++ - -++++ - -Let's start simple and declare an object -[source,python] ----- -class CAR: - #attibute - attr1 = "red" - attr2 = "fast" - - #sample method - #self refers to the class of which the method belongs - def fun(self): - print("The car is", self.attr1) - print("The car is", self.attr2) -#object instantiaton -Miata = CAR() ----- -[loweralpha] -.. What happens when you type: `print(Miata)` -.. What happens when you type: `Miata.fun()` -.. Now create and declare your own object. We encourage you to make a class of your own and try it out, e.g., make a class for a `BOOK` or a `KITCHEN` or a `UNIVERSITY` or a `DOG`. - -You might want to also experiment further with the Miata, e.g., you might want to try: `Miata.attr1` and `Miata.attr2` and add some attributes. It is worthwhile to play a little bit with the `CAR`! - -.Items to submit -==== -- Answers to the questions a,b,c above. -- Code used to solve this problem. -- Output from running the code. -==== - - -=== TWO - -++++ - -++++ - -Lets play with cards! 
+ -The code below defines two different types of classes: - -[source, python] ----- -class Card: -#mapping each possible card number(2-10,J,Q,K,A) to a numerical value - _value_dict = {"2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8":8, "9":9, "10": 10, "j": 11, "q": 12, "k": 13, "a": 14} - def __init__(self, number, suit): - if str(number).lower() not in [str(num) for num in range(2, 11)] + list("jqka"): - raise Exception("Number wasn't 2-10 or J, Q, K, or A.") - else: - self.number = str(number).lower() - if suit.lower() not in ["clubs", "hearts", "diamonds", "spades"]: - raise Exception("Suit wasn't one of: clubs, hearts, spades, or diamonds.") - else: - self.suit = suit.lower() - - def __str__(self): - return(f'{self.number} of {self.suit.lower()}') - - def __repr__(self): - return(f'Card(str({self.number}), "{self.suit}")') - - def __eq__(self, other): - if self.number == other.number: - return True - else: - return False - - def __lt__(self, other): - if self._value_dict[self.number] < self._value_dict[other.number]: - return True - else: - return False - - def __gt__(self, other): - if self._value_dict[self.number] > self._value_dict[other.number]: - return True - else: - return False - - def __hash__(self): - return hash(self.number) - -class Deck: - brand = "Bicycle" - _suits = ["clubs", "hearts", "diamonds", "spades"] - _numbers = [str(num) for num in range(2, 11)] + list("jqka") - - def __init__(self): - self.cards = [Card(number, suit) for suit in self._suits for number in self._numbers] - - def __len__(self): - return len(self.cards) - - def __getitem__(self, key): - return self.cards[key] - - def __setitem__(self, key, value): - self.cards[key] = value ----- - -[loweralpha] -.. Create an instance of the class Card and call it `my_card`. Then run: `print(my_card)` and also run: `my_card` -.. What is the difference in the output? -.. Create an instance of the class Deck and call it `my_deck`. Now what is the number of items you will find in the object `my_deck` - -.Helpful Hint (for c) -[%collapsible] -==== -[source, python] ----- -len(my_deck) ----- -==== - -It is important to point out that a Python function inside a `class` is called a method. -We can initialize values using constructors there is an - -[source, python] ----- -__int__() ----- - -function that is called whenever a new object of that class is instantiated. - -=== THREE - -++++ - -++++ - -Modify the Class Deck to return a string that says "a bicycle deck with 52 cards". - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== FOUR - -++++ - -++++ - -Let's create a new class called `Player` We will use this to represent a player in a game. -The following features must be included: - -* A `deck` to draw from -* A `hand` of cards -* The `name` of the player -* A `draw` method that draws a card from the deck and adds it to the hand. - -.Helpful Hint -[%collapsible] -==== -Knowing that each person will have a different name, the `name` attribute will be an instance attribute. The `name` argument will be used to assign a name to a player, and the `deck` argument is used to assign the deck to the player. The hand of cards should be an empty list at initialization. The draw method will be used to draw a card from the deck and add it to a player's hand. -==== - -.Items to submit -==== -- Answers to the question above -- Code used to solve this problem -- Output from running the code. -==== - -=== FIVE - -++++ - -++++ - -What card does Liz draw? 
Create a `Deck` and a `Player`, and draw a card from the deck. Print the value on the card that is drawn. - -.Helpful Hint -[%collapsible] -==== -[source, python] ----- -my_deck1 = Deck() -player1 = Player("Liz", my_deck1) -card = player1.draw() -print(card) ----- -==== - - -.Items to submit -==== -- The answer to the question above. -- Code used to solve this problem. -- Output from running the code. -==== - - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project13.adoc b/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project13.adoc deleted file mode 100644 index 5a37b0478..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project13.adoc +++ /dev/null @@ -1,230 +0,0 @@ -= TDM 10200: Project 13 -- Spring 2023 - - -**Motivation:** Removing 'stopwords' and creating 'WordClouds' to help us get a better feel for a large amount of text. These are first steps in enabling us to understand the sentiment of some text. Such techniques can be helpful when looking at text, for instance, at reviews. - - -**Scope:** python, nltk, wordcloud - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) -When launching Juypter Notebook on Anvil you will need to use 4 cores. - -The following questions will use the following dataset(s): - -`/anvil/projects/tdm/data/amazon/amazon_fine_food_reviews.csv` - - -.Helpful Hints -[%collapsible] -==== -[source,python] ----- -import pandas as pd -finefood = pd.read_csv("/anvil/projects/tdm/data/amazon/amazon_fine_food_reviews.csv") ----- -==== - -=== ONE - -++++ - -++++ - -Once you have read in the dataset, take a look at the `.head()` of the data. - - -[loweralpha] -.. Immediately we see two columns that might be interesting: `HelpfulnessNumerator` and `HelpfulnessDenominator`. What do you think those mean, and what would they (potentially) be used for? -.. What is the `user id` of review number 23789? -.. How many duplicate `ProfileName` values are there? (I am not asking for which values are duplicated but just the total number of duplicated `ProfileName` values; it is helpful to explain your answer for this one.) - -.Helpful Hint -[%collapsible] -==== -[source,python] ----- -df.columnname.duplicated().sum() ----- -==== - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answer to questions a,b,c -==== - -=== TWO - -++++ - -++++ - -Now we are going to focus on three more columns: - -* `Score` : customer's product rating -* `Text` : the full review written by the customer - -We can see that the rating system is a numerical value in the range 0-5. A rating of *0* is the worst rating available and *5* is the best rating available. -We want to start by getting a feel for the ratings, e.g., do we have more negative than positive reviews? The easiest way to see this is to plot the data. - -[loweralpha] -.. What type of visualization did you choose to represent the score data? 
-.. Why did you choose it? -.. What do you notice about the results? - -.Insider Information -[%collapsible] -==== -Common reasons we would want to use data visualizations is to (this is not an exhaustive list) - -* show change over time (bar charts, line charts, box plots) -* compare a part to the whole (pie chart, stacked bar chart, stacked area charts) -* we want to see how the data is distributed (bar chart, histogram, box plot, etc., you have freedom to choose and explore) -* when we want to compare values amongst different groups (bar chart, dotplot, line chart, grouped bar chart) -* when we are observing relationships variables (scatter plot, bubble chart, heatmaps) - -==== -.Helpful Hint -[%collapsible] -==== -[source,python] ----- -#for a histogram -import matplotlib.pyplot as plt -import seaborn as sns -color = sns.color_palette() -%matplotlib inline -import plotly.offline as py -py.init_notebook_mode(connected=True) -import plotly.graph_objs as go -import plotly.tools as tls -import plotly.express as px - -fig = px.histogram(df, x="columnname") -fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)', - marker_line_width=1.5) -fig.update_layout(title_text='Whateveryouwanttonameit') -fig.show() - -#for a piechart -import matplotlib.pyplot as plt -rating_counts = df["columnname"].value_counts() -plt.pie(rating_counts, labels=rating_counts.index) -plt.title("whateveryouwantotnameit") -plt.show() ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== THREE - -++++ - -++++ - -In the `Text` column, you will see that there are a lot of commonly used words. Specifically, in English, we are talking about "the", "is", "and" etc. What we want to do is eliminate the unimportant words, so that we can focus on the words that will tell us more information. + -An example: "There is a dog over by the creek, laying down by the tree" -When we remove the stop words we will get: - -There + -dog + -over + -creek + -laying + -down + -tree + - -This allows us to focus on information that can be used for classification or clustering. -There are several Natural Language Processing libraries that you can use to remove stop words. - -* Stop words with NLTK -* Stop words with Gensim -* Stop words with SpaCy - -Go ahead and remove the stop words from the column "Text" in our dataframe . - - -.Helpful Hint -[%collapsible] -==== -[source,python] ----- -import nltk -from nltk.corpus import stopwords -nltk.download('stopwords') ----- -==== - -.Insider Knowledge -[%collapsible] -==== -A few resources to read up on about stop words: - -* https://machinelearningmastery.com/clean-text-machine-learning-python/[Cleaning Text for Machine Learning with Python] -* https://kavita-ganesan.com/what-are-stop-words/#.ZDgbB1LMKAQ[What are Stop Words?] -==== -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answer to the question -==== - -=== FOUR - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -A Word Cloud is a way that we can represent text. The size of each word indicates its frequency/importance. - -Now that we have removed the stop words, we can focus on finding the significant words, to see if they are more positive or negative in sentiment. -[loweralpha] -.. Create a wordcloud from the column "Text" that should have all the stop words taken out of it. -.. 
Are there any additional "stop words" or words that are unimportant to your analysis that you could take out (an example could be cant, gp, br, hef, etc)? -.. Take out those additional stop words and then create a new wordcloud. What do you notice? - -.Helpful Hint -[%collapsible] -==== -[source, python] ----- -from wordcloud import WordCloud - -#ways to add new stop words to the list -.append() -.extend(newlist) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answer to the questions -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project14.adoc b/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project14.adoc deleted file mode 100644 index 8bcc834f1..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-project14.adoc +++ /dev/null @@ -1,193 +0,0 @@ -= TDM 10200: Project 14 -- Spring 2023 - - -**Motivation:** As the last project of the semester, lets take a moment and do something that caters to those that have a bit more of a creative artsy side. Data is fun, and data visualizations are a great way to include our creativity. -Last project we did a word cloud. This project we are going to do a word cloud but we are going to change the way it is shown. - - - -**Scope:** python, nltk, wordcloud, matplotlib.pyplot, numpy, PIL - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) -When launching Juypter Notebook on Anvil you will need to use 4 cores. - -The following questions will use the following dataset(s): - -`/anvil/projects/tdm/data/icecream` - -`/anvil/projects/tdm/data/icecream/icecream.png` - -`/anvil/projects/tdm/data/icecream/bj/products.csv` - -`/anvil/projects/tdm/data/icecream/combined/reviews.csv` - -=== ONE - -++++ - -++++ - -We are going to look at Ben and Jerry's product information. There are two CSV files: - -* `combined/reviews.csv` -* `bj/products.csv` - -Take a look at the head of both of the dataframes, to get familiar with the data in these two files. Then consider: - -[loweralpha] -.. What are the column names for `combined/reviews.csv`? What are the column names for `bj/products.csv`? -.. What column do they have in common? -.. Go ahead and merge these two data frames based on the common column (do not merge on a column that has `NaN`), and save the results from the merge as a new dataframe. -.. Find a second way to merge the data frames. - -.Helpful Hint -[%collapsible] -==== -[source,python] ----- -(pd.merge(df1, df2, on='column')) ----- -==== - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answer to questions a,b,c,d -==== - -=== TWO - -++++ - -++++ - -Now that we have merged the two dataframes, let's take a look at the columns again. - -[loweralpha] -.. What do you notice about the column ingredients that was originally in the products dataframe for the `bj` data? 
Why do you think this happened? -.. What happens and why if you tried to merge the dataframes on the column ingredients? -.. What should we do instead, if we want to merge on ingredients? - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answers to a,b,c -==== - -=== THREE - -++++ - -++++ - -Let's create a word cloud with the ingredients in all of the Ben and Jerry icecream. -Remove all the stop words. Afterwards, we want to focus on the words that appear most frequently. -Go ahead and play with the parameters of the WordCloud function. We want you to get comfortable with WordCloud. - -.Insider Information -[%collapsible] -==== -* max_font_size: This argument defines the maximum font size for the biggest word. If none, adjust as image height. -* max_words: It specifies the maximum number of the word, default is 200. -* background_color: It set up the background color of the word cloud image, by default the color is defined as black. -* colormap: using this argument we can change each word color. Matplotlib colormaps provide awesome colors. -* background_color: It is used for the background color of the word cloud image. -* width/height: we can change the dimension of the canvas using these arguments. Here we assign width as 3000 and height as 2000. -* random_state: It will return PIL(python imaging library) color for each word, set as an int value. -==== - -.Helpful Hint -[%collapsible] -==== -[source,python] ----- - -import matplotlib.pyplot as plt -from wordcloud import WordCloud, ImageColorGenerator -from PIL import Image -import numpy as np -from wordcloud import STOPWORDS - -import nltk -from nltk.probability import FreqDist ----- -==== - - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Answer to the question -==== - -=== FOUR - -++++ - -++++ - -Now for the best part, let's create a custom shape. The WordCloud function has an argument called `mask` that enables it to take maskable images and use them as the outline the word cloud we created. -In the dataset there is an image named `icecream.png`. This image meets the requirement of having a background that is completely white (the color code is `#ffffff`). -Go ahead and create the word cloud! -What do you see? - -.Helpful Hint -[%collapsible] -==== -image::figure52.webp[colorful word cloud in the shape of an ice-cream cone, width=792, height=500, loading=lazy, title="colorful word cloud in the shape of an ice-cream cone"] -==== - - - -.Insider Information -[%collapsible] -==== -* mask: Specify the shape of the word cloud image. By default, it takes a rectangle. -* Contour_width: This parameter creates an outline of the word cloud mask. -* Contour_color: Contour_color use for the outline color of the mask image. -==== - -.Helpful Hint -[%collapsible] -==== -[source, python] ----- -# Load the image mask -icecream_mask = np.array(Image.open('path')) - -# Extract the text to use for the word cloud -text = " ".join(str(each) for each in df.columnname) - -# Create a WordCloud object with the mask -wordcloud = WordCloud(max_words=200, colormap='Set1', background_color="white", mask=icecream_mask).generate(text) - -# Display the word cloud on top of the image -fig, ax = plt.subplots(figsize=(8, 6)) -ax.imshow(wordcloud, interpolation="bilinear") -ax.axis('off') - -plt.show() ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-- Answer to the questions -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-projects.adoc b/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-projects.adoc deleted file mode 100644 index eaa7e9a4d..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/10200/10200-2023-projects.adoc +++ /dev/null @@ -1,47 +0,0 @@ -= TDM 10200 - -== Project links - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$10200-2023-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:55pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -[NOTE] -==== -Piazza links remain the same from Fall 2022 to Spring 2023. -==== - -=== Sign up - -https://piazza.com/purdue/fall2022/tdm10100[https://piazza.com/purdue/fall2022/tdm10100] - -=== Link - -https://piazza.com/purdue/fall2022/tdm10100/home[https://piazza.com/purdue/fall2022/tdm10100/home] - - -== Syllabus - -Navigate to the xref:spring2023/logistics/syllabus.adoc[syllabus]. diff --git a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project01.adoc b/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project01.adoc deleted file mode 100644 index 0391c9e64..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project01.adoc +++ /dev/null @@ -1,258 +0,0 @@ -= TDM 20200: Project 1 -- 2023 - -**Motivation:** Extensible Markup Language or XML is a very important file format for storing structured data. Even though formats like JSON, and csv tend to be more prevalent, many, many legacy systems still use XML, and it remains an appropriate format for storing complex data. In fact, JSON and csv are quickly becoming less relevant as new formats and serialization methods like https://arrow.apache.org/faq/[parquet] and https://developers.google.com/protocol-buffers[protobufs] are becoming more common. - -**Context:** In this project we will use the `lxml` package in Python. 
This is the first project in a series of 5 projects focused on web scraping in Python. - -**Scope:** python, XML - -.Learning objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Match XML terms to sections of XML demonstrating working knowledge. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/otc/hawaii.xml` - -== Questions - -[IMPORTANT] -==== -It would be well worth your time to: - -. Read the https://the-examples-book.com/starter-guides/data-formats/html[HTML section of the book]. -. Read through the https://the-examples-book.com/starter-guides/data-formats/xml[XML section of the book]. -. Finally, take the time to work through https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html[`pandas` 10 minute intro]. -==== - -=== Question 1 - -++++ - -++++ - -++++ - -++++ - -[TIP] -==== -Check out this old project that uses a different dataset -- you may find it useful for this project. - -https://thedatamine.github.io/the-examples-book/projects.html#p01-290 -==== - -One of the challenges of XML is that it can be hard to get a feel for how the data is structured -- especially in a large XML file. A good first step is to find the name of the root node. Use the `lxml` package to find and print the name of the root node. - -Interesting! If you took a look at the previous project, you _probably_ weren't expecting the extra `{urn:hl7-org:v3}` part in the root node name. This is because the previous project's dataset didn't have a namespace! Namespaces in XML are a way to prevent issues where a document may have multiple sets of node names that are identical but have different meanings. The namespaces allow them to exist in the same space without conflict. - -Practically what does this mean? It makes XML parsing ever-so-slightly more annoying to perform. Instead of being able to enter XPath expressions and return elements, we have to define a namespace as well. This will be made more clear later. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -XML can be nested -- there can be elements that contain other elements that contain other elements. In the previous question, we identified the root node AND the namespace. Just like in the previous Spring 2021 project 1 (linked in the "tip" in question 1), we would like you to find the names of the next "tier" of elements. - -This will not be a copy/paste of the previous solution. Why? Because of the namespace! - -First, try to use the same method from question (2) from https://thedatamine.github.io/the-examples-book/projects.html#p01-290[this project] to find the next tier of names. What happens? - -[source,python] ----- -hawaii.xpath("/document") # won't work -hawaii.xpath("{urn:hl7-org:v3}document") # still won't work with the namespace there ----- - -How do we fix this? We must define our namespace, and reference it in our XPath expression. For example, the following will work. - -[source,python] ----- -hawaii.xpath("/ns:document", namespaces={'ns': 'urn:hl7-org:v3'}) ----- - -Here, we are passing a dict to the namespaces argument. The key is whatever we want to call the namespace, and the value is the namespace itself. For example, the following would work too. 
- -[source,python] ----- -hawaii.xpath("/happy:document", namespaces={'happy': 'urn:hl7-org:v3'}) ----- - -So, unfortunately, _every_ time we want to use an XPath expression, we have to prepend `namespace:` before the name of the element we are looking for. This is a pain, and unfortunately we cannot just define it once and move on. - -Okay, given this new information, please find the next "tier" of elements. - -[TIP] -==== -There should be 8. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -Okay, lucky for you, this XML file is not so big! Use your UNIX skills you refined last semester to print the content of the XML file. You can print the entirety in a `bash` cell if you wish. - -You will be able to see that it contains information about a drug of some sort. - -Knowing now that there are `ingredient` elements in the XML file. Write Python code, and an XPath expression to get a list of all of the `ingredient` elements. Print the list of elements. - -[NOTE] -==== -When we say "print the list of elements", we mean to print the list of elements. For example, the first element would be: - ----- - - - - DIBASIC CALCIUM PHOSPHATE DIHYDRATE - - ----- -==== - -To print an `Element` object, see the following. - -[source,python] ----- -print(etree.tostring(my_element, pretty_print=True).decode('utf-8')) ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ - -++++ - -++++ - -++++ - -At this point in time you may be wondering how to actually access the bits and pieces of data in the XML file. - -There is data between tags. - -[source,xml] ----- -DIBASIC CALCIUM PHOSPHATE DIHYDRATE ----- - -To access such data from the "name" `Element` (which we will call `my_element` below) you would do the following. - -[source,python] ----- -my_element.text # DIABASIC CALCIUM PHOSPHATE DIHYDRATE ----- - -There is also data tucked away in a tag's attributes. - -[source,xml] ----- - ----- - -To access such data from the "name" `Element` (which we will call `my_element` below) you would do the following. - -[source,python] ----- -my_element.attrib['code'] # O7TSZ97GEP -my_element.attrib['codeSystem'] # 2.16.840.1.113883.4.9 ----- - -The aspect of XML that we are interested in learning about are XPath expressions. XPath expressions are a clear and effective way to extract elements from an XML document (or HTML document -- think extracting data from a webpage!). - -In the previous question you used an XPath expression to find all of the `ingredient` elements, regardless where they were or how they were nested in the document. Let's practice more. - -If you look at the XML document, you will see that there are a lot of `code` attributes. Use `lxml` and XPath expressions to first extract all elements with a `code` attribute. Print all of the values of the `code` attributes. - -Repeat the process, but modify your **XPath expression** (not your Python code, just the XPath expression) so that it only keeps elements that have a `code` attribute that starts with a capital "C". Print all of the values of the `code` attributes. - -[TIP] -==== -You can use the `.attrib` attribute to access the attributes of an `Element`. It is a dict-like object, so you can access the attributes similarly to how you would access the values in a dictionary. 
-==== - -[TIP] -==== -https://stackoverflow.com/questions/6895023/how-to-select-xml-element-based-on-its-attribute-value-start-with-heading-in-x/6895629[This] link may help you when figuring out how to select the elements where the `code` attribute must start with "C". -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -++++ - -++++ - -The `quantity` element contains a `numerator` and a `denominator` element. Print all of the quantities in the XML file, where a quantity is defined as the value of the `value` attribute of the `numerator` element divided by the value of the `value` attribute of the corresponding `denominator` element. Lastly, print the `unit` (part of the `numerator` element afterwards. - -[TIP] -==== -The results should read as follows: - ----- -1.0 1 -5.0 g -7.6 mg -5.0 g -4.0 g -230.0 mg -4.0 g ----- -==== - -[TIP] -==== -You may need to use the `float` function to convert the string values to floats. -==== - -[TIP] -==== -You can use the `xpath` method on an `Element` object. When doing so, if you want to limit the scope of your XPath expression, make sure to start the xpath with ".//ns:" this will start the search from within the element instead of searching the entire document. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project02.adoc b/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project02.adoc deleted file mode 100644 index 81f78ad36..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project02.adoc +++ /dev/null @@ -1,229 +0,0 @@ -= TDM 20200: Project 2 -- 2023 - -**Motivation:** Web scraping is is the process of taking content off of the internet. Typically this goes hand-in-hand with parsing or processing the data. Depending on the task at hand, web scraping can be incredibly simple. With that being said, it can quickly become difficult. Typically, students find web scraping fun and empowering. - -**Context:** In the previous project we gently introduced XML and xpath expressions. In this project, we will learn about web scraping, scrape data from a news site, and parse through our newly scraped data using xpath expressions. - -**Scope:** Python, web scraping, XML - -.Learning Objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Use the requests package to scrape a web page. -- Use the lxml package to filter and parse data from a scraped web page. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -You will be extracting your own data from online in this project -- there is no provided dataset. - -== Questions - -=== Question 1 - -++++ - -++++ - -++++ - -++++ - -The Washington Post is a very popular news site. 
Open a modern browser (preferably Firefox or Chrome), and navigate to https://www.washingtonpost.com. - -[NOTE] -==== -Throughout this project, I will be referencing text and tools from Firefox. If you want the easiest experience, I'd recommend using Firefox for at least this project. -==== - -By the end of this project you will be able to scrape some data from this website! The first step is to explore the structure of the website. - -To begin exploring the website structure right click on the webpage and select "View Page Source". This will pull up a page full of HTML. This is the HTML used to render the page. - -Alternatively, if you want to focus on a single element on the web page, for example, an article title, right click on the title and select "Inspect". This will pull up an inspector that allows you to see portions of the HTML. - -Click around on the website and explore the HTML however you like. - -Open a few of the articles shown on the front page of the paper. Note how many of the articles start with some key information like: category, article title, picture, picture caption, authors, article datetime, etc. - -For example: - -https://www.washingtonpost.com/health/2022/01/19/free-n95-masks/ - -image::figure33.webp[Article components, width=792, height=500, loading=lazy, title="Article components"] - -Copy and paste the `header` element that is 1 level nested in the `main` element into a markdown cell in an HTML code block. Include _just_ the tag with the attributes -- don't include the elements nested within the `header` element. - -List the _keys_ of the _attributes_ of the `header` element. What are the _values_ of the _attributes_ of the `header` element? - -Do the same for the `article` element that is 1 level nested in the `main` element (after the `header` element). - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -++++ - -++++ - -In question (1) we copied two elements of an article. When scraping data from a website, it is important to continually consider the patterns in the structure. Specifically, it is important to consider whether or not the defining characteristics you use to parse the scraped data will continue to be in the same format for new data. What do I mean by defining characterstic? I mean some combination of tag, attribute, and content from which you can isolate the data of interest. - -For example, given a link to a new Washington Post article, do you think you could isolate the article title by using the `class` attribute, `class="b-l br-l mb-xxl-ns mt-xxs mt-md-l pr-lg-l col-8-lg mr-lg-l"`? Maybe, or maybe not. It looks like those classes are used to structure the size, font, and other parts of the article. In a different article those may change, or maybe they wouldn't be _unique_ within the page (for example, if another element had the same set of classes in the same page). - -Take a minute to re-read the two paragraphs above. This is one of the key skills needed in order to consistently scrape data from a website. Websites change, and you need to do your best to use the parts of the webpage that are most likely to stay the same, to isolate the data you want to scrape. - -Write an XPath expression to isolate the article title, and another XPath expression to isolate the article summary or sub headline. - -[IMPORTANT] -==== -You do _not_ need to test your XPath expression yet, we will be doing that shortly. 
If your solution ends up being wrong in this question, you will have a chance to fix it in the next question. -==== - -[NOTE] -==== -Remember the goal of the XPath expression is to write it in such a way that we can take _any_ Washington Post article and extract the data we want. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -Use the `requests` package to scrape the web page containing our article from questions (1) and (2). Use the `lxml.html` package and the `xpath` method to test out the XPath expressions you created in question (2). Use the expressions to extract the element, then print the _contents_ of the elements (what is between the tags). Did they work? Print the element contents to confirm. If they didn't, see the third tip below, and take the time to write new XPath expressions that work. - -[TIP] -==== -Check out https://the-examples-book.com/programming-languages/python/lxml#examples[these] examples for instructions on how to do this. -==== - -[TIP] -==== -Pass `stream=True` to the `requests` package `get` method. In addition, set `resp.raw.decode_content = True` to ensure that the content is decoded properly. - -[source,python] ----- -resp = requests.get(some_url, stream=True) -resp.raw.decode_content = True -# etc... ----- -==== - -[TIP] -==== -If your XPath expressions included the use of the `data-*` attributes, great job! You can read about the `data-*` attributes https://the-examples-book.com/starter-guides/data-formats/html#attributes[here]. `data-*` attributes are _typically_ kept as a website is updated, and are therefore a fairly reliable choice when trying to isolate data from a website. - -Search different articles on the same website to see if you can find the same `data-*` attributes you used to isolate the data. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ - -++++ - -Use your newfound knowledge of XPath expressions, `lxml`, and `requests` to write a function called `get_article_links` that scrapes the home page for The Washington Post, and returns 5 article links in a list. - -There are a variety of ways to do this, however, make sure it is repeatable, and _only_ returns article links. - -[TIP] -==== -Again, the `data-*` attributes are particularly useful for this problem. -==== - -[TIP] -==== -Here is some skeleton code to get you started: - -[source,python] ----- -import lxml.html -import requests - -def get_article_links(): - """ - Scrape the home page for The Washington - Post and return 5 article links. - """ - - # ... - - return links - -print(get_article_links()) ----- - -.example output ----- -['https://www.washingtonpost.com/climate-environment/2023/01/18/greenland-hotter-temperatures/', 'https://www.washingtonpost.com/climate-solutions/2023/01/18/coffee-pods-sustainability-environmental-impact/', 'https://www.washingtonpost.com/climate-environment/2023/01/18/jbs-food-giant-brazil-bonds/', 'https://www.washingtonpost.com/food/2023/01/17/spice-jar-germs/', 'https://www.washingtonpost.com/opinions/2023/01/16/republicans-whitewash-jan6-trump-insurrection/'] ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 5 - -++++ - -++++ - -Write a function called `get_article_info` that accepts a link to an article as an argument, and prints the information in the following format: - -.Example output ----- -Title: White House to distribute 400 million free N95 masks starting next week -Authors: Lena H. Sun, Dan Diamond -Time: January 19, 2022 at 5:00 a.m. EST ----- - -[IMPORTANT] -==== -Of course, the Time section may change, we used the "Published" date in our solution. -==== - -In a loop, test out the `get_article_info` function with the links that are returned by your `get_article_links` function. - -[source,python] ----- -for link in get_article_links(): - print("-----------------") - get_article_info(link) - print("-----------------\n") ----- - -If your code works for all 5 articles, that is repeatable enough for now! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project03.adoc b/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project03.adoc deleted file mode 100644 index 736f21078..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project03.adoc +++ /dev/null @@ -1,299 +0,0 @@ -= TDM 20200: Project 3 -- 2023 - -**Motivation:** Web scraping takes practice, and it is important to work through a variety of common tasks in order to know how to handle those tasks when you next run into them. In this project, we will use a variety of scraping tools in order to scrape data from https://zillow.com. - -**Context:** In the previous project, we got our first taste at actually scraping data from a website, and using a parser to extract the information we were interested in. In this project, we will introduce some tasks that will require you to use a tool that let's you interact with a browser, selenium. - -**Scope:** python, web scraping, selenium - -.Learning Objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Use the requests package to scrape a web page. -- Use the lxml package to filter and parse data from a scraped web page. -- Use selenium to interact with a browser in order to get a web page to a desired state for scraping. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - -++++ - - -Pop open a browser and visit https://zillow.com. Many websites have a similar interface -- a bold and centered search bar for a user to interact with. - -First, in your browser, type in `32607` into the search bar and press enter/return. There are two possible outcomes of this search, depending on the computer you are using and whether or not you've been browsing zillow. The first is your search results. The second a page where the user is asked to select which type of listing they would like to see. 
- -This second option may or may not consistently pop up. For this reason, we've included the relevant HTML below, for your convenience. - -[source,html] ----- -
<!-- simplified sketch of the interstitial markup; the exact tags and
     attributes shown here are assumptions, not Zillow's actual HTML -->
<div class="yui3-lightbox interstitial">
  <button>For sale</button>
  <button>For rent</button>
  <button>Skip this question</button>
</div>
----- - -[TIP] -==== -Remember that the _value_ of an element is the text that is displayed between the tags. For example, the following element has "happy" as its value. - -[source,html] ----- -
<div>happy</div>
----- - -You can use XPath expressions to find elements by their value. For example, the following XPath expression will find all `div` elements with the value "happy". - ----- -//div[text()='happy'] ----- -==== - -Use `selenium`, and write Python code that first finds the search bar `input` element of the https://zillow.com home page. Then, use `selenium` to emulate typing the zip code `32607` into the search bar followed by a press of the enter/return button. - -Confirm your code works by printing the current URL of the page _after_ the search has been performed. What happens? - -[TIP] -==== -To print the URL of the current page, use the following code. - -[source,python] ----- -print(driver.current_url) ----- -==== - -Well, it is likely that the URL is unchanged. Remember, the "For sale", "For rent", "Skip this question" page may pop up, and this page has the _same_ URL. To confirm this, instead of printing the URL, instead print the entirety of the HTML provided above. To do so, first use xpath expressions to isolate the outermost `div` element, then print the the entire element, including all of the nested elements. - -[TIP] -==== -To print the HTML of an element using `selenium`, you can use the following code. - -[source,python] ----- -element = driver.find_element("xpath", "//some_xpath") -print(element.get_attribute("outerHTML")) ----- - -If you don't know what HTML to expect, the `html` element is a safe bet. - -[source,python] ----- -element = driver.find_element("xpath", "//html") -print(element.get_attribute("outerHTML")) ----- - -Of course, please only print the isolated element -- we don't want to print it all -- that would be a lot! -==== - -[TIP] -==== -You can use the class 'yui3-lightbox interstitial'. -==== - -[TIP] -==== -Remember, in the background, `selenium` is actually launching a browser -- just like a human would. Sometimes, you need to wait for a page to load before you can properly interact with it. It is highly recommended you use the `time.sleep` function to wait 5 seconds after a call to `driver.get` or `element.send_keys`. -==== - -[TIP] -==== -One downside to selenium is it has some more boilerplate code than, `requests`, for example. Please use the following code to instantiate your `selenium` driver on Anvil. - -[source,python] ----- -import time -import uuid -from selenium import webdriver -from selenium.webdriver.firefox.options import Options -from selenium.webdriver.common.desired_capabilities import DesiredCapabilities -from selenium.webdriver.common.keys import Keys - -firefox_options = Options() -firefox_options.add_argument("--window-size=810,1080") -# Headless mode means no GUI -firefox_options.add_argument("--headless") -firefox_options.add_argument("--disable-extensions") -firefox_options.add_argument("--no-sandbox") -firefox_options.add_argument("--disable-dev-shm-usage") - -# create driver -driver = webdriver.Firefox(log_path=f"/tmp/{uuid.uuid4()}", options=firefox_options) - -# use driver here -# ... - -# close the driver -driver.quit() ----- - -Please feel free to "reset" your driver (for example, if you've lost track of "where" it is (what webpage, for example) or you aren't getting results you expected) by running the following code, followed by the code shown above. - -[source,python] ----- -driver.quit() - -# instantiate driver again ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 2 - -++++ - -++++ - -++++ - -++++ - -Okay, let's go forward with the assumption that we will always see the "For sale", "For rent", and "Skip this question" page. We need our code to handle this situation and click the "Skip this question" button so we can get our search results! - -Write Python code that uses `selenium` to find the "Skip this question" button and click it. Confirm your code works by printing the current URL of the page _after_ the button has been clicked. Is the URL what you expected? - -[TIP] -==== -Don't forget, it may be best to put a `time.sleep(5)` after the `click()` method call -- _before_ printing the current URL. -==== - -Uh oh! If you did this correctly, it is likely that the URL is not quite right -- something like: `https://www.zillow.com/homes/_rb/`, or maybe a captcha page. By default, websites employ a variety of anti-scraping techniques. On the bright side, we _did_ notice (when doing this search manually) that the URL _should_ look like: `https://www.zillow.com/homes/32607_rb/` -- we can just insert our zip code directly in the URL and that should work without any fuss, _plus_ we save some page loads and clicks. Great! Alternatively, you could also narrow down the search to homes "For Sale" by using `https://www.zillow.com/homes/for_sale/32607_rb/`. - -[NOTE] -==== -If you are paying close attention -- you will find that this is an inconsistency between using a browser manually and using `selenium`. `selenium` isn't saving the same data (cookies and local storage) as your browser is, and therefore doesn't "remember" the zip code you are search for after that intermediate "For sale", "For rent", and "Skip this question" step. Luckily, modifying the URL works better anyways. -==== - -Test out (using `selenium`) that simply inserting the zip code in the URL works as intended. Finding the `title` element and printing the contents should verify quickly that it works as intended. Bake this functionality into a function called `print_title` that takes a search term, `search_term`, and returns the contents of the `title` element. - -[source,python] ----- -element = driver.find_element("xpath", "//title") -print(element.get_attribute("outerHTML")) ----- - -[source,python] ----- -# make sure this works -print_title("32607") ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -++++ - -++++ - -Okay great! Take your time to open a browser to `https://www.zillow.com/homes/for_sale/32607_rb/` and use the Inspector to figure out how the web page is structured. For now, let's not worry about any of the filters. The main useful content is within the cards shown on the page. Price, number of beds, number of baths, square feet, address, etc., is all listed within each of the cards. - -What non `li` element contains the cards in their entirety? Use `selenium` and XPath expressions to extract those elements from the web page. Print the value of the `id` attributes for all of the cards. How many cards was there? (this _could_ vary depending on when the data was scraped -- that is ok). - -[TIP] -==== -You can use the `id` attribute in combination with the `starts-with` XPath function to find these elements, because each `id` starts with the same 4-5 letter prefix. 
- -Some examples of how to use `starts-with`: - ----- -//div[starts-with(@id, 'card_')] # all divs with an id attribute that starts with 'card_' -//div[starts-with(text(), 'okay_')] # all divs with content that starts with 'okay_' ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -++++ - -++++ - -How many cards were there? For me, there were just about 8. Let's verify things. Open a browser to `https://www.zillow.com/homes/for_sale/32607_rb/` and scroll down to the bottom of the page. How many cards are there? - -For me, there were _significantly_ more than 8. There were 40. Something is going on here. What is going on is lazy loading. What this means is the web page is only loading the first 8 cards, and then loading the rest of the cards as you scroll down the page. This is a common technique to reduce the initial load time of a web page. This is also the perfect scenario for us to really utilize the power of `selenium`. If we just use a package like `requests`, we are unable to scroll down the page and load the rest of the cards. - -Check out the function below called `load_all_cards` that accepts the `driver` as an argument, and scrolls down the page until all of the cards have been loaded. Examine the function and explain (in a markdown cell) what it is doing. In addition, use the function in combination with your code from the previous question to print the `id` attribute for all of the cards. How many cards were there this time? - -[source,python] ----- -def load_all_cards(driver): - cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]") - while True: - try: - num_cards = len(cards) - driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1]) - time.sleep(2) - cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]") - if num_cards == len(cards): - break - num_cards = len(cards) - except StaleElementReferenceException: - # every once in a while we will get a StaleElementReferenceException - # because we are trying to access or scroll to an element that has changed. - # this probably means we can skip it because the data has already loaded. - continue ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project04.adoc b/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project04.adoc deleted file mode 100644 index c408f16a9..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project04.adoc +++ /dev/null @@ -1,311 +0,0 @@ -= TDM 20200: Project 4 -- 2023 - -**Motivation:** Learning to scrape data can take time. We want to make sure you get comfortable with it! For this reason, we will continue to scrape data from Zillow to answer various questions. This will allow you to continue to get familiar with the tools, without having to re-learn everything about the website of interest. 
- -**Context:** This is the third project on web scraping, where we will continue to focus on honing our skills using `selenium`. - -**Scope:** Python, web scraping, selenium, matplotlib/plotly - -.Learning Objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Use the requests package to scrape a web page. -- Use the lxml package to filter and parse data from a scraped web page. -- Use the beautifulsoup4 package to filter and parse data from a scraped web page. -- Use selenium to interact with a browser in order to get a web page to a desired state for scraping. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -In the last question of the previous project -- we provided code that would emulate scrolling down slowing in the browser, giving all of the properties that appear in our Zillow search a chance to load. The final result was a list of `zpid_VALUE` to each of the 40 properties that appeared in our search. - -Here is the thing -- we want to get the data for all of the properties that appear in our search, not just the first 40. For example, if you load up https://www.zillow.com/homes/for_sale/32607_rb/ there are 56 listings, but only 40 appear on the first page of results. In fact, each page of results only contains 40 listings. In order to see the "next 40" listings, we need to click the "next" button at the bottom of the page, or a button that says "2", "3", "4", etc. This is a common technique websites use to break up many results into smaller, more manageable chunks. It is called "pagination". - -There are a variety of ways you can handle scraping pages that use pagination, depending on how the website is implemented. Sometimes a webpage will have a _query parameter_ that indicates what page of results are loaded. For example, you may see a page like https://example.com/?page=1. Well, if you manually change the page number from 1 to 2, it may show you the next set of results. This _can_ be the easiest way to handle pagination -- after all, if web pages are setup this way, you could use a package like `requests` without the need to utilize a browser emulator. - -Give it a shot -- can you look at the webpage and HTML and figure out if there is a way to craft the URL to display the second page of results for the zip code 32607? If you can, write a loop that scrapes both the first and second page and prints out all of the resulting `zpid_VALUE`. At the time of writing, there were about 56 total. - -[TIP] -==== -For convenience here is the rather long "setup" code to use selenium on Anvil. 
- -[source,python] ----- -import time -import uuid -from selenium import webdriver -from selenium.webdriver.firefox.options import Options -from selenium.webdriver.common.desired_capabilities import DesiredCapabilities -from selenium.webdriver.common.keys import Keys -from selenium.common.exceptions import StaleElementReferenceException - -firefox_options = Options() -firefox_options.add_argument("--window-size=810,1080") -# Headless mode means no GUI -firefox_options.add_argument("--headless") -firefox_options.add_argument("--disable-extensions") -firefox_options.add_argument("--no-sandbox") -firefox_options.add_argument("--disable-dev-shm-usage") - -driver = webdriver.Firefox(options=firefox_options) -driver.quit() ----- -==== - -[TIP] -==== -If you hover over the page 2 button, there may be a hint at the end of the URL in the `` that would be worth adding to your URL. In Firefox, you can hover over the button and a link will appear in the lower left-hand corner of the browser. -==== - -[TIP] -==== -I wrote a function to make the solution more clear. Here is some start code you can use if you want. - -[source,python] ----- -def get_zpid(search_term: str, page: int = 1): - - def _load_all_cards(driver): - cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]") - while True: - try: - num_cards = len(cards) - driver.execute_script('arguments[0].scrollIntoView();', cards[num_cards-1]) - time.sleep(2) - cards = driver.find_elements("xpath", "//article[starts-with(@id, 'zpid')]") - if num_cards == len(cards): - break - num_cards = len(cards) - except StaleElementReferenceException: - # every once in a while we will get a StaleElementReferenceException - # because we are trying to access or scroll to an element that has changed. - # this probably means we can skip it because the data has already loaded. - continue - - driver = webdriver.Firefox(options=firefox_options) - # TODO: add a call to driver.get here - time.sleep(5) - _load_all_cards(driver) - - # loop through cards and append zpid to results - - driver.quit() - - return results ----- - -[source,python] ----- -zpids = [] -for i in range(2): - zpids.extend(get_zpid("32607", i+1)) - -zpids ----- - -.expected output (or close) ----- -['zpid_58879763', - 'zpid_42717098', - 'zpid_54478236', - 'zpid_58879802', - 'zpid_70737074', - 'zpid_42719676', - 'zpid_2069654950', - 'zpid_42717501', - 'zpid_66690088', - 'zpid_42718511', - 'zpid_42716336', - 'zpid_42719800', - 'zpid_42718955', - 'zpid_82053142', - 'zpid_42717633', - 'zpid_42716062', - 'zpid_42717813', - 'zpid_70737079', - 'zpid_42716226', - 'zpid_42719564', - 'zpid_42719508', - 'zpid_42718336', - 'zpid_70737207', - 'zpid_2060617221', - 'zpid_87624811', - 'zpid_2059830219', - 'zpid_42716488', - 'zpid_42716708', - 'zpid_2060491271', - 'zpid_42716533', - 'zpid_333248349', - 'zpid_66702765', - 'zpid_58880069', - 'zpid_42717050', - 'zpid_42716171', - 'zpid_42717159', - 'zpid_42719707', - 'zpid_2060421486', - 'zpid_2061764814', - 'zpid_70737130', - 'zpid_2060614103', - 'zpid_138087779', - 'zpid_66695681', - 'zpid_2060102431', - 'zpid_2060614457', - 'zpid_2060772247', - 'zpid_2060613859', - 'zpid_2061808737', - 'zpid_42717815', - 'zpid_2060932429', - 'zpid_2060422629', - 'zpid_2067830782', - 'zpid_2061601655', - 'zpid_245827979', - 'zpid_2077628862', - 'zpid_42718849'] ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -What we did the previous question isn't always possible. 
In addition, our stopping criteria is not clear. What do we mean by this? Well, how many pages are available? If there are only 2 pages, and we ask for the 3rd page, you'll notice zillow will bring you back to the first page. Modify your code from the previous question to handle this situation. Package everything into a nice function called `get_zpids` that takes a search term and returns a list of all the zpids for all of the pages, no matter how many there are. - -[WARNING] -==== -I would recommend adding the following print statement at the beginning of the `_load_all_cards` function. - -[source,python] ----- -print(driver.current_url) ----- - -Why? It _may_ turn out that this strategy of navigating to the next page is not a good one. If you get a captcha page, this can be a sign to change your strategy. Note that if you do, this is OK, just print the current url so that we can see it is a captcha page, and move on to the next question. -==== - -[TIP] -==== -Notice how if you are on the _last_ page of listings, the "next" arrow is greyed out. Use the browsers inspector to investigate. What is the attribute that causes the button to be greyed out? You can use this to determine if you are on the last page. -==== - -[TIP] -==== -[source,python] ----- -get_zpids("32607") ----- - -.expected output (or close) ----- -['zpid_58879763', - 'zpid_42717098', - 'zpid_54478236', - 'zpid_58879802', - 'zpid_70737074', - 'zpid_42719676', - 'zpid_2069654950', - 'zpid_42717501', - 'zpid_66690088', - 'zpid_42718511', - 'zpid_42716336', - 'zpid_42719800', - 'zpid_42718955', - 'zpid_82053142', - 'zpid_42717633', - 'zpid_42716062', - 'zpid_42717813', - 'zpid_70737079', - 'zpid_42716226', - 'zpid_42719564', - 'zpid_42719508', - 'zpid_42718336', - 'zpid_70737207', - 'zpid_2060617221', - 'zpid_87624811', - 'zpid_2059830219', - 'zpid_42716488', - 'zpid_42716708', - 'zpid_2060491271', - 'zpid_42716533', - 'zpid_333248349', - 'zpid_66702765', - 'zpid_58880069', - 'zpid_42717050', - 'zpid_42716171', - 'zpid_42717159', - 'zpid_42719707', - 'zpid_2060421486', - 'zpid_2061764814', - 'zpid_70737130', - 'zpid_2060614103', - 'zpid_138087779', - 'zpid_66695681', - 'zpid_2060102431', - 'zpid_2060614457', - 'zpid_2060772247', - 'zpid_2060613859', - 'zpid_2061808737', - 'zpid_42717815', - 'zpid_2060932429', - 'zpid_2060422629', - 'zpid_2067830782', - 'zpid_2061601655', - 'zpid_245827979', - 'zpid_2077628862', - 'zpid_42718849'] ----- -==== - -[TIP] -==== -One potential way to handle the control flow is to use an infinite while loop. You can use the `break` statement to exit the loop if some criteria is met, otherwise, you can use the `continue` statement to skip the rest of the loop (if there is any) and go to the next iteration. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -So far, pretty cool (or maybe disappointing, depending on your results)! Being able to navigate pagination programmatically is important. As it turns out, its not normal for a human to do the equivalent of typing `page=1`, `page=2`, etc., and then clicking enter to navigate to the next page. As such, it is likely you received a captcha page. One way to potentially handle this is to delete all of the cookies from the browser right before you are about to navigate to the next page. - -Modify you code from the previous question to clear the cookies prior to navigating to the next page. If you previously received a captcha page, does it work now? 
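[TIP]
====
A minimal sketch of the cookie-clearing idea (here `driver` and `next_url` are stand-ins for your webdriver instance and the URL of the next page of results):

[source,python]
----
# wipe all cookies for the current browsing session,
# then navigate to the next page as usual
driver.delete_all_cookies()
driver.get(next_url)
time.sleep(5)  # give the page a few seconds to load (time is imported in the setup code above)
----
====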
- -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Even though our cookie delete trick _should_ have worked from the previous question -- depending on how the website is setup, it may not. Another way we could make our behavior more human-like would be to click the "next" button instead of doing the equivalent of typing in `page=2` with the URL and hitting enter. - -Modify you code from the previous question to _click_ the "next" button on the page to navigate to the next page, instead of using `driver.get` to navigate to the next page. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project05.adoc b/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project05.adoc deleted file mode 100644 index fb2eac6b4..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project05.adoc +++ /dev/null @@ -1,140 +0,0 @@ -= TDM 20200: Project 5 -- 2023 - -**Motivation:** Learning to scrape data can take time. We want to make sure you get comfortable with it! For this reason, we will continue to scrape data from Zillow to answer various questions. This will allow you to continue to get familiar with the tools, without having to re-learn everything about the website of interest. - -**Context:** This is the fourth project on web scraping, where we will continue to focus on honing our skills using selenium. - -**Scope:** Python, web scraping, selenium, matplotlib/plotly - -.Learning Objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Use the requests package to scrape a web page. -- Use the lxml package to filter and parse data from a scraped web page. -- Use the beautifulsoup4 package to filter and parse data from a scraped web page. -- Use selenium to interact with a browser in order to get a web page to a desired state for scraping. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -In the previous couple of projects, you've been utilizing `selenium` to scrape `zpid` information from Zillow. This begs the question -- how can we use those values? For example, the following is a sample of data you may have received from one of your scraping functions. - -.example output ----- -['zpid_58879763', - 'zpid_42717098', - 'zpid_54478236'] ----- - -How can you use these? Well, what about just pasting the value into the URL, maybe: https://zillow.com/homes/for_sale/zpid_58879763/. What happens? Well, it doesn't really work, unfortunately. However, if you browse Zillow a bit, you may realize that when you click on individual properties, the number and "zpid" part are reversed. 
Let's test this out: https://zillow.com/homes/for_sale/58879763_zpid/. It works! This pulls up a webpage containing all of the details of the property! That is pretty cool. - -Take your favorite version of the `get_zpids` function from the previous project, and tweak it slightly, so instead of returning lists of `zpid` values like: - -.example output ----- -['zpid_58879763', - 'zpid_42717098', - 'zpid_54478236'] ----- - -Instead, return a list of URLs like: - -.example output ----- -['https://zillow.com/homes/for_sale/58879763_zpid/', - 'https://zillow.com/homes/for_sale/42717098_zpid/', - 'https://zillow.com/homes/for_sale/54478236_zpid/'] ----- - -Great! Now, you can search Zillow, and get a (hopefully) robust list of property links that match your search term! - -[source,python] ----- -# example usages -get_zpids("West Lafayette, IN") -get_zpids("47933") -get_zpids("apple") ----- - -Test it out on a search term or two (in your Jupyter Notebook) and make sure it behaves like you expect. Test out a resulting link to make sure you get to see the details page of the property. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -One of the fun parts of scraping, we have largely taken away from you so far this semester. Scraping is fun because different people have different interests, and scraping allows you to assemble datasets based around those interests, which you can then analyze. We will give you more freedom to do this next project, and a _little_ bit more starting now. - -Scrape any data you want from at least 50 Zillow detail pages that you find interesting, and put together a fun analysis. This does not need to be anything extensive (unless you want it to be), but worthy of the 1 credit hour of work for seminar. - -Start with this question. Pose a hypothesis about the data on the zillow details pages that you are interested in. For example, you could hypothesize that home values have risen by a larger rate in one zip code to another. Alternatively, you could also pose a challenge. For example, you could say you want to build a model that predicts the price of a home in X years based on the current price, and other features. - -Basically, just come up with something that you want to try and answer or do. Then, in a markdown cell, write it out in enough detail for someone to easily understand what you are trying to do. - -[IMPORTANT] -==== -The point of this project is _not_ to have good results or accurate conclusions. The point is to think, explore your hard-earned data you scraped, and maybe take the opportunity to learn something new. We have all sorts of backgrounds in The Data Mine -- you may feel comfortable building some sort of model, or maybe not at all! That's okay! You can still do something cool with the data you scraped, or put together a completely anecdotal analysis with some neat visualizations. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Scrape the required data from the Zillow detail pages. This will be the data you will use to try and answer your question or use to build your model. From within your Jupyter Notebook, scrape the data and display a sample of it. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -For this question, if you chose to do something like build a model, build the model. 
It doesn't need to be good or fit the data well; just do your best to create _something_ you can feed your scraped data to and get a potentially cool result.
-
-If you went another route, like trying to demonstrate a hypothesis, then do that. Use any tool you want for this, but make sure to include the code you used to do it, and demonstrate how you are using the data you scraped to try and answer your question.
-
-For both, please include a markdown cell that briefly summarizes what you did or tried to do.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 5
-
-Finally, create at least 1 graphic to support your work. If you did something like build a model, maybe you could create a graphic that shows the historical price of a property and then uses your model to project the price X years into the future.
-
-If you did something like try to demonstrate a hypothesis, maybe you could create a graphic that shows the price of a property in one zip code compared to another, and then show the difference in price over time.
-
-For both, please include a markdown cell that briefly explains what conclusions (if any) you drew, and how, if you had the time, you would try to improve your work.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project06.adoc b/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project06.adoc
deleted file mode 100644
index 375404c54..000000000
--- a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project06.adoc
+++ /dev/null
@@ -1,223 +0,0 @@
-= TDM 20200: Project 6 -- 2023
-
-**Motivation:** In this project we continue to focus on your web scraping skills, introduce you to some common deceptions, and give you the chance to apply what you've learned to something you are interested in.
-
-**Context:** This is the last project focused on web scraping.
-We have created a few problematic situations that can come up when you are first learning to scrape data. Our goal is to share some tips on how to get around these issues.
-
-**Scope:** Python, web scraping
-
-.Learning Objectives
-****
-- Use the requests package to scrape a web page.
-- Use the lxml/selenium package to filter and parse data from a scraped web page.
-- Learn how to step around header-based filtering.
-- Learn how to handle rate limiting.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Questions
-
-[WARNING]
-====
-Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE
-====
-
-=== Question 1
-
-We have set up 4 special URLs for you to use: https://static.the-examples-book.com, https://ip.the-examples-book.com, https://header.the-examples-book.com, https://sheader.the-examples-book.com. Each website uses a different method to _rate limit_ traffic.
-
-Rate limiting is an issue that comes up often when scraping data. When a website notices that its pages are being navigated faster than a human could navigate them, that speed raises a red flag that someone could be scraping the site's content. At that point, the website may introduce a rate limit to prevent web scraping.
-
-Open a browser and navigate to https://sheader.the-examples-book.com. Once there, you should be presented with some basic information about the request.
-
-Now let's check what happens if you open up your Jupyter notebook, import the `requests` package, and scrape the webpage.
-
-You _should_ be presented with HTML that indicates your request was blocked.
-
-https://sheader.the-examples-book.com is designed to block all requests where the User-Agent header has "requests" in it. By default, the `requests` package will use a User-Agent header with a value like "python-requests/2.28.2", which has "requests" in it.
-
-Backing up a little bit, _headers_ are part of your _request_. In general, you can think of headers as some extra data that gives the server or client some context about the request. You can read about headers https://developer.mozilla.org/en-US/docs/Glossary/Request_header[here]. You can find a list of the various headers https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers[here].
-
-Each header has a purpose. One common header is called https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent[User-Agent]. A user-agent is something like:
-
-----
-User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:86.0) Gecko/20100101 Firefox/86.0
-----
-
-From the Mozilla link, this header is a string that "lets servers and network peers identify the application, operating system, and browser that is requesting a resource." Basically, if you are browsing the internet with a browser like Firefox or Chrome, the server will know which browser you are using. In the provided example, we are using Firefox 86 from Mozilla, on a Mac running Mac OS 10.16 with an Intel processor.
-
-When making a request using the `requests` package, the headers look like the following.
-
-[source,python]
-----
-import requests
-
-response = requests.get("https://sheader.the-examples-book.com")
-print(response.request.headers)
-----
-
-.Output
-----
-{'User-Agent': 'python-requests/2.28.2', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}
-----
-
-As you can see, the User-Agent header has the word "requests" in it, so it will be blocked.
-
-You can set the headers to be whatever you'd like using the `requests` package. Simply pass a dictionary containing the headers you'd like to use to the `headers` argument of `requests.get`. Modify the headers so you are able to scrape the response. Print the response using the following code.
-
-[source,python]
-----
-my_response = requests.get(...)
-print(my_response.text)
-----
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-++++
-
-++++
-
-Navigate to https://header.the-examples-book.com. Refresh the page a few times. As you do, you will notice that the "Cf-Ray" header changes.
-Write a function called `get_ray` that accepts a URL as an argument, scrapes the _value_ of the Cf-Ray header, and returns that text.
-
-Run the following code.
- -[source,python] ----- -for i in range(6): - print(get_ray('https://header.the-examples-book.com')) ----- - -What happens then? Now pop open the webpage in a browser and refresh the page 6 times in rapid succession, what do you see? - -Run the following code again, but this time use a different header. - -[source,python] ----- -for i in range(6): - print(get_ray('https://header.the-examples-book.com', headers={...})) ----- - -This website is designed to adapt and block requests if they have the same header and make requests too quickly. Create a https://github.com/tamimibrahim17/List-of-user-agents[list] of valid user agents and modify your code to utilize them to get 50 "Cf-Ray" values rapidly (in a loop). - -[TIP] -==== -You may want to modify `get_ray` to accept a `headers` argument. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Navigate to https://ip.the-examples-book.com. This page is designed to allow only 5 requests every minute from a single IP address. To verify that this is true go ahead and rapidly refresh the page 6 times in a row, then (without wifi) try to load the page on your cell phone immediately after. You will notice that the cell phone loads, but the browser doesn't. - -IP blocking is one of the most common ways to block traffic. Websites will monitor web activity and use complicated algorithms to block IP addresses that appear to be scraping data. The solution will be that we need to scrape content at a certain pace, or figure out a way to use different IP addresses. - -Simply scraping content at a certain pace will not work. Unfortunately even if we randomize periods of time between scraping values, algorithms that are used are clever. - -The best way to bypass IP blocking is to use a different IP address. We can accomplish this by using a proxy server. A proxy server is another computer that will pass the request on for you. The relayed request is now made from behind the proxy servers IP address. - -The following code attempts to scrape some free proxy servers. - -[source,python] ----- -import lxml.html - -def get_proxies(): - url = "https://www.sslproxies.org/" - resp = requests.get(url) - root = lxml.html.fromstring(resp.text) - trs = root.xpath("//div[contains(@class, 'fpl-list')]//table//tr") - proxies_aux = [] - for e in trs[1:]: - ip = e.xpath(".//td")[0].text - port = e.xpath(".//td")[1].text - proxies_aux.append(f"{ip}:{port}") - - proxies = [] - for proxy in proxies_aux[:25]: - proxies.append({'http': f'http://{proxy}', 'https': f'http://{proxy}'}) - - return proxies ----- - -Play around with the code and test proxy servers out until you find one that works. The following code should help get you started. - -[source,python] ----- -p = get_proxies() -resp = requests.get("https://ip.the-examples-book.com", proxies=p[0], verify=False, headers={'User-Agent': f"{my_user_agents[0]}"}, timeout=15) -print(resp.text) ----- - -A couple of notes: - -- `timeout` is set to 15 seconds, because it is likely the proxy will not work if it takes longer than 15 seconds to respond. -- We set a user-agent header so some proxy servers won't automatically block our requests. - -You can stop once you receive and print a successful response. As you will see, unless you pay for a working set of proxy servers, it is very difficult to combat having your IP blocked. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 4 - -++++ - -++++ - -Test out https://static.the-examples-book.com. This page is designed to only allow x requests per period of time, regardless of the IP address or headers. - -Write code that scrapes 50 Cf-Ray values from the page. If you attempt to scrape them too quickly, you will get an error. Specifically, `response.status_code` will be 429 instead of 200. - -[source,python] ----- -resp = requests.get("https://static.the-examples-book.com") -resp.status_code # will be 429 if you scrape too quickly ----- - -Different websites have different rules, one way to counter this defense is by exponential backoff. Exponential backoff is a system whereby you scrape a page until you receive some sort of error, then you wait x seconds before scraping again. Each time you receive an error, the wait time increases exponentially. - -There is a really cool package that does this for us! Use the https://pypi.org/project/backoff/[backoff] package to accomplish this task. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -For full credit you can choose to do either option 1 or option 2. - -**Option 1:** Figure out how many requests (_r_) per time period (_p_) you can make to https://static.the-examples-book.com. Keep in mind that the server will only respond to _r_ requests per time period (_p_) -- this means that fellow students requests will count towards the quota. Figure out _r_ and _p_. Answers do not need to be exact. - -**Option 2:** Use your skills to scrape data from a website we have not yet scraped. Once you have the data create something with it, you can create a graphic, perform some sort of analysis etc. The only requirement is that you scrape at least 100 "units". A "unit" is 1 thing you are scraping. For example, if scraping baseball game data, I would need to scrape the height of 100 players, or the scores of 100 games, etc. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project07.adoc b/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project07.adoc deleted file mode 100644 index f6355a75c..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project07.adoc +++ /dev/null @@ -1,112 +0,0 @@ -= TDM 20200: Project 7 -- 2023 - -**Motivation:** Being able to analyze and create good visualizations is a skill that is invaluable in many fields. It can be pretty fun too! In this project, we are going to dive into plotting using `plotly` with an open project. - -**Context:** We've been working hard all semester and learning a lot about web scraping. In this project we are going to ask you to examine some plots, write a little bit, and use your creative energies to create good visualizations about the flight data using the go-to plotting library for many, `plotly`. In the next project, we will continue to learn about and become comfortable using `plotly`. 
- -**Scope:** Python, plotly, visualizations - -.Learning Objectives -**** -- Demonstrate the ability to create basic graphs with default settings. -- Demonstrate the ability to modify axes labels and titles. -- Demonstrate the ability to customize a plot (color, shape/linetype). -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/*.csv` - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -http://stat-computing.org/dataexpo/2009/posters/[Here] is the description of the 2009 Data Expo poster competition. The object of the competition was to visualize interesting information from the flights dataset. - -The winner of the competition were: - -- First place: https://llc.stat.purdue.edu/airline/wicklin-allison.pdf[Congestion in the sky: Visualising domestic airline traffic with SAS (pdf, 550k)] Rick Wicklin and Robert Allison, SAS Institute. - -- Second place: https://llc.stat.purdue.edu/airline/hofmann-cook.pdf[Delayed, Cancelled, On-Time, Boarding ... Flying in the USA (pdf, 13 meg)] Heike Hofmann, Di Cook, Chris Kielion, Barret Schloerke, Jon Hobbs, Adam Loy, Lawrence Mosley, David Rockoff, Yuanyuan Huang, Danielle Wrolstad and Tengfei Yin, Iowa State University. - -- Third place: https://llc.stat.purdue.edu/airline/wickham.pdf[A tale of two airports: An exploration of flight traffic at SFO and OAK. (pdf, 770k)] Charlotte Wickham, UC Berkeley. - -- Honourable mention: https://llc.stat.purdue.edu/airline/dey-phillips-steele.pdf[Minimizing the Probability of Experiencing a Flight Delay (pdf, 7 meg)] Tanujit Dey, David Phillips and Patrick Steele, College of William & Mary. - -The other posters were: - -- https://llc.stat.purdue.edu/airline/kane-emerson.pdf[The Airline Data Set... What's the big deal? (pdf, 80k)] Michael Kane and Jay Emerson, Yale. - -- https://llc.stat.purdue.edu/airline/sun.pdf[Make a Smart Choice on Booking Your Flight! (pdf, 2 meg)] Yu-Hsiang Sun, Case Western Reserve University. - -- https://llc.stat.purdue.edu/airline/crotty.pdf[Airline Data for Raleigh-Durham International] Michael T. Crotty, SAS Institute Inc. - -- https://llc.stat.purdue.edu/airline/jiang.pdf[What Airlines Would You Avoid for Your Next Flight?] Haolai Jiang and Jung-Chao Wang, Western Michigan University. - -Examine all 8 posters and write a single sentence for each poster with your first impression(s). An example of an impression that will not get full credit would be: "My first impression is that this poster is bad and doesn't look organized.". An example of an impression that will get full credit would be: "My first impression is that the author had a good visualization-to-text ratio and it seems easy to follow along.". - -.Items to submit -==== -- 8 bullets, each containing a sentence with the first impression of the 8 visualizations. Order should be "first place", to "honourable mention", followed by "other posters" in the given order. Or, label which graphic each sentence is about. -==== - -=== Question 2 - -https://www.amazon.com/dp/0985911123/[Creating More Effective Graphs] by Dr. Naomi Robbins and https://www.amazon.com/dp/0963488414/[The Elements of Graphing Data] by Dr. 
William Cleveland at Purdue University are two excellent books about data visualization. Read the following excerpts from the books (respectively), and list 2 things you learned or found interesting from each book.
-
-- https://thedatamine.github.io/the-examples-book/files/CreatingMoreEffectiveGraphs.pdf[Excerpt 1]
-- https://thedatamine.github.io/the-examples-book/files/ElementsOfGraphingData.pdf[Excerpt 2]
-
-.Items to submit
-====
-- Two bullets for each book with items you learned or found interesting.
-====
-
-=== Question 3
-
-Of the 7 posters with at least 3 plots and/or maps, choose 1 poster that you think you could improve upon or "out plot". Create 4 plots/maps that either:
-
-. Improve upon a plot from the poster you chose, or
-. Show a completely different plot that does a good job of getting an idea or observation across, or
-. Ruin a plot. Purposefully break the best practices you've learned about in order to make the visualization misleading. (limited to 1 of the 4 plots)
-
-For each plot/map where you choose to do (1), include 1-2 sentences explaining what exactly you improved upon and how. Point out some of the best practices from the 2 provided texts that you followed.
-
-For each plot/map where you choose to do (2), include 1-2 sentences explaining your graphic and outlining the best practices from the 2 texts that you followed.
-
-For each plot/map where you choose to do (3), include 1-2 sentences explaining what you changed, what principle it broke, and how it made the plot misleading or worse.
-
-While we are not asking you to create a poster, please use Jupyter notebooks to keep your plots, code, and text nicely formatted and organized. The more like a story your project reads, the better. In this project, we are restricting you to use https://plotly.com/python/plotly-express/[`plotly`] in Python. While there are many interesting plotting packages like `matplotlib` and `plotnine`, we really want you to take the time to dig into `plotly` and learn as much as you can.
-
-.Items to submit
-====
-- All associated Python code you used to wrangle the data and create your graphics.
-- 4 plots (and the Python code to produce the plots).
-- 1-2 sentences per plot explaining what exactly you improved upon, what best practices from the texts you used, and how. If it is a brand new visualization, describe and explain your graphic, outlining the best practices from the 2 texts that you followed. If it is the ruined plot you chose, explain what you changed, what principle it broke, and how it made the plot misleading or worse.
-====
-
-=== Question 4
-
-Now that you've been exploring data visualization, copy, paste, and update your first impressions from question (1) with your updated impressions. Which impression changed the most, and why?
-
-.Items to submit
-====
-- 8 bullets with updated impressions (still just a sentence or two) from question (1).
-- A sentence explaining which impression changed the most and why.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project08.adoc b/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project08.adoc deleted file mode 100644 index 028c276f6..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project08.adoc +++ /dev/null @@ -1,132 +0,0 @@ -= TDM 20200: Project 8 -- 2023 - -**Motivation:** Being able to analyze and create good visualizations is a skill that is invaluable in many fields. It can be pretty fun too! In this project, we will create plots using the `plotly` package, as well as do some data manipulation using `pandas`. - -**Context:** This is the second project focused around creating visualizations in Python. - -**Scope:** plotly, Python, pandas - -.Learning Objectives -**** -- Demonstrate the ability to create basic graphs with default settings. -- Demonstrate the ability to modify axes labels and titles. -- Demonstrate the ability to customize a plot (color, shape/linetype). -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/disney/total.parquet` -- `/anvil/projects/tdm/data/disney/metadata.csv` - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -Read in the data from the parquet file `/anvil/projects/tdm/data/disney/total.parquet` and store it in a variable called `dat`. Do the same for `/anvil/projects/tdm/data/disney/metadata.csv` and call it `meta`. - -Plotly express makes it really easy to create nice, clean graphics, and it integrates with `pandas` superbly. You can find links to all of the plotly express functions on https://plotly.com/python/plotly-express/[this] page. - -Let's start out simple. Create a bar chart for the total number of observations for each ride. Make sure your plot has labels for the x axis, y axis, and overall plot. - -[WARNING] -==== -While the default plotly plots look amazing and have great interactivity, they won't render in your notebook well in Gradescope. For this reason, please use `fig.show(renderer="jpg")` for all of your plots, otherwise they will not show up in gradescope and you will not get full credit. -==== - -[TIP] -==== -You can use `fig.update_xaxes` to make your x axis labels be at an angle and reduce the font size so that they are smaller. Search for "Set axis label rotation and font" on https://plotly.com/python/axes/[this page]. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Great! Wouldn't it be interesting to see how the total number of observations changes over time, by ride? - -Create a single plot that contains a bar plot for all of the rides, with the total number of observations for each ride by year. Okay, maybe not _all_ of the rides -- if you try, you may notice that the graphic becomes quite cluttered. Instead choose 6 rides you are interested to include on the plot. - -[TIP] -==== -https://plotly.com/python/bar-charts/[This] page has a good example of making facetted subplots. -==== - -[TIP] -==== -To convert the `datetime` column to a datetime type, you can use `dat["datetime"] = pd.to_datetime(dat["datetime"])`. 
-==== - -[TIP] -==== -First, create a new column called `year` based on the `year` from the `datetime` column. - -Next, group by both the `ride_name` and `year` columns. Use the `count` method to get the total number of observations for each combination of `ride_name` and `year`. After that, use the `reset_index` method so that both `ride_name` and `year` become columns again (instead of indices). - -The x axis should be the `year`, y axis could be `datetime` (which actually contains the _count_ of observations), the color argument should be `year`, `facet_col` should be `ride_name`, and you can limit the number of plots per column by specifying `facet_col_wrap` to be 3 (for 3 plots per row). -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Create a plot that shows the association between the average `SPOSTMIN` and `WDWMAXTEMP` for a ride of your choice. Some notes to help you below. - -. Create a new column in `dat` called `day` that is the date. Make sure to use `pd.to_datetime` to convert the date to the correct type. -. Use the `groupby` method to group by both `ride_name` and `day`. Get the average by using the `mean` method on your grouped data. In order to make `ride_name` and `day` columns instead of indices, call the `reset_index` method. Finally, use the `query` method to subset your data to just be data for your given ride. -. Convert the `DATE` column in `meta` to the correct type using `pd.to_datetime`. -. Use the `merge` method to merge your grouped data with the metadata on `day` (from the grouped data) and `DATE` (from `meta`). -. Make the scatterplot. - -Is there an obvious relationship between the two variables for your chosen ride? Did you expect the results? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -This is an extremely rich dataset with lots and lots of plotting potential! In addition, there are a lot of interesting questions you could ask about wait times and rides that could actually be useful if you were to visit Disney! - -Create a graphic using a plot we have not yet used from https://plotly.com/python/plotly-express/[this] webpage. Make sure to use proper labels, and make sure the graphic shows some sort of _potentially_ interesting relationship. Write 1-2 sentences about why you decided to create this plot. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Ask yourself a question regarding the information in the dataset. For example, maybe you think that certain events from the `meta` dataframe will influence a certain ride. Perhaps you think the time the park opens is relevant to the time of year? Write down the question you would like to answer using a plot. Choose the type of plot you are going to use, and write 1-2 sentences explaining your reasoning. Create the plot. What were the results? Was the plot an effective way to answer your question? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. 
- -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project09.adoc b/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project09.adoc deleted file mode 100644 index cafadc42f..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project09.adoc +++ /dev/null @@ -1,104 +0,0 @@ -= TDM 20200: Project 9 -- 2023 - -**Motivation:** Being able to analyze and create good visualizations is a skill that is invaluable in many fields. It can be pretty fun too! In this project, we will create plots using the `plotly` package, as well as do some data manipulation using `pandas`. - -**Context:** This is the second project focused around creating visualizations in Python. - -**Scope:** plotly, Python, pandas - -.Learning Objectives -**** -- Demonstrate the ability to create basic graphs with default settings. -- Demonstrate the ability to modify axes labels and titles. -- Demonstrate the ability to customize a plot (color, shape/linetype). -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/open_food_facts/openfoodfacts.tsv` -- `/anvil/projects/tdm/data/restaurant/test_full.csv` -- `/anvil/projects/tdm/data/stackoverflow/unprocessed/2021.csv` -- `/anvil/projects/tdm/data/disney/total.parquet` -- `/anvil/projects/tdm/data/beer/*.parquet` - - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -Read `/anvil/projects/tdm/data/open_food_facts/openfoodfacts.tsv` into a `pandas` dataframe, and create a bar plot using `plotly` that shows the 10 foods with the largest carbon footprint. The x-axis should be the food name, and the y-axis should be the carbon footprint value. Notice anything odd about the results? How does `plotly` handle when there are identical names in the x-axis? - -[WARNING] -==== -Make sure to use `fig.show(renderer='jpg')` to display your plot, otherwise, the graders will not be able to see your plot, and you will lose credit. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -There are 4 other specified datasets for this project. Choose one that you have not yet chosen for a previous question, wrangle the data, and use `plotly` to create a graphic that is _completely new_ from any other graphic you have created in this project. For example, for this question, you can no longer use the `openfoodfacts.tsv` dataset, and you can no longer use a bar plot. - -The resulting plot _must_ be refined -- it should have a proper, cleaned up title, proper, cleaned up x and y-axis labels, and the plot should be easy to read. Add a single markdown cell describing what the plot is showing, and what you learned from it (if anything). - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -There are 4 other specified datasets for this project. 
Choose one that you have not yet chosen for a previous question, wrangle the data, and use `plotly` to create a graphic that is _completely new_ from any other graphic you have created in this project. - -The resulting plot _must_ be refined -- it should have a proper, cleaned up title, proper, cleaned up x and y-axis labels, and the plot should be easy to read. Add a single markdown cell describing what the plot is showing, and what you learned from it (if anything). - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -There are 4 other specified datasets for this project. Choose one that you have not yet chosen for a previous question, wrangle the data, and use `plotly` to create a graphic that is _completely new_ from any other graphic you have created in this project. For this question **please choose a plot from the "1D Distributions" section on https://plotly.com/python/plotly-express/[this page]**. - -The resulting plot _must_ be refined -- it should have a proper, cleaned up title, proper, cleaned up x and y-axis labels, and the plot should be easy to read. Add a single markdown cell describing what the plot is showing, and what you learned from it (if anything). - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -There are 4 other specified datasets for this project. Choose one that you have not yet chosen for a previous question, wrangle the data, and use `plotly` to create a graphic that is _completely new_ from any other graphic you have created in this project. For this question **please choose a plot from any of the sections below (and including) the "3-Dimensional" section on https://plotly.com/python/plotly-express/[this page]**. - -The resulting plot _must_ be refined -- it should have a proper, cleaned up title, proper, cleaned up x and y-axis labels, and the plot should be easy to read. Add a single markdown cell describing what the plot is showing, and what you learned from it (if anything). - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project10.adoc b/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project10.adoc deleted file mode 100644 index e5e7bcc46..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project10.adoc +++ /dev/null @@ -1,264 +0,0 @@ -= TDM 20200: Project 10 -- 2023 - -**Motivation:** The use of a suite of packages referred to as the `tidyverse` is popular with many R users. It is apparent just by looking at `tidyverse` R code, that it varies greatly in style from typical R code. It is useful to gain some familiarity with this collection of packages, in case you run into a situation where these packages are needed -- you may even find that you enjoy using them! - -**Context:** We've covered a lot of ground so far this semester, and almost completely using Python. 
In this next series of projects we are going to switch back to R with a strong focus on the `tidyverse` (including `ggplot`) and data wrangling tasks. - -**Scope:** R, tidyverse, ggplot - -.Learning Objectives -**** -- Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems. -- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows. -- Group data and calculate aggregated statistics using group_by, mutate, summarize, and transform functions. -- Demonstrate the ability to create basic graphs with default settings, in `ggplot`. -- Demonstrate the ability to modify axes labels and titles. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/beer/beers.csv` -- `/anvil/projects/tdm/data/beer/reviews_sample.csv` - -== Questions - -The "tidyverse" consists of a variety of packages, including, but not limited to: `ggplot2`, `dplyr`, `tidyr`, `readr`, `magrittr`, `purrr`, `tibble`, `stringr`, and `lubridate`. - -One of the underlying premises of the tidyverse is getting the data to be tidy. You can read a lot more about this in Hadley Wickham's book, https://r4ds.had.co.nz/[R for Data Science]. - -There is an excellent graphic https://r4ds.had.co.nz/introduction.html[here] that illustrates a general workflow for data science projects: - -. Import -. Tidy -. Iterate on, to gain understanding: -.. Transform -.. Visualize -.. Model -. Communicate - -This is a good general outline of how a project could be organized, but depending on the project or company, this could vary greatly and change as the goals of a project change. - -=== Question 1 - -[WARNING] -==== -You may want to use the `f2022-s2023-r` kernel instead of the `f2022-s2023` kernel. The `f2022-s2023-r` kernel will by default run `R` code instead of `Python` code, so instead of: - -[source,ipython] ----- -%%R - -library(tidyverse) ----- - -You could just do: - -[source,ipython] ----- -library(tidyverse) ----- -==== - -The first step in our workflow is to read the data. - -Read the datasets `beers.csv` and `reviews_sample.csv` using the `read_csv` function from `tidyverse` into tibbles called `beers` and `reviews`, respectively. - -[NOTE] -==== -"Tibble" are essentially the `tidyverse` equivalent to data.frames. They function slightly differently, but are so similar (today) that we won't go into detail until we need to. -==== - -In projects 10 and 11, we want to analyze and compare different beers. Note, that in `reviews` each row corresponds to a review by a certain user on a certain date. As reviews likely vary by individuals, we may want to summarize our `reviews` tibble. - -To do that, let's start by deciding how we are going to summarize the reviews. Start by picking one of the variables (columns) from the `reviews` dataset to be our "beer goodness indicator". For example, maybe you believe that the `taste` is important in beverages (seems reasonable). - -Now, determine a summary statistic that we will use to compare beers based on your beer goodnees indicator variable. Examples include `mean`, `median`, `std`, `max`, `min`, etc. Write 1-2 sentences describing why you chose the statistic you chose for your variable(s). 
You can use annectodal evidence (some reasoning why you think that summary statistics would be appropriate/useful here), or look at the distribution based on plots, or summary statistics to pick your preferred summary statistics for this case. - -[NOTE] -==== -If you are making a plot, please be sure to use the `ggplot` package. -==== - -[NOTE] -==== -If you wanted to have some fun, you could decide to combine different variables into a single one. For instance, maybe you want to take into consideration both `taste` and `smell`, but you want a smaller weight for `smell`. Then, you create a plot of `taste + .5*smell`, and you notice the data is skewed, so you decide to go with the `median`, namely, with `median(taste+.5*smell)`. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences describing what is your `beer_goodness_indicator` (variable and summary statistics), and why. -==== - -=== Question 2 - -Now that we have decided how to compare beers, let's create a new variable called `beer_goodness_indicator` in the reviews dataset. For each `beer_id`, https://dplyr.tidyverse.org/reference/summarise.html?q=summarize#ref-usage[`summarize`] the `reviews` data to get a single `beer_goodness_indicator` based on your answer from question 1. Call this summarized dataset `reviews_summary`. - -[TIP] -==== -`reviews_summary` should be 41822x2 (rows x columns). -==== - -[TIP] -==== -`summarize` is good when you want to keep your data grouped -- it will result in a data.frame with a different number of rows and columns. `mutate` is very similar except it will maintain the original columns, add a new column where the grouped/summarized values are repeated based on the variable the data was grouped by. This may be confusing, but run the following two examples and this will be made clear. - -[source,r] ----- -mtcars %>% - group_by(cyl) %>% - summarize(mpg_mean = mean(mpg)) ----- - -[source,r] ----- -mtcars %>% - group_by(cyl) %>% - mutate(mpg_mean = mean(mpg)) ----- -==== - -[TIP] -==== -You may be wondering what the heck the `%>%` part of the code from the previous tip is. These are pipes from the `magrittr` package. This is used to together functions. For example, `group_by` and `summarize` are two functions that can be chained together. You are passing the output from the previous function as the input to the next function. You'll find this is a very clean and convenient way to express a lot of very common data wrangling tasks! - -It could be as simple as getting the `head` of a dataframe. - -[source,r] ----- -head(mtcars) ----- - -You could instead use pipes: - -[source,r] ----- -mtcars %>% - head() ----- - -Why? This second version is arguably easier to read, and it is easier to edit. You could easily want to add a column to the dataframe first. - -[source,r] ----- -mtcars %>% - mutate(my_new_column = mean(cyl)) %>% - head() ----- - -Now, if we had the non-piped version it would be something like: - -[source,r] ----- -mtcars <- mtcars %>% - mutate(my_new_column = mean(cyl)) - -head(mtcars) ----- - -Or an even better example would be: - -[source,r] ----- -mtcars %>% - round() %>% - head() ----- - -Versus: - -[source,r] ----- -head(round(mtcars)) ----- -==== - -[TIP] -==== -`mutate` in particular is extremely useful. Try to perform the same operation using `pandas` and you will quickly realize how _nice_ some of the `tidyverse` functionality is. -==== - -.Items to submit -==== -- Code used to solve this problem. 
-- Output from running the code. -- Head of `reviews_summary` dataset. -==== - -=== Question 3 - -Let's combine our `beers` dataset with `reviews_summary` into a new dataset called `beers_reviews` that contains only beers that appears in *both* datasets. Use the appropriate https://dplyr.tidyverse.org/articles/two-table.html?q=left_join#types-of-join[`join`] function from `tidyverse` (`inner_join`, `left_join`, `right_join`, or `full_join`) to solve this problem. Since you saw some examples using pipes in the previous question (`%>%`) -- use pipes from here on out. - -[TIP] -==== -https://dplyr.tidyverse.org/reference/mutate-joins.html[This webpage] is a great website for learning about the different 'join' functions in the tidyverse! -==== - -What are the dimensions of the resulting `beers_reviews` dataset? How many beers did _not_ appear in both datasets? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Result of running `dim(beers_reviews)` -==== - -=== Question 4 - -Ok, now we have the dataset ready to analyze! For beers that are available during the entire year (see `availability`), is there a difference between `retired` and not retired beers in terms of `beer_goodness_indicator`? - -1. Start by subsetting the dataset using https://dplyr.tidyverse.org/reference/filter.html[`filter`]. -2. Create some data-driven method to answer this question. You can make a plot, get summary statistics (average `beer_goodness_indicator`, table comparing # of beers with `beer_goodness_indicator` > 4 for each category, etc). You can use multiple methods to answer this question! Have fun! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences answering the comparing `retired` and not retired beers in terms of `beer_goodness_indicator` based on your chosen method(s). Did the results surprise you? -- 1-2 sentences explaining what data-driven method(s) you decided to use and why. -==== - -=== Question 5 - -Let's compare different styles of beer based on our `beer_goodness_indicator` average. Create a Cleveland dotplot (using `ggplot`) comparing the average `beer_goodness_indicator` for each style in `beers_reviews`. Make sure to use the `tidyverse` functions to answer this question and to use `ggplot`. - -[TIP] -==== -The code below creates a Cleveland dotplot comparing `Sepal.Length` variation per `Species` using the `iris` dataset. - -[source,r] ----- -iris %>% - group_by(Species) %>% - summarize(petal_length_var = sd(Petal.Length)) %>% - arrange(desc(petal_length_var)) %>% -ggplot() + - geom_point(aes(x = Species, y = petal_length_var)) + - coord_flip() + - theme_classic() + - labs(x = "Petal length variation") ----- -==== - -[TIP] -==== -You can use the function https://dplyr.tidyverse.org/reference/top_n.html?q=top_n#null[`top_n(x)`] in combination with https://dplyr.tidyverse.org/articles/grouping.html?q=arrange#arrange[`arrange`] to subset to show only the top x styles. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. 
- -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project11.adoc b/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project11.adoc deleted file mode 100644 index 8ebd1306b..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project11.adoc +++ /dev/null @@ -1,177 +0,0 @@ -= TDM 20200: Project 11 -- 2023 - -**Motivation:** Data wrangling is the process of gathering, cleaning, structuring, and transforming data. Data wrangling is a big part in any data driven project, and sometimes can take a great deal of time. `tidyverse` is a great, but opinionated, suite of integrated packages to wrangle, tidy and visualize data. It is useful to gain some familiarity with this collection of packages, in case you run into a situation where these packages are needed -- you may even find that you enjoy using them! - -**Context:** We have covered a few topics on the `tidyverse` packages, but there is a lot more to learn! We will continue our strong focus on the tidyverse (including ggplot) and data wrangling tasks. This is the second in a series of 5 projects focused around using `tidyverse` packages to solve data-driven problems. - -**Scope:** R, tidyverse, ggplot - -.Learning Objectives -**** -- Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems. -- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows. -- Group data and calculate aggregated statistics using group_by, mutate, summarize, and transform functions. -- Demonstrate the ability to create basic graphs with default settings, in ggplot. -- Demonstrate the ability to modify axes labels and titles. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/beer/beers.csv` -- `/anvil/projects/tdm/data/beer/reviews_sample.csv` - -== Questions - -=== Question 1 - -Let's pick up where we left in the previous project. Copy and paste your commands from questions 1 to 3 that result in our `beers_reviews` dataset. - -Using the pipelines (remember, the `%>%`), combine the necessary parts of questions 2 and 3, removing the need to have an intermediate `reviews_summary` dataset. This is a great way to practice and get a better understanding of `tidyverse`. - -Your code should read the datasets, summarize the reviews data similarly to what you did in question 2, and combine the summarized dataset with the `beers` dataset. This should all be accomplished from a single chunk of "piped-together" code. - -[TIP] -==== -Feel free to remove the `reviews` dataset after we have the `beers_reviews` dataset. - -[source,r] ----- -rm(reviews) ----- -==== - -[TIP] -==== -If you want to update how you calculated your `beer_goodness_indicator` from the previous project, this would be a great time to do so! -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Are there any differences in terms of `abv` between beers that are available in specific seasons? - -[NOTE] -==== -ABV refers to the alcohol by volume of a beer. The higher the ABV, the more alcohol is in the beer. -==== - -1. 
Filter the `beers_reviews` dataset to contain beers available only in a specific season (`Fall`, `Winter`, `Spring`, `Summer`). -+ -[TIP] -==== -Only click below if you are stuck! - -https://dplyr.tidyverse.org/reference/filter.html[This] function will help you do this operation. -==== -+ -2. Make a side-by-side boxplot comparing `abv` for each season `availability`. -+ -[TIP] -==== -Only click below if you are stuck! - -https://ggplot2.tidyverse.org/reference/geom_boxplot.html[This] function will help you do this operation. -==== -+ -3. Make sure to use the `labs` function to have nice x-axis label and y-axis label. -+ -[TIP] -==== -https://ggplot2.tidyverse.org/reference/labs.html?q=labs#null[This] is more information on `labs`. -==== - -Use pipelines, resulting in a single chunk of "piped-together" code. - -[TIP] -==== -Use the `fill` argument to https://ggplot2.tidyverse.org/reference/geom_boxplot.html[this] function to color your boxplots differently for each season. -==== - -Write 1-2 sentences comparing the beers in terms of `abv` between the specific seasons. Are the results surprising or did you expect them? - -[TIP] -==== -The `aes` part of many `ggplot` plots may be confusing at first. In a nutshell, `aes` is used to match x-axis and y-axis values to columns of data in the given dataset. You should read https://ggplot2.tidyverse.org/reference/aes.html[this] documentation and the examples carefully to better understand. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences comparing the beers in terms of `abv` between the specific seasons. Are the results surprising or did you expect them? -==== - -=== Question 3 - -Modify your code from question 2 to: - -1. Create a new variable `is_good` that is 1 or TRUE if `beer_goodness_indicator` is greater than 3.5, and 0 or FALSE otherwise. -2. _Facet_ your boxplot based on `is_good`. The resulting graphic should make it easy to compare the "good" vs "bad" beers for each season. -+ -[TIP] -==== -https://ggplot2.tidyverse.org/reference/facet_grid.html[`facet_grid`] and https://ggplot2.tidyverse.org/reference/facet_wrap.html[`facet_wrap`] are two other functions that can be a bit confusing at first. With that being said, they are incredible powerful and make creating really impressive graphics very straightforward. -==== - -[IMPORTANT] -==== -Make sure to use piping `%>%` as well as layers (`+`) to create your final `ggplot` plot, using a single chunk of piped/layered code. -==== - -How do beers differ in terms of ABV and being considered good or not (based on our definition) for the different seasons? Write 1-2 sentences describing what you see based on the plots. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 1-2 sentences answering the question. -==== - -=== Question 4 - -Modify your code from question 3 to answer the question based on summary statistics instead of graphical displays. - -Make sure you compare the ABV per season `availability` and `is_good` using `mean`, `median` and `sd`. Your final dataframe should have 8 rows and the following columns: `is_good`, `availability`, `mean_abv`, `median_abv`, `std_abv`. 
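-
-[TIP]
-====
-If it helps, here is a minimal sketch (using the built-in `mtcars` dataset, not the beer data) of the general pattern of grouping by two columns and then computing several summary statistics within each group.
-
-[source,r]
-----
-library(tidyverse)
-
-# illustration on mtcars, not the beer data:
-# group by two columns, then compute mean, median, and sd per group
-mtcars %>%
-  group_by(cyl, am) %>%
-  summarize(
-    mean_mpg = mean(mpg),
-    median_mpg = median(mpg),
-    sd_mpg = sd(mpg)
-  )
-----
-====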
- -[TIP] -==== -The following function will be useful for this question: https://dplyr.tidyverse.org/reference/filter.html[`filter`], https://dplyr.tidyverse.org/reference/mutate.html[`mutate`], https://dplyr.tidyverse.org/reference/group_by.html[`group_by`], https://dplyr.tidyverse.org/reference/summarise.html[`summarize`] (within summarize: `mean`, `median`, `sd`). -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -In this question, we want to make comparison in terms of `ABV` and `beer_goodness_indicator` for US states. - -Feel free to use whichever data-driven method you desire to answer this question! You can take summary statistics, make a variety of plots, and even filter to compare specific US states -- you can even create new columns combining states (based on region, political affiliation, etc). - -Write a question related to US states, ABV and our "beer_goodness_indicator". Use your data-driven method(s) to answer it (if only anecdotally). - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Write 1-2 sentences explaining your question and data-driven method(s). -- Write 1-2 sentences answering your question. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project12.adoc b/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project12.adoc deleted file mode 100644 index 957680e21..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project12.adoc +++ /dev/null @@ -1,185 +0,0 @@ -= TDM 20200: Project 12 -- 2023 - -**Motivation:** As we mentioned before, data wrangling is a big part in any data driven project. https://www.amazon.com/Exploratory-Data-Mining-Cleaning/dp/0471268518["Data Scientists spend up to 80% of the time on data cleaning and 20 percent of their time on actual data analysis."] Therefore, it is worth to spend some time mastering how to best tidy up our data. - -**Context:** We are continuing to practice using various `tidyverse` packages, in order to wrangle data. - -**Scope:** R, tidyverse, ggplot - -.Learning Objectives -**** -- Use `mutate`, `pivot`, `unite`, `filter`, and `arrange` to wrangle data and solve data-driven problems. -- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows. -- Group data and calculate aggregated statistics using `group_by`, `mutate`, `summarize`, and `transform` functions. -- Demonstrate the ability to create basic graphs with default settings, in `ggplot`. -- Demonstrate the ability to modify axes labels and titles. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/consumer_complaints/complaints.csv`
-
-== Questions
-
-=== Question 1
-
-As usual, we will start by loading the `tidyverse` package and reading (using the `read_csv` function) the processed data `complaints.csv` into a tibble called `dat`.
-
-The first step in many projects is to define the problem statement and goal. That is, to understand why the project is important and what our desired deliverables are. For this project and the next, we will assume that we want to improve consumer satisfaction, and to achieve that we will provide consumers with data-driven tips and plots to make informed decisions.
-
-What is the type of data in the `date_sent_to_company` and `date_received` columns? Do you think that these columns are in a good format to calculate the wait time between receiving and sending the complaint to the company? No need to overthink anything -- from your perspective, how simple or complicated would the steps be to calculate the number of days between when a complaint was received and when it was sent to the company, if the data remained in its current format?
-
-[TIP]
-====
-The `glimpse` function is a good function to get a sample of many columns of data at once. Although it may be more difficult to read at first (than, say, `head`), since it lists a single row for each column, it is better when you have many columns in a dataset.
-
-Also, try to keep using the pipes (`%>%`) for the entire project.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1 sentence giving the type of the `date_sent_to_company` and `date_received` columns.
-- 1-2 sentences commenting on whether you think it would be easy to calculate the wait time between receiving and sending the complaint to the company if the type of data remained as is.
-====
-
-=== Question 2
-
-Tidyverse has a few "data type"-specific packages. For example, the package `lubridate` is fantastic when dealing with dates, and the package `stringr` was made to help when dealing with text (`chr`). Although `lubridate` is a part of the `tidyverse` universe, it does not get loaded when we run `library(tidyverse)`.
-
-Begin this question by loading the `lubridate` package. Use the appropriate function to convert the columns referring to dates to a date format.
-
-[TIP]
-====
-There are lots of really great (and "official") cheat sheets for `tidyverse` packages. I find these immensely useful and almost always pull the cheat sheet up when I use a `tidyverse` package.
-
-https://www.rstudio.com/resources/cheatsheets/[Here] are the cheat sheets.
-
-https://raw.githubusercontent.com/rstudio/cheatsheets/main/lubridate.pdf[Here] is the `lubridate` cheat sheet, which I think is particularly good.
-====
-
-Try to solve this question using the `mutate_at` function.
-
-[IMPORTANT]
-====
-You will notice a pattern within the `tidyverse` of functions named `*_at`, `*_if`, and `*_all`. For example, for the `mutate` and `summarize` functions there are versions like `mutate_at` or `summarize_if`, etc. These variants of the functions are useful for applying the functions to relevant columns without having to specify individual columns by name.
-====
-
-[TIP]
-====
-Take a look at the functions: `ydm`, `ymd`, `dym`, `mdy`, `myd`, `dmy`.
-====
-
-[TIP]
-====
-If you are using `mutate_at`, run the following commands and see what happens.
-
-[source,r]
----
-dat %>%
-    select(contains('product')) %>%
-    head()
-
-dat %>%
-    summarize_at(vars(contains('product')), function(x) length(unique(x)))
-
-dat %>%
-    group_by(product) %>%
-    summarize_if(is.numeric, mean)
----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Result from running `glimpse(dat)`.
-====
-
-=== Question 3
-
-Add a new column called `wait_time` that is the difference between `date_sent_to_company` and `date_received` in days.
-
-[TIP]
-====
-You may want to use the argument `units` in the `difftime` function.
-====
-
-[TIP]
-====
-Remember that `mutate` is the function you want to use to add a new column based on other columns.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 4
-
-Frequently, we need to perform many data wrangling tasks: changing data types, performing operations more easily, summarizing the data, and pivoting to enable plotting certain types of plots.
-
-So far, we have spent most of the project doing just that! Now that we have the `wait_time` in a format that allows us to plot it, let's start creating some tips and plots to increase consumer satisfaction. One thing we may want to do is give the user information on the `wait_time` based on how the complaint was submitted (`submitted_via`). Compare `wait_time` values by how they are submitted.
-
-[NOTE]
-====
-Keep in mind that we want to present this information in a way that would be helpful to consumers. For example, if you summarized the data, you could present the information as a tip and include the summarized `wait_time` values for the fastest and slowest methods. If you are making a plot and the plot has tons of outliers, maybe we want to consider cutting our axis (or filtering the data) to include just certain values.
-====
-
-Be sure to explain your reasoning for each step of your analysis. If you are summarizing, why did you pick this method, and why are you summarizing the way you are (for example, are you using the average time, the median time, the maximum time, or `mean(wait_time) + 3*sd(wait_time)`)? You may also want to create 3 categories of `wait_time` (small, medium, high) and do a `table` between the categorical wait time and submission types. Why are you presenting the information the way you are?
-
-[NOTE]
-====
-Figuring out how to present the information to help someone make a decision is an important step in any project! You may very well be presenting to someone that is not as familiar with data science/statistics/computer science as you are.
-====
-
-[TIP]
-====
-If you are creating categorical wait time, take a look at the https://dplyr.tidyverse.org/reference/case_when.html[`case_when`] function.
-====
-
-[TIP]
-====
-One example could be:
-
----
-The plot below shows the average time it takes for the company to receive your complaint after you sent it, based on _how_ you sent it. Note that, on average, it takes XX days to get a response if you submitted via YY. Alternatively, it takes, on average, twice as long to receive a response if you submit a complaint via ZZ. Be sure to keep this in mind when submitting a complaint.
----
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentences explaining your reasoning for how you presented the information.
-- Information for the consumer to make a decision (plot, tip, etc).
-====
-
-=== Question 5
-
-Note that we have a column called `timely_response` in our `dat`. It may or may not (in reality) _be_ related to `wait_time`, however, we would expect it to be. What would you expect to see? Compare `wait_time` to `timely_response` using any technique you'd like. You can use the same idea/technique from question 4, or you can pick something entirely different. It is completely up to you!
-
-Would this information be relevant to include in a tip or dashboard for a consumer to make their decision? Why or why not? Would you combine this information with the information for `wait_time`? If so, how?
-
-Sometimes there are many ways to present similar pieces of information, and we must decide what we believe makes the most sense, and what will be most helpful when making a decision.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentences comparing `wait_time` for timely and not timely responses.
-- 1-2 sentences explaining whether you would include this information for consumers, and why or why not. If so, how would you include it?
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project13.adoc b/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project13.adoc
deleted file mode 100644
index 4a2e081de..000000000
--- a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project13.adoc
+++ /dev/null
@@ -1,129 +0,0 @@
-= TDM 20200: Project 13 -- 2023
-
-**Motivation:** Data wrangling tasks can vary between projects. Examples include joining multiple data sources, removing data that is irrelevant to the project, handling outliers, etc. Although we've practiced some of these skills, it is always worth spending some extra time to master tidying up our data.
-
-**Context:** We will continue to gain familiarity with the `tidyverse` suite of packages (including `ggplot`), and data wrangling tasks.
-
-**Scope:** R, tidyverse, ggplot
-
-.Learning Objectives
-****
-- Use `mutate`, `pivot`, `unite`, `filter`, and `arrange` to wrangle data and solve data-driven problems.
-- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows.
-- Group data and calculate aggregated statistics using `group_by`, `mutate`, `summarize`, and `transform` functions.
-- Demonstrate the ability to create basic graphs with default settings, in `ggplot`.
-- Demonstrate the ability to modify axes labels and titles.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/consumer_complaints/complaints.csv`
-
-== Questions
-
-=== Question 1
-
-Just like in Project 12, we will start by loading the `tidyverse` package and reading (using the `read_csv` function) the processed data `complaints.csv` into a tibble called `dat`.
Make sure to also load the `lubridate` package.
-
-In project 12, we stated that we want to improve consumer satisfaction, and to achieve that we will provide them with data-driven tips and plots to make informed decisions.
-
-We started by providing some information on wait time, `timely_response`, and its association with how the complaint was submitted (`submitted_via`).
-
-Let's continue our exploration of this dataset to provide valuable information to clients. Create a new column called `day_sent_to_company` that contains the day of the week the complaint was sent to the company (`date_sent_to_company`). To create the new column, use some of your code from project 12 that changes `date_sent_to_company` to the correct format, pipes (`%>%`), and the appropriate function from `lubridate`.
-
-[NOTE]
-====
-Some students asked about whether or not you _need_ to use the pipes (`%>%`). The answer is no! Of course, you are free to use them if you'd like. I _think_ that with a little practice, it will become more natural. "Tidyverse code" tends to look a lot like:
-
-[source,r]
----
-dat %>%
-    filter(...) %>%
-    mutate(...) %>%
-    summarize(...) %>%
-    ggplot() +
-    geom_point(...) +
-    ...
----
-
-Some people like it, some don't. You can draw your own conclusions, but I'd give all common methods a shot to see what you prefer.
-====
-
-Also, try to keep using the pipes (`%>%`) for the entire project (again, you don't _need_ to, but we'd encourage you to try). Your code for question one should look something like this:
-
-[source,r]
----
-dat <- dat %>%
-    insert_correct_function_here(
-        insert_code_to_change_weekday_complaint_sent_to_proper_format_here,
-        insert_code_to_get_day_of_the_week_here
-    )
----
-
-[TIP]
-====
-You may want to use the argument `label` from the `wday` function.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Before we continue with all of the fun, let's do some sanity checks on the column we just created. Sanity checks are an important part of any data science project, and should be performed regularly.
-
-Use your critical thinking to perform at least two sanity checks on the new column `day_sent_to_company`. The idea is to take a quick look at the data from this column and check if it seems to make sense to us. Sometimes we know the exact values we should get and that helps us be even more certain. Sometimes those sanity checks are not as foolproof, and are just ways to get a feel for the data and make sure nothing is looking weird right away.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentences explaining what you used to perform sanity checks, and what your results are. Do you feel comfortable moving forward with this new column?
-====
-
-=== Question 3
-
-Using your code from questions 1 and 2, create another new column called `day_received` that is the day of the week the complaint was received. Use sanity checks to double check that everything seems to be in order.
-
-Let's use these new columns and make some additional recommendations to our consumers! Using at least one of the columns `day_received` and `day_sent_to_company` with the rest of the data to see whether the consumer disputed the result (`consumer_disputed`), create a tip or a plot to help consumers make decisions.
-
-[NOTE]
-====
-Note that the column `consumer_disputed` is a character column, so make sure you take that into consideration.
Depending on how you want to summarize and/or present the information, you may need to modify this format, or use a different function to get that information.
-====
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- Recommendation for the consumer in the form of a chart with a legend and/or a tip.
-====
-
-=== Question 4
-
-Looking at the columns we have in the dataset, come up with a question whose answer can be used to help consumers make decisions. It is ok if the answer to your question doesn't provide the most insightful information -- for instance, finding out two variables are not correlated can still be valuable information!
-
-Use your skills to answer the question. Transform your answer into a "tip" with an accompanying plot.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-- 1-2 sentences with your question.
-- Answer to your question.
-- Recommendation to the consumer via tip and plot.
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project14.adoc b/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project14.adoc
deleted file mode 100644
index 55d045256..000000000
--- a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-project14.adoc
+++ /dev/null
@@ -1,148 +0,0 @@
-= TDM 20200: Project 14 -- 2023
-
-**Motivation:** Rearranging data to and from "long" and "wide" formats _sounds_ like a difficult task, however, `tidyverse` has a variety of functions that make it easy.
-
-**Context:** This is the last project for the course. This project has a focus on how data can change when grouped differently, and using the `pivot` functions.
-
-**Scope:** R, tidyverse, ggplot
-
-.Learning Objectives
-****
-- Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems.
-- Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows.
-- Group data and calculate aggregated statistics using group_by, mutate, summarize, and transform functions.
-- Demonstrate the ability to create basic graphs with default settings, in ggplot.
-- Demonstrate the ability to modify axes labels and titles.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/death_records/DeathRecords.csv`
-
-== Questions
-
-=== Question 1
-
-Calculate the average age of death for each of the `MaritalStatus` values and create a bar plot using `ggplot` and `geom_col`.
-
-.Items to submit
-====
-- Code used to solve this problem.
-- Output from running the code.
-====
-
-=== Question 2
-
-Now, let's further group our data by `Sex` to see how the patterns change (if at all). Create a side-by-side bar plot where `Sex` is shown for each of the 5 `MaritalStatus` values.
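-
-[TIP]
-====
-If you are unsure where to start, one possible approach (just a sketch, not the only correct answer -- the exact grouping, labels, and styling are up to you) is to compute the average age of death by `MaritalStatus` and `Sex`, then map `Sex` to the `fill` aesthetic and dodge the bars:
-
-[source,r]
----
-dat %>%
-    group_by(MaritalStatus, Sex) %>%
-    summarize(age_of_death = mean(Age)) %>%
-    ggplot(aes(x = MaritalStatus, y = age_of_death, fill = Sex)) +
-    geom_col(position = "dodge") +
-    labs(x = "Marital status", y = "Average age of death")
----
-====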
- -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -In the previous question, before you piped the data into `ggplot` functions, you likely used `group_by` and `summarize`. Take, for example, the following. - -[source,r] ----- -dat %>% - group_by(MaritalStatus, Sex) %>% - summarize(age_of_death=mean(Age)) ----- - -.output ----- -MaritalStatus Sex age_of_death - -D F 70.34766 -D M 65.60564 -M F 69.81002 -M M 73.05787 -S F 56.83075 -S M 49.12891 -U F 80.80274 -U M 80.27476 -W F 85.69817 -W M 83.98783 ----- - -Is this data "long" or "wide"? - -There are multiple ways we could make this data "wider". Let's say, for example, we want to rearrange the data so that we have the `MaritalStatus` column, a `M` column, and `F` column. The `M` column contains the average age of death for males and the `F` column the same for females. While this may sound complicated to do, `pivot_wider` makes this very easy. - -Use `pivot_wider` to rearrange the data as described. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Create a ggplot plot for each month. Each plot should be a barplot with the `as.factor(DayOfWeekOfDeath)` on the x-axis and the count on the y-axis. The code below provides some structure to help get you started. - -[source,r] ----- -g <- list() # to save plots to -for (i in 1:12) { - g[[i]] <- dat %>% - filter(...) %>% - ggplot() + - geom_bar(...) -} - -library(patchwork) -library(repr) - -# change plot size to 12 by 12 -options(repr.plot.width=12, repr.plot.height=12) - -# use patchwork to display all plots in a grid -# https://cran.r-project.org/web/packages/patchwork/vignettes/patchwork.html ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Question 4 is a bit tedious. `tidyverse` provides a _much_ more ergonomic way to create plots like this. Use https://ggplot2.tidyverse.org/reference/facet_wrap.html[`facet_wrap`] to create the same plot. - -[TIP] -==== -You do _not_ need to use a loop to solve this problem anymore. In face, you only need to add 1 more line of code to this part. - -[source,r] ----- -dat %>% - filter(....) %>% - ggplot() + - geom_bar(...) + - # new stuff here ----- -==== - -Are there any patterns in the data that you find interesting? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-projects.adoc b/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-projects.adoc deleted file mode 100644 index 3f76771e6..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/20200/20200-2023-projects.adoc +++ /dev/null @@ -1,47 +0,0 @@ -= TDM 20200 - -== Project links - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. 
While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$20200-2023-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:55pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -[NOTE] -==== -Piazza links remain the same from Fall 2022 to Spring 2023. -==== - -=== Sign up - -https://piazza.com/purdue/fall2022/tdm20100[https://piazza.com/purdue/fall2022/tdm20100] - -=== Link - -https://piazza.com/purdue/fall2022/tdm20100/home[https://piazza.com/purdue/fall2022/tdm20100/home] - - -== Syllabus - -Navigate to the xref:spring2023//logistics/syllabus.adoc[syllabus]. diff --git a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project01.adoc b/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project01.adoc deleted file mode 100644 index a5c05ceae..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project01.adoc +++ /dev/null @@ -1,151 +0,0 @@ -= TDM 30200: Project 1 -- 2023 - -**Motivation:** Welcome back! This semester _should_ be a bit more straightforward than last semester in many ways. In the first project back, we will do a bit of UNIX review, a bit of Python review, and I'll ask you to learn and write about some terminology. - -**Context:** This is the first project of the semester! We will be taking it easy and _slowly_ getting back to it. - -**Scope:** UNIX, Python - -.Learning Objectives -**** -- Differentiate between concurrency and parallelism at a high level. -- Differentiate between synchronous and asynchronous. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data` - -== Questions - -=== Question 1 - -Google the difference between synchronous and asynchronous -- there is a _lot_ of information online about this. - -Explain what the following tasks are (in day-to-day usage) and why: asynchronous, or synchronous. - -- Communicating via email. -- Watching a live lecture. -- Watching a lecture that is recorded. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Given the following scenario and rules, explain the synchronous and asynchronous ways of completing the task. - -You have 2 reports to write, and 2 wooden pencils. 1 sharpened pencil will write 1/2 of 1 report. 
You have a helper that is willing to sharpen 1 pencil at a time, for you, and that helper is able to sharpen a pencil in the time it takes to write 1/2 of 1 report. - -[IMPORTANT] -==== -Please assume you start with 2 sharpened pencils. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Write Python code that simulates the scenario in question (2) that is synchronous. Make the time it takes to sharpen a pencil be 2 seconds. Make the time it takes to write .5 reports 5 seconds. - -[TIP] -==== -Use `time.sleep` to accomplish this. -==== - -How much time does it take to write the reports in theory? - -[IMPORTANT] -==== -Here is some skeleton code to get you started. - -[source,python] ----- -def sharpen_pencil(pencil: dict) -> dict: - if pencil['is_sharp']: - return pencil - else: - time.sleep(2) - pencil['is_sharp'] = True - return pencil - -def write_essays(number_essays: int, pencils: List[dict]): - # fill in here - # make sure both pencils are sharpened and sharpen both otherwise - - # write half the essay - - # dull first pencil - - # write the other half essay - - # dull second pencil - pass - -def simulate_story(): - pencils = [{'name': 'pencil1', 'is_sharp': True}, {'name': 'pencil1', 'is_sharp': True}] - write_essays(2, pencils) ----- - -[source,ipython] ----- -%%time - -simulate_story() ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Read https://stackoverflow.com/questions/50757497/simplest-async-await-example-possible-in-python[the StackOverflow post] and think about the scenario in question (2) that is asynchronous. Assume the time it takes to sharpen a pencil is 2 seconds and the time it takes to write .5 reports is 5 seconds. - -How much time does it take to write the reports in theory, if you use the asynchronous method? Explain. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -In your own words, describe the difference between concurrency and parallelism. Then, look at the flights datasets here: `/anvil/projects/tdm/data/flights/subset`. Describe an operation that you could do to the entire dataset as a whole. Describe how you (in theory) could parallelize that process. - -Now, assume that you had the entire frontend system at your disposal. Use a UNIX command to find out how many cores the frontend has. If processing 1 file took 10 seconds to do. How many seconds would it take to process all of the files? Now, approximately how many seconds would it take to process all the files if you had the ability to parallelize on this system? - -Don't worry about overhead or the like. Just think at a very high level. - -[TIP] -==== -Best make sure this sounds like a task you'd actually like to do -- I _may_ be asking you to do it in the not-too-distant future. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project02.adoc b/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project02.adoc
deleted file mode 100644
index e55f49f31..000000000
--- a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project02.adoc
+++ /dev/null
@@ -1,591 +0,0 @@
-= TDM 30200: Project 2 -- 2023
-
-**Motivation:** In the previous project, we very slowly started to learn about asynchronous vs synchronous programming. Mostly, you just had to describe scenarios, whether they were synchronous or asynchronous, and you had to explain things at a high level. In this project, we will dig into some asynchronous code, and learn the very basics.
-
-**Context:** This is the first project in a series of three projects that explore sync vs. async, parallelism, and concurrency.
-
-**Scope:** Python, coroutines, tasks
-
-.Learning Objectives
-****
-- Understand the difference between synchronous and asynchronous programming.
-- Identify, create, and await coroutines.
-- Properly identify the order in which asynchronous code is executed.
-- Utilize 1 or more synchronizing primitives to ensure that asynchronous code is executed in the correct order.
-****
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-The following questions will use the following dataset(s):
-
-- `/anvil/projects/tdm/data/flights/subset/*`
-
-== Questions
-
-=== Question 1
-
-In the previous project, I gave you the following scenario.
-
-[quote, , the-examples-book.com]
-____
-You have 2 reports to write, and 2 wooden pencils. 1 sharpened pencil will write 1/2 of 1 report. You have a helper that is willing to sharpen 1 pencil at a time, for you, and that helper is able to sharpen a pencil in the time it takes to write 1/2 of 1 report. You can assume that you start with 2 sharpened pencils.
-____
-
-In this _asynchronous_ example, the author could start with the first sharpened pencil and write 1/2 of the first report in 5 seconds. Next, hand the first pencil off to the assistant to sharpen it. While that is happening, use the second pencil to write the second half of the first report. Next, receive the first (now sharpened) pencil back from the assistant and hand the second pencil to the assistant to be sharpened. While the assistant sharpens the second pencil, you would write the first half of the second report. The assistant would return the (now sharpened) second pencil back to you to finish the second report. This process would (in theory) take 20 seconds, as the assistant would be sharpening pencils at the same time you are writing the report. In effect, you could exclude the 4 seconds it takes to sharpen both pencils once from our synchronous solution of 24 seconds.
-
-In this project we will examine how to write asynchronous code that simulates the scenario, in a variety of ways that will build up the fundamentals of asynchronous programming. At the end of the project, you will write your own asynchronous code that will speed up a web scraping task. Let's get started!
-
-[WARNING]
-====
-Jupyter Lab has its own event loop already running, which causes problems if you were to try to run your own event loop. Async code uses an event loop, so this poses a problem. To get by this, we can use a package that automatically _nests_ our event loops, so things work _mostly_ as we would expect.
- -[source,python] ----- -import asyncio -import nest_asyncio -nest_asyncio.apply() - -asyncio.run(simulate_story()) ----- -==== - -Fill in the skeleton code below to simulate the scenario. Use **only** the provided functions, `sharpen_pencil`, and `write_half_report`, and the `await` keyword. - -[source,python] ----- -async def sharpen_pencil(): - await asyncio.sleep(2) - -async def write_half_report(): - await asyncio.sleep(5) - -async def simulate_story(): - # Write first half of report with first pencil - - # Hand pencil off to assistant to sharpen - - # Write second half of report with second pencil - - # Hand second pencil back to assistant to sharpen - # take first (now sharpened) pencil back from assistant - - # Write the first half of second essay with first pencil - - # Take second (now sharpened) pencil back from assistant - # and write the second half of the second report ----- - -Run the simulation in a new cell as follows. - -[source,ipython] ----- -%%time - -import asyncio -import nest_asyncio - -nest_asyncio.apply() - -asyncio.run(simulate_story()) ----- - -How long does you code take to run? Does it take the expected 20 seconds? If you have an idea why or why not, please try to explain. Otherwise, just say "I don't know". - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -If you don't have any experience writing asynchronous code, this might be pretty confusing! That's okay, it is _much_ easier to get confused writing asynchronous code than it is to write synchronous code. In fact, it is safe to say that writing good parallel and/or asynchronous code is significantly more difficult than writing _non_ async/parallel code. - -Let's break it down. First, the `asyncio.run` function takes care of the bulk of the work. It starts the _event loop_, finalizes asynchronous generators, and closes the threadpool. All you need to take from it is "it takes care of a lot of ugly magic". - -Any function that starts with `async` is an asynchronous function. _Calling_ an async function produces a _coroutine_, nearly instantly. A coroutine is a function that has the ability to have its progress be paused and resumed at will. - -For example, if you called the following async function, it will not execute, but rather it will just create a coroutine object. - -[source,python] ----- -async def foo(): - await asyncio.sleep(5) - print("Hello") - -foo() ----- - -.Output ----- - ----- - -This result should be shown nearly instantly, the sleep code hasn't actually been run! In fact, if your async code runs way faster than expected, this may be a sign that you've forgotten to _await_ a coroutine. Don't worry too much about that for now. - -To actually run the coroutine, you need to call the `asyncio.run` function. - -[source,python] ----- -asyncio.run(foo()) ----- - -.Output ----- -Hello ----- - -Of course, it doesn't make sense to call `asyncio.run` for each and every coroutine you create. It makes more sense to spin up the event loop once and handle the processes while it is running. - -[source,ipython] ----- -%%time - -import asyncio -import nest_asyncio -nest_asyncio.apply() - -async def foo(): - await asyncio.sleep(5) - print("Hello") - -async def bar(): - await asyncio.sleep(2) - print("World") - -async def main(): - await foo() - await bar() - -asyncio.run(main()) ----- - -Run the code, what is the output? - -Let's take a step back. _Why_ is asynchronous code useful? What do our `asyncio.sleep` calls represent? 
One of the slowest parts of a program is waiting for I/O or input/output. It takes time to wait for the operating system and hardware. If you are doing a lot of I/O in your program, you could take advantage and perform other operations while waiting! In our example, this is what the `asyncio.sleep` calls _represent_ -- I/O! - -Any program where the IO speed limits the speed of the program is called _I/O Bound_. Any program where the program speed is limited by how fast the CPU can process the instructions is called _CPU Bound_. Async programming can drastically speed up _I/O Bound_ software! - -Okay, back to the code from above. What is the output? You may have expected `foo` to run, then, while `foo` is "doing some IO (sleeping)", `bar` will run. Then, in a total of 5 seconds, you may have expected "World Hello" to be printed. While the `foo` is sleeping, `bar` runs, gets done in 2 seconds, goes back to `foo` and finishes in another 3 seconds, right? Nope. - -What happens is that when we _await_ for `foo`, Python suspends the execution of `main` until `foo` is done. Then it resumes execution of `main` and suspends it again until `bar` is done for an approximate time of 7 seconds. We want both coroutines to run concurrently, not one at a time! How do we fix it? The easiest would be to use `asyncio.gather`. - -[source,python] ----- -%%time - -import asyncio -import nest_asyncio - -nest_asyncio.apply() - -async def foo(): - await asyncio.sleep(5) - print("Hello") - -async def bar(): - await asyncio.sleep(2) - print("World") - -async def main(): - await asyncio.gather(foo(), bar()) - -asyncio.run(main()) ----- - -`asyncio.gather` takes a list of awaitable objects (coroutines are awaitable objects) and runs them concurrently by scheduling them as a _task_. Running the code above should work as expected, and run in approximately 5 seconds. We gain 2 seconds in performance since both `foo` and `bar` run concurrently. While `foo` is sleeping, `bar` is running and completes. We gain 2 seconds while those functions overlap. - -What is a _task_? You can read about tasks https://docs.python.org/3/library/asyncio-task.html#asyncio.Task[here]. A task is an object that runs a coroutine. The easiest way to create a task is to use the `asyncio.create_task` method. For example, if instead of awaiting both `foo` and `bar`, we scheduled `foo` as a task, you would get _mostly_ the same result as if you used `asyncio.gather`. - -[source,python] ----- -%%time - -import asyncio -import nest_asyncio - -nest_asyncio.apply() - -async def foo(): - await asyncio.sleep(5) - print("Hello") - -async def bar(): - await asyncio.sleep(2) - print("World") - -async def main(): - asyncio.create_task(foo()) - await bar() - -asyncio.run(main()) ----- - -As you can see, "World" prints in a couple of seconds, and 3 seconds later "Hello" prints, for a total execution time of 5 seconds. With that being said, something is odd with our output. - -.Output ----- -World -CPU times: user 2.57 ms, sys: 1.06 ms, total: 3.63 ms -Wall time: 2 s -Hello ----- - -It says that it executed in 2 seconds, not 5. In addition, "Hello" prints _after_ Jupyter says our execution completed. Why? Well, if you read https://docs.python.org/3/library/asyncio-task.html#creating-tasks[here], you will see that `asyncio.create_task` takes a coroutine (in our case the output from `foo()`), and schedules it as a _task_ in the event loop returned by `asyncio.get_running_loop()`. 
This is the critical part -- it is scheduling the coroutine created by `foo()` to run on the same event loop that Jupyter Lab is running on, so even though our event loop created by `asyncio.run` stopped execution, `foo` ran until complete instead of cancelling as soon as `bar` was awaited! To observe this, open a terminal and run the following code to launch a Python interpreter: - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 -python3 ----- - -Then, in the Python interpreter, run the following. - -[NOTE] -==== -You may need to type it out manually. -==== - -[source,python] ----- -import asyncio - -async def foo(): - await asyncio.sleep(5) - print("Hello") - -async def bar(): - await asyncio.sleep(2) - print("World") - -async def main(): - asyncio.create_task(foo()) - await bar() - -asyncio.run(main()) ----- - -As you can see, the output is _not_ the same as when you run it from _within_ the Jupyter notebook. Instead of: - -.Output ----- -World -CPU times: user 2.57 ms, sys: 1.06 ms, total: 3.63 ms -Wall time: 2 s -Hello ----- - -You should get: - -.Output ----- -World ----- - -This is because this time, there is no confusion on which event loop to use when scheduling a task. Once we reach the end of `main`, the event loop is stopped and any tasks scheduled are terminated -- even if they haven't finished (like `foo`, in our example). If you wanted to modify `main` in order to wait for `foo` to complete, you could _await_ the task _after_ you await `bar()`. - -[IMPORTANT] -==== -Note that this will work: - -[source,python] ----- -async def main(): - task = asyncio.create_task(foo()) - await bar() - await task ----- - -But this, will not: - -[source,python] ----- -async def main(): - task = asyncio.create_task(foo()) - await task - await bar() ----- - -The reason is that as soon as you call `await task`, `main` is suspended until the task is complete, which prevents both coroutines from executing concurrently (and we miss out on our 2 second performance gain). If you wait to call `await task` _after_ `await bar()`, our task (`foo`) will continue to run concurrently as a task on our event loop along side `bar` (and we get our 2 second performance gain). In addition, `asyncio.run` will wait until `task` is finished before terminating execution, because we awaited it at the very end. -==== - -In the same way that `asyncio.create_task` schedules the coroutines as tasks on the event loop (immediately), so does `asyncio.gather`. In a previous example, we _awaited_ our call to `asyncio.gather`. - -[source,python] ----- -%%time - -import asyncio -import nest_asyncio - -nest_asyncio.apply() - -async def foo(): - await asyncio.sleep(5) - print("Hello") - -async def bar(): - await asyncio.sleep(2) - print("World") - -async def main(): - await asyncio.gather(foo(), bar()) - -asyncio.run(main()) ----- - -.Output ----- -World -Hello -CPU times: user 3.41 ms, sys: 1.96 ms, total: 5.37 ms -Wall time: 5.01 s ----- - -This is critical, otherwise, `main` would execute immediately and terminate before either `foo` or `bar` finished. 
- -[source,python] ----- -%%time - -import asyncio -import nest_asyncio - -nest_asyncio.apply() - -async def foo(): - await asyncio.sleep(5) - print("Hello") - -async def bar(): - await asyncio.sleep(2) - print("World") - -async def main(): - asyncio.gather(foo(), bar()) - -asyncio.run(main()) ----- - -.Output ----- -CPU times: user 432 µs, sys: 0 ns, total: 432 µs -Wall time: 443 µs -World -Hello ----- - -As you can see, since we did not await our `asyncio.gather` call, `main` ran and finished immediately. The only reason "World" and "Hello" printed is that they finished running on the event loop that Jupyter uses instead of the loop we created using our call to `asyncio.run`. If you were to run the code from a Python interpreter instead of from Jupyter Lab, neither "World" nor "Hello" would print. - -[CAUTION] -==== -I know this is a _lot_ to take in for a single question. If you aren't quite following at this point I'd highly encourage you to post questions in Piazza before continuing, or rereading things until it starts to make sense. -==== - -Modify your `simulate_story` function from question (1) so that `sharpen_pencil` runs concurrently with `write_quarter`, and the total execution time is about 20 seconds. - -[IMPORTANT] -==== -Some important notes to keep in mind: - -- Make sure that the "rules" are still followed. You can still only write 1 quarter of the report at a time. -- Make sure that your code awaits what needs to be awaited -- even if _technically_ those tasks would execute prior to `simulate_story` finishing. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -That last question was quite a bit to take in! It is ok if it hasn't all clicked! I'd encourage you to post questions in Piazza, and continue to mess around with simple async examples until it makes more sense. It will help us explain things better and improve things for the next group of students! - -There are a couple of straightforward ways you could solve the previous question (well technically there are even more). One way involves queuing up the `sharpen_pencil` coroutines as tasks that run concurrently, and awaiting them at the end. The other involves using `asyncio.gather` to queue up select `write_quarter` and `sharpen_pencil` tasks to run concurrently, and await them. - -While both of these methods do a great job simulating our simple story, there may be instances where a greater amount of control may be needed. In such circumstances, https://docs.python.org/3/library/asyncio-sync.html[the Python synchronization primitives] may be useful! - -Read about the https://docs.python.org/3/library/asyncio-sync.html#asyncio.Event[Event primitive], in particular. This primitive allows us to notify one or more async tasks that _something_ has happened. This is particularly useful if you want some async code to wait for other async code to run before continuing on. Cool, how does it work? Let's say I want to yell, but before I yell, I want the megaphone to be ready. - -First, create an event, that represents some event. - -[source,python] ----- -import asyncio - -async def yell(words, wait_for): - print(f"{words.upper()}") - -# create an event -megaphone_ready = asyncio.Event() ----- - -To wait to continue until the event has occurred, you just need to `await` the coroutine created by calling `my_event.wait()`. So in our case, we can add `my_event.wait()` before we yell in the `yell` function. 
- -[source,python] ----- -async def yell(words, wait_for): - await wait_for.wait() - print(f"{words.upper()}") ----- - -By default, our `Event` is set to `False` since the event has _not_ occurred. The `yell` task will continue to await our event until the event is marked as _set_. To mark our event as set, we would use the `set` method. - -[source,python] ----- -import asyncio - -async def yell(words, wait_for): - await wait_for.wait() - print(f"{words.upper()}") - -async def main(): - megaphone_ready = asyncio.Event() # by default, it is not ready - - # create our yell task. Remember, tasks are immediately scheduled - # on the event loop to run. At this point, the await wait_for.wait() - # part of our yell function will prevent the task from moving - # forward to the print statement until the event is set. - yell_task = asyncio.create_task(yell("Hello", megaphone_ready)) - - # let's say we have to dust off the megaphone for it to be ready - # and it takes 2 seconds to do so - await asyncio.sleep(2) - - # now, since we've dusted off the megaphone, we can mark it as ready - megaphone_ready.set() - - # at this point in time, the await wait_for.wait() part of our code - # from the yell function will be complete, and the yell function - # will move on to the print statement and actually yell - - # Finally, we want to await for our yell_task to finish - # if our yell_task wasn't a simple print statement, and tooks a few seconds - # to finish, this await would be necessary for the main function to run - # to completion. - await yell_task - -asyncio.run(main()) ----- - -Consider each of the following as a separate event: - -- Writing the first quarter of the report -- Writing the second quarter of the report -- Writing the third quarter of the report -- Writing the fourth quarter of the report -- Sharpening the first pencil -- Sharpening the second pencil - -Use the `Event` primitive to make our code run as intended, concurrently. Use the following code as a skeleton for your solution. Do **not** modify the code, just make additions. - -[source,python] ----- -%%time - -import asyncio -import nest_asyncio - -nest_asyncio.apply() - -async def write_quarter(current_event, events_to_wait_for = None): - # TODO: if events_to_wait_for is not None - # loop through the events and await them - - await asyncio.sleep(5) - - # TODO: at this point, the essay quarter has - # been written and we should mark the current - # event as set - - -async def sharpen_pencil(current_event, events_to_wait_for = None): - # TODO: if events_to_wait_for is not None - # loop through the events and await them - - await asyncio.sleep(2) - - # TODO: at this point, the essay quarter has - # been written and we should mark the current - # event as set - - -async def simulate_story(): - - # TODO: declare each of the 6 events in our story - - # TODO: add each function call to a list of tasks - # to be run concurrently. Should be something similar to - # tasks = [write_quarter(something, [something,]), ...] - tasks = [] - - await asyncio.gather(*tasks) - -asyncio.run(simulate_story()) ----- - -[TIP] -==== -The `current_event` is passed so we can mark it as set once the event has occurred. -==== - -[TIP] -==== -The `events_to_wait_for` is passed so we can await them before continuing. This ensures that we don't try and sharpen the first pencil until after we've written the first quarter of the essay. Or ensures that we don't write the third quarter of the essay until after the first pencil has been sharpened. 
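-
-As a generic illustration (shown in isolation, and not the full solution -- the helper name `wait_for_all` is made up just for this sketch), awaiting a list of `asyncio.Event` objects might look something like this:
-
-[source,python]
----
-import asyncio
-
-async def wait_for_all(events_to_wait_for=None):
-    # nothing to wait for? continue immediately
-    if events_to_wait_for is not None:
-        for event in events_to_wait_for:
-            # suspends here until this event has been marked as set
-            await event.wait()
----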
-==== - -[TIP] -==== -The code you will add to `write_quarter` will be identical to the code you will add to `sharpen_pencil`. -==== - -[TIP] -==== -The `events_to_wait_for` is expected to be iterable (a list). Make sure you pass a single event in a list if you only have one event to wait for. -==== - -[TIP] -==== -It should take about 20 seconds to run. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -While it is certainly useful to have some experience with async programming in Python, the context in which most data scientists will deal with it is writing APIs using something like `fastapi`, where a deep knowledge of async programming isn't really needed. - -What _will_ be pretty common is the need to speed up code. One of the primary ways to do that is to parallelize your code. - -In the previous project, in question (5), you described an operation that you could do to the entire flights dataset (`/anvil/projects/tdm/data/flights/subset`). In this situation, where you have a collection of neatly formatted datasets, a good first step would be to write a function that accepts a two paths as arguments. The first path could be the absolute path to the dataset to be processed. The second path could be the absolute path of the intermediate output file. Then, the function could process the dataset and output the intermediate calculations. - -For example, let's say you wanted to count how many flights in the dataset as a whole. Then, you could write a function to read in the dataset, count the flights, and output a file containing the number of flights. This would be easily parallelizable because you could process each of the files individually, in parallel, and at the very end, sum up the data in the output file. - -Write a function that is "ready" to be parallelized, and that follows the operation you described in question (5) in the previous project. Test out the function on at least 2 of the datasets in `/anvil/projects/tdm/data/flights/subset`. - -[TIP] -==== -In the next project, we will parallelize and run some benchmarks. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project03.adoc b/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project03.adoc deleted file mode 100644 index b3b9eee31..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project03.adoc +++ /dev/null @@ -1,202 +0,0 @@ -= TDM 30200: Project 3 -- 2023 - -**Motivation:** When working with large amounts of data, it is sometimes critical to take advantage of modern hardware and _parallelize_ the computation. Depending on the problem, parallelization can massively reduce the amount of time to process something. - -**Context:** This is the second in a series of 3 projects that explore sync vs. async, parallelism, and concurrency. For some, the projects so far may have been a bit intense. 
This project will slow down a bit, run some fun experiments, and try to _start_ clarifying some confusion that is sometimes present with terms like threads, concurrency, parallelism, cores, etc. - -**Scope:** Python, threads, parallelism, concurrency, joblib - -.Learning Objectives -**** -- Distinguish between threads and processes at a high level. -- Demonstrate the ability to parallelize code. -- Identify and approximate the amount of time to process data after parallelization. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/flights/subset/*.csv` - -== Questions - -=== Question 1 - -[IMPORTANT] -==== -Make sure to request a notebook with only 1 core to start this project. -==== - -`joblib` is a Python library that makes many parallelization tasks easier. Run the following code in three separate code cells. But, before you do, look at the code and write down approximately how much time you think each cell will take to run. 1 call to `run_for` will take roughly 2.8 - 3.2 seconds on an Anvil cpu. Take note that we currently have 1 cpu for this notebook. - -[source,python] ----- -import time -import joblib -from joblib import Parallel -from joblib import delayed - -def run_for(): - var = 0 - while var < 11**7.5: - var += 1 - - print(var) ----- - -[source,ipython] ----- -%%time -test = [run_for() for i in range(4)] ----- - -[source,ipython] ----- -%%time -test = Parallel(n_jobs=4, backend="multiprocessing")(delayed(run_for)() for i in range(4)) ----- - -[source,ipython] ----- -%%time -test = Parallel(n_jobs=4, backend="threading")(delayed(run_for)() for i in range(4)) ----- - -Were you correct? Great! We only have 1 cpu, so regardless if we chose to use 2 threads or 2 processes, only 1 cpu would be used and 1 thing executed at a time. - -**threading:** This backend for `joblib` will use threads to run the tasks. Even though we only have a single cpu, we can still create as many threads as we want, however, due to Python's GIL (Global Interpreter Lock), only 1 thread can execute at a time. - -**multiprocessing:** This backend for `joblib` will use processes to run the tasks. In the same way we can create as many threads as we want, we can also create as many processes as we want. A _process_ is created by an os function called `fork()`. A _process_ can have 1 or more _threads_ or _threads of execution_, in fact, typically a process must have at least 1 _thread_. _Threads_ are much faster and take fewer resources to create. Instead of `fork()` a thread is created by `clone()`. A single cpu can have multiple processes or threads, but can only execute 1 task at a time. As a result, we end up with the same amount of time used to run. - -When writing a program, you could make your program create a process that spawns multiple threads. Those threads could then each run in parallel, 1 per cpu. Alternatively, you could write a program that has a single thread of execution, and choose to execute the program _n_ times creating _n_ processes that each run in parallel, 1 per cpu. 
The advantage of the former is that threads are lighter weight and take less resources to create, an advantage of the latter is that you could more easily distribute such a program onto many systems to run without having to worry about how many threads to spawn based on how many cpus you have available. - -Okay, let's take a look at this next example. Run the following (still with just 1 cpu). - -[source,ipython] ----- -%%time -test = [time.sleep(2) for i in range(4)] ----- - -[source,ipython] ----- -%%time -test = Parallel(n_jobs=4, backend="multiprocessing")(delayed(time.sleep)(2) for i in range(4)) ----- - -[source,ipython] ----- -%%time -test = Parallel(n_jobs=4, backend="threading")(delayed(time.sleep)(2) for i in range(4)) ----- - -Did you get it right this time? If not, it is most likely that you thought all 3 would take about 8 seconds. We only have 1 cpu, after all. Let's try to explain. - -**threading:** Like we mentioned before, due to Python's GIL, we can only execute 1 thread at a time. So why did our example only take about 2 seconds if only 1 thread can execute at a time? `time.sleep` is a function that will release Python's GIL (Global Interpreter Lock) because it is not actually utilizing the CPU while sleeping. It is _not_ the same as running an intensive loop for 2 seconds (like our previous example). Therefore the first thread can execute, the GIL is released, the second thread begins execution, rinse and repeat. The only execution that occurs is each thread consecutively starting `time.sleep`. Then, after about 2 seconds all 4 `time.sleep` calls are done, even though the cpu was not utilized much at all. - -**multiprocessing:** In this case, we are bypassing the restrictions that the GIL imposes on threads, BUT, `time.sleep` still doesn't need the cpu cycles to run, so the end result is the same as the threading backend, and all calls "run" at the same time. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Okay, let's try something! Save your notebook (and output from question 1), and completely close and delete your ondemand session. Then, launch a new notebook, but instead of choosing 1 core, choose 4. Run the following code, but before you do, guess how much time each will take to run. - -[source,python] ----- -import time -import joblib -from joblib import Parallel -from joblib import delayed - -def run_for(): - var = 0 - while var < 11**7.5: - var += 1 - - print(var) ----- - -[source,ipython] ----- -%%time -test = [run_for() for i in range(4)] ----- - -[source,ipython] ----- -%%time -test = Parallel(n_jobs=4, backend="multiprocessing")(delayed(run_for)() for i in range(4)) ----- - -[source,ipython] ----- -%%time -test = Parallel(n_jobs=4, backend="threading")(delayed(run_for)() for i in range(4)) ----- - -How did you do this time? You may or may not have guessed, but the threading version took the same amount of time, but the multiprocessing backend was just about 4 times faster! What gives? - -Whereas Python's GIL will prevent more than a single thread from executing at a time, when `joblib` uses processes, it is not bound by the same rules. A _process_ is something created by the operating system that has its own address space, id, variables, heap, file descriptors, etc. 
As such, when `joblib` uses the multiprocessing backend, it creates new Python processes to work on the tasks, bypassing the GIL because it is _n_ separate processes and Python instances, not a single Python instance with _n_ threads of execution. - -In general, Python is not a good choice for writing a program that is best written using threads. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Okay, great, let me parallelize something! Okay, sounds good. - -The task is to count all of the lines in all of the files in `/anvil/projects/tdm/data/flights/subset/*.csv`, from the `1987.csv` to `2008.csv`, excluding all other csvs. - -First, write a non-parallelized solution that opens each file, counts the lines, adds the count to a total, closes the file, and repeats for all files. At the end, print the total number of lines. Put the code into a code cell and time the code cell using `%%time` magic. - -Now, write a parallelized solution that does the same thing. Put the code into a code cell and time the code cell using `%%time` magic. - -Make sure you are using a Jupyter Lab session with 4 cores. - -[TIP] -==== -Some optional tips: - -- Write a function that accepts an absolute path to a file (as a string), as well as an absolute path to a file in directory (as a string). -- The function should output the count of lines from the file represented by the first argument in the file specified in the second argument. -- Parallelize the function using `joblib`. -- After the `joblib` job is done, cycle through all of the output files, sum the counts, and print the total. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Parallelize the task and function that you have been writing about in the past 2 projects. If you are struggling or need help, be sure to ask for help in Piazza! If after further thinking, what you specified in the previous project is not easily parallelizable, feel free to change the task to some other, actually parallelizable task! - -Please time the task using `%%time` magic, both _before_ and _after_ parallelizing the task -- after all, its not any fun if you can't see the difference! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project04.adoc b/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project04.adoc deleted file mode 100644 index f2f23c22b..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project04.adoc +++ /dev/null @@ -1,145 +0,0 @@ -= TDM 30200: Project 4 -- 2023 - -== snow way, that is pretty quick - -**Motivation:** We got some exposure to parallelizing code in the previous project. Let's keep practicing in this project! - -**Context:** This is the last in a series of projects focused on parallelizing code using `joblib` and Python. 
- -**Scope:** Python, joblib - -.Learning Objectives -**** -- Demonstrate the ability to parallelize code using `joblib`. -- Identify and approximate the amount of time to process data after parallelization. -- Demonstrate the ability to scrape and process large amounts of data. -- Utilize `$SCRATCH` to store temporary data. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -Check out the data located here: https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/ - -Normally, you are provided with clean, or semi clean sets of data, you read it in, do something, and call it a day. In this project, you are going to go get your own data, and although the questions won't be difficult, they will have less guidance than normal. Try and tap in to what you learned in previous projects, and of course, if you get stuck just shoot us a message in Piazza and we will help! - -As you can see, the yearly datasets vary greatly in size. What is the average size in MB of the datasets? How many datasets are there (excluding non year datasets)? What is the total download size (in GB)? Use the `requests` library and either `beautifulsoup4` or `lxml` to scrape and parse the webpage, so you can calculate these values. - -[CAUTION] -==== -Make sure to exclude any non-year files. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -The `1893.csv.gz` dataset appears to be about 1/1000th of our total download size -- perfect! Use the `requests` package to download the file, write the file to disk, and time the operations using https://pypi.org/project/block-timer/[this package] (which is already installed). - -If you had a single CPU, approximately how long would it take to download and write all of the files (in minutes)? - -[TIP] -==== -The following is an example of how to write a scraped file to disk. - -[source,python] ----- -resp = requests.get(url) -with open("my_file.csv.gz", "wb") as f: - f.write(resp.content) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -You can request up to 128 cores using OnDemand and our Jupyter Lab app. Save your work and request a new session with a minimum of 4 cores. Write parallelized code that downloads _all_ of the datasets (from 1750 to 2022) into your `$SCRATCH` directory. Before running the code, estimate the total amount of time this _should_ take, given the number of cores you plan to use. - -[WARNING] -==== -If you request more than 4 cores, **please** make sure to delete your Jupyter Lab session once your code has run and instead use a session with 4 or fewer cores. -==== - -[CAUTION] -==== -There aren't datasets for 1751-1762 -- so be sure to handle this. Perhaps you could look at the `response.status_code` and make sure it is 200? -==== - -Time how long it takes to download and write all of the files. Was your estimation close (within a minute or two)? If not, have any theories as to why? You may get results that are _very_ variable, that is ok. Ultimately, we can only download as fast as that website will allow us to. 
Anytime you are dealing with another website or some factor outside of your control, unexpected behavior is to be expected. - -[TIP] -==== -Remember, your `$SCRATCH` directory is `/anvil/scratch/ALIAS` where `ALIAS` is your username, for example, `x-kamstut`. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -In a previous question, I provided you with the code to actually extract and save content from our `requests` response. This is the sort of task where it may not be obvious how to proceed. Learning how to use search engines like Google or Kagi to figure this out is critical. - -Figure out how to extract the csv file from each of the datasets using Python. Write parallelized code that loops through and extracts all of the data. The end result should be 1 csv file per year. Like in the previous question, measure the time it takes to extract 1 csv, and attempt to estimate how long it should take to extract all of them. Time the extraction and compare your estimation to reality. Were you close? Note that with something like this, where it is purely computational, parallelization should be much more predictable than, for example, downloading files from a website. - -[WARNING] -==== -If you request more than 4 cores, **please** make sure to delete your Jupyter Lab session once your code has run and instead use a session with 4 or fewer cores. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Unzipped, your datasets total 100 GB! That is a lot of data! - -You can read https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt[here] about what the data means. - -. 11 character station ID -. 8 character date in YYYYMMDD format -. 4 character element code (you can see the element codes https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt[here] in section III) -. value of the data (varies based on the element code) -. 1 character M-flag (10 possible values, see section III https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt[here]) -. 1 character Q-flag (14 possible values, see section III https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt[here]) -. 1 character S-flag (30 possible values, see section III https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt[here]) -. 4 character observation time (HHMM) (0700 = 7:00 AM) -- may be blank - -It has (maybe?) been a snowy week, so use your parallelization skills to figure out _something_ about snowfall. For example, maybe you want to find the last time, or the last year, in which X amount of snow fell. Or maybe you want to find the station id for the location that has had the most instances of over X amount of snow. Get creative! You may create plots to supplement your work (if you want). - -Any good effort will receive full credit. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project05.adoc b/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project05.adoc deleted file mode 100644 index b3100431e..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project05.adoc +++ /dev/null @@ -1,325 +0,0 @@ -= TDM 30200: Project 5 -- 2023 - -**Motivation:** In this project we will slowly get familiar with SLURM, the job scheduler installed on Anvil. - -**Context:** This is the first in a series of 3 projects focused on parallel computing using SLURM and Python. - -**Scope:** SLURM, unix, Python - -.Learning Objectives -**** -- Use basic SLURM commands to submit jobs, check job status, kill jobs, and more. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/coco/unlabeled2017/*.jpg` - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -[IMPORTANT] -==== -This project (and the next) will have different types of deliverables. Each question will result in an entry in a Jupyter notebook, and/or 1 or more additional Python and/or Bash scripts. - -To properly save screenshots in your Jupyter notebook, follow the guidelines xref:projects:current-projects:templates.adoc#including-an-image-in-your-notebook[here]. Images that don't appear in your notebook in Gradescope will not get credit. -==== - -[WARNING] -==== -When you start your JupyterLab session this week, BEFORE you start your session, please change "Processor cores requested" from 1 to 4. We will use 4 processing cores this week. -==== - -Most of the supercomputers here at Purdue, and Anvil, contain one or more frontends. Users can log in and submit jobs to run on one or more backends. To submit a job, users use SLURM. - -SLURM is a job scheduler found on about 60% of the top 500 supercomputers.footnote:[https://en.wikipedia.org/wiki/Slurm_Workload_Manager[https://en.wikipedia.org/wiki/Slurm_Workload_Manager]] In this project (and the next) we will learn about ways to schedule jobs on SLURM, and learn the tools used. - -Let's get started by using a program called `salloc`. A brief explanation is that `salloc` gets some resources (think memory and cpus), and runs the commands specified by the user. If the user doesn't specify any commands, it will open the user's default shell (`bash`, `zsh`, `fish`, etc.) in the allocated resource. - -Open a terminal and give it a try. - -[source,bash] ----- -salloc -A cis220051 -p shared -n 3 -c 1 -t 00:05:00 --mem-per-cpu=1918 ----- - -After some output, you should notice that your shell changed. Type `hostname` followed by enter to see that your host has changed from `loginXX.anvil.rcac.purdue.edu` to `aXXX.anvil.rcac.purdue.edu`. You are in a different system! Very cool! - -To find out what the other options are read https://slurm.schedmd.com/salloc.html - -- The `-A cis220051` option could have also been written `--account=cis220051`. This indicates which account to use when allocating the resources (memory and cpus). You can also think of this as a "queue" or "the datamine queue". Jobs submitted using this option will use the resources we pay for. 
Only users with permissions can submit to our queue. -- The `-n 3` option could have also been written `--ntasks=3`. This indicates how many _tasks_ we may need for the job. -- The `-c 1` option could have also been written `--cpus-per-task=1`. This indicates the number of cores per _task_. -- The `-t 00:05:00` option could have also been written `--time=00:05:00`. This indicates how long the job may run for. If the time exceeds the time limit, the job is killed. -- The `--mem-per-cpu=1918` option indicates how much memory (in MB) we may need for each _cpu_ in the job. - -To confirm, use the following script to see how much memory and cpus we have available to us in this `salloc` session. Copy and paste the contents of this script in a file called `get_info.py` in your `$HOME` directory. After saved, make sure it is executable by running the following command. - -[source,bash] ----- -chmod +x $HOME/get_info.py ----- - -[source,python] ----- -#!/usr/bin/env python3 - -import socket -import os -from pathlib import Path -from datetime import datetime -import time - -def main(): - - time.sleep(5) - - print(f'Hostname: {socket.gethostname()}') - - with open("/proc/self/cgroup") as file: - for line in file: - if 'cpuset' in line: - cpu_loc = "cpuset" + line.split(":")[2].strip() - - if 'memory' in line: - mem_loc = "memory" + line.split(":")[2].strip() - - base_loc = Path("/sys/fs/cgroup/") - with open(base_loc / cpu_loc / "cpuset.cpus") as file: - num_cpu_sets = file.read().strip().split(",") - num_cpus = 0 - for s in num_cpu_sets: - if len(s.split("-")) == 1: - num_cpus += 1 - else: - num_cpus += (int(s.split("-")[1]) - int(s.split("-")[0]) + 1) - - print(f"CPUs: {num_cpus}") - - with open(base_loc / mem_loc / "memory.limit_in_bytes") as file: - mem_in_bytes = int(file.read().strip()) - print(f"Memory: {mem_in_bytes/1024**2} Mbs") - -if __name__ == "__main__": - print(f'started at: {datetime.now()}') - main() - print(f'finished at: {datetime.now()}') ----- - -To use it. - -[source,bash] ----- -~/get_info.py ----- - -For this question, add a screenshot of running `hostname` on the `salloc` session, as well as `~/get_info.py` to your notebook. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -`salloc` can be useful, but most of the time we want to run a _job_. - -Before we get started, read the answer to https://stackoverflow.com/questions/46506784/how-do-the-terms-job-task-and-step-relate-to-each-other[this] stackoverflow post. In many instances, it is easiest to use 1 cpu per task, and let SLURM distribute those tasks to run. In this course, we will use this simplified model. - -So what is the difference between `srun` and `sbatch`? https://stackoverflow.com/questions/43767866/slurm-srun-vs-sbatch-and-their-parameters[This] stackoverflow post does a pretty great job explaining. You can think of `sbatch` as the tool for submitting a job script for execution, and `srun` as the tool to submit a job to run. We will test out both! - -In the previous question, we used `salloc` to get the resources, hop onto the system, and run `hostname` along with our `get_info.py` script. - -Use `srun` to run our `get_info.py` script, to better understand how the various options work. Try and guess the results of the script for each configuration. - -[TIP] -==== -Be sure to give you `get_info.py` script execution permissions if you haven't already. 
- -[source,bash] ----- -chmod +x get_info.py ----- -==== - -When inside a SLURM job, a variety of environment variables are set that alters how `srun` behaves. If you open a terminal from within Jupyter Lab and run the following, you will see. - -[source,bash] ----- -env | grep -i slurm ----- - -These variables altered the behavior of `srun`. We _can_ however, _unset_ these variables, and the behavior will revert to the default behavior. In your terminal, run the following. - -[source,bash] ----- -for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done; ----- - -Confirm that the environment variables are unset by running the following. - -[source,bash] ----- -env | grep -i slurm ----- - -[WARNING] -==== -You must repeat this process each new terminal you'd like to use within Jupyter Lab. This means that if you work on this project a while, and reopen it the next day to work on, you will need to repeat the bash command to remove the SLURM environment variables. -==== - -Great! Now, we can work in our nice Jupyter Lab environment without any concern that SLURM environment variables are changing any behaviors. Let's test it out with something _actually_ predictable. - -.first set of configurations to try ----- -srun -A cis220051 -p shared -n 2 -c 1 -t 00:00:05 $HOME/get_info.py -srun -A cis220051 -p shared -n 1 -c 2 -t 00:00:05 $HOME/get_info.py ----- - -[NOTE] -==== -Note that when using `-n 2 -c 1` it will create 2 _tasks_ that each run `$HOME/get_info.py`. Since there is 1 cpu per task, we get 2 cpus. If we used `-n 1 -c 2`, we would get 1 tasks that runs `$HOME/get_info.py` and since we requested 2 cpus per task, we get 2 cpus. -==== - -.second set of configurations to try ----- -srun -A cis220051 -p shared -n 1 -c 2 --mem=1918 -t 00:00:05 $HOME/get_info.py -srun -A cis220051 -p shared -n 1 -c 2 --mem-per-cpu=1918 -t 00:00:05 $HOME/get_info.py -srun -A cis220051 -p shared -n 2 -c 1 --mem-per-cpu=1918 -t 00:00:05 $HOME/get_info.py ----- - -[NOTE] -==== -Note how `--mem=1918` requests a total of 1918 MB of memory for the job. We end up getting the expected amount of memory for the last two `srun` commands as well. -==== - -.third set of configurations to try ----- -srun -A cis220051 -p shared -n 1 -c 2 --mem-per-cpu=1918 -t 00:00:05 $HOME/get_info.py -srun -A cis220051 -p shared -n 1 -c 2 --mem-per-cpu=1919 -t 00:00:05 $HOME/get_info.py ----- - -[NOTE] -==== -Here, take careful note that when we increase our memory per cpu from 1918 to 1919 something important happens -- we are granted double the CPUs we requested! This is because, SLURM on Anvil is configured to give us at max 1918 MB of memory per CPU. If you request more memory, you will be granted additional CPUs. This is why https://ondemand.anvil.rcac.purdue.edu was recently configured to request only the number of cores you want -- because if you requested 1 core, but 4 GB of memory, you would get 3 cores, but only 4GB of memory, when you could have received 1918*3 = 5754 MB of memory instead of just 4 GB. -==== - -.fourth set of configurations to try ----- -srun -A cis220051 -p shared -n 3 -c 1 -t 00:00:05 $HOME/get_info.py > $SCRATCH/get_info.out ----- - -[NOTE] -==== -Check out the `get_info.py` script. SLURM on Anvil uses cgroups to manage resources. Some of the more typical commands used to find the number of cpus and amount of memory don't work accurately when "within" a cgroup. This script figures out which cgroups you are "in" and parses the appropriate files to get your resource limitations. 
-==== - -Reading the explanation from SLURM's website is likely not enough to understand, running the configurations will help your understanding. If you have simple, parallel processes, that doesn't need to have any shared state, you can use a single `srun` per task. Each with `--mem-per-cpu` (so memory availability is more predictable), `-n 1`, `-c 1`, followed by `&` (just a reminder that `&` at the end of a bash command puts the process in the background). - -Finally, take note of the last configuration. What is the `$SCRATCH` environment variable? - -For the answer to this question: - -. Add a screenshot of the results of some (not all) of you running the `get_info.py` script in the `srun` commands. -. Write 1-2 sentences about any observations. -. Include what the `$SCRATCH` environment variable is. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -The following is a solid template for a job script. - -.my_job.sh ----- -#!/bin/bash -#SBATCH --account=cis220051 <1> -#SBATCH --partition=shared <2> -#SBATCH --job-name=serial_job_test <3> -#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL) <4> -#SBATCH --mail-user=me@purdue.edu # Where to send mail <5> -#SBATCH --ntasks=3 # Number of tasks (total) <6> -#SBATCH --cpus-per-task=1 # Number of CPUs per task <7> -#SBATCH -o /dev/null # Output to dev null <8> -#SBATCH -e /dev/null # Error to dev null <9> - -srun -n 1 -c 1 --mem-per-cpu=1918 --exact -t 00:00:05 $HOME/get_info.py > first.out & <10> -srun -n 1 -c 2 --mem-per-cpu=1918 --exact -t 00:00:05 $HOME/get_info.py > second.out & <11> -srun -n 1 -c 3 --mem-per-cpu=1918 --exact -t 00:00:05 $HOME/get_info.py > third.out & <12> - -wait <13> ----- - -<1> Sets the account to use for billing -- in this case our account is cis220051. -<2> Sets the partition to use -- in this case we are using the shared partition. -<3> Give your job a unique name so you can identify it in the queue. -<4> Specify when you want to receive emails about your _job_. We have it set to notify us when the job ends or fails. -<5> Specify the email address to send the emails to. -<6> Specify the number of _tasks_, in total, to run within this job. -<7> Specify the number of _cores_ to use for each _task_. -<8> Redirect the output of the job to `/dev/null`. This is a special file that discards all output. You could change this to `$HOME/output.txt` and the contents would be written to that file. -<9> Redirect the error output of the job to `/dev/null`. This is a special file that discards all output. You could change this to `$HOME/error.txt` and the contents would be written to that file. -<10> The first _step_ of the _job_. This _step_ contains a single _task_, that uses a single _core_. -<11> The second _step_ of the _job_. This _step_ contains a single _task_, that uses two _cores_. -<12> The third _step_ of the _job_. This _step_ contains a single _task_, that uses three _cores_. -<13> Wait for all _steps_ to complete. Very important to include. - -Update the template to give your job a unique name, and to set the email to your Purdue email address. - -To submit a job, run the following. - -[source,bash] ----- -sbatch my_job.sh ----- - -Run the following experiments by tweaking `my_job.sh`, submitting the job using `sbatch`, and then checking the output of `first.out`, `second.out`, and `third.out`. - -. Run the original job script and note the time each of the steps finished relative to the other steps. -. 
Change the **job script** `--cpus-per-task` from 1 to 2. What happens to the finish times? -. Remove `--exact` from each of the **job steps**. What happens to the finish times? - -In addition, please feel free to experiment with the various values, and see how the values effect the finish times and/or output of our `get_info.py` script. Can you determine how things work? Write 1-2 sentences about your observations. Please do take the time to iterate on this question over and over until you get a good feel for how things work. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Make your job script run for at least 20 seconds -- you can do this by adding more steps, reducing cpus, or modifying the `time.sleep` call in the `get_info.py` script. Submit the job using `sbatch`. Immediately after submitting the job, use the built in `squeue` command, in combination with `grep` to find the job id of your job. - -[TIP] -==== -What is `squeue`? https://slurm.schedmd.com/squeue.html[Here] are the docs. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project06.adoc b/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project06.adoc deleted file mode 100644 index bea71eff0..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project06.adoc +++ /dev/null @@ -1,339 +0,0 @@ -= TDM 30200: Project 6 -- 2023 - -**Motivation:** In this project we will slowly get familiar with SLURM, the job scheduler installed on Anvil. - -**Context:** This is the second in a series of (now) 4 projects focused on parallel computing using SLURM and Python. - -**Scope:** SLURM, unix, Python - -.Learning Objectives -**** -- Use basic SLURM commands to submit jobs, check job status, kill jobs, and more. -- Understand the differences between `srun` and `sbatch` commands. -- Predict the resources (cpus and memory) an `srun` job will use based on the arguments and context. -- Write and use a job script to solve a problem faster than you would be able to without a high performance computing (HPC) system. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/coco/unlabeled2017/*.jpg` -- `/anvil/projects/tdm/data/coco/attempt02/*.jpg` - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -The more you practice the clearer your understanding will be. So we will be putting our new skills to use to solve a problem. - -We begin with a dataset full of images: `/anvil/projects/tdm/data/coco/unlabeled2017/*.jpg`. - -We know a picture of Dr. Ward is (naturally) included in the folder. 
The problem is, Dr. Ward is sneaky and he has added a duplicate image of himself in our dataset. This duplicate could cause problems and we need a clean dataset. - -It is time consuming and not best practice to manually go through the entire dataset to find the duplicate. Thinking back to some of the past work, we remember that a hash algorithm is a good way to identify the duplicate image. - -Below is code you could use to produce a hash of an image. - -[source,python] ----- -import hashlib - -with open("/anvil/projects/tdm/data/coco/unlabeled2017/000000000013.jpg", "rb") as f: - print(hashlib.sha256(f.read()).hexdigest()) ----- - -[NOTE] -==== -In general a hash function, is a function that takes an input and produces a unique "hash", or alphanumeric string. Meaning if you find two identical hashes, most likely you can assume that the inputs are identical. -==== - -By finding the hash of all of the images in the first folder, then using sets to quickly find the duplicate image. You can write a Python script that outputs a file containing the hash of each image - -[NOTE] -==== -For our example, the file `000000000013.jpg` has the hash `7ad591844b88ee711d1eb60c4ee6bb776c4795e9cb4616560cb26d2799493afe`. -==== - -Parallelize the file creating and search process will make finding the duplicate faster. - -[source,python] ----- -#!/usr/bin/python3 - -import os -import sys -import hashlib -import argparse - - -def hash_file_and_save(files, output_directory): - """ - Given an absolute path to a file, generate a hash of the file and save it - in the output directory with the same name as the original file. - """ - - for file in files: - file_name = os.path.basename(file) - file_hash = hashlib.sha256(open(file, "rb").read()).hexdigest() - output_file_path = os.path.join(output_directory, file_name) - with open(output_file_path, "w") as output_file: - output_file.write(file_hash) - - -def main(): - - parser = argparse.ArgumentParser() - subparsers = parser.add_subparsers(help="possible commands", dest='command') - hash_parser = subparsers.add_parser("hash", help="generate and save hash") - hash_parser.add_argument("files", help="files to hash", nargs="+") - hash_parser.add_argument("-o", "--output", help="directory to output file to", required=True) - - if len(sys.argv) == 1: - parser.print_help() - sys.exit(1) - - args = parser.parse_args() - - if args.command == "hash": - hash_file_and_save(args.files, args.output) - -if __name__ == "__main__": - main() ----- - -Quickly recognizing that it is not efficient to have an `srun` command for each image, you'd have to programmatically build the job script, also the script runs quickly so there would be a rapid build up wasted time with overhead related to calling `srun`, allocating resources, etc. Instead for efficency create a job script that splits the images into groups of 12500 or less. Then, within 10 `srun` commands you will be able to use the provided Python script to process the 12500 images. - -The Python script we've provided works as follows. - -[source,bash] ----- -./hash.py hash --output /path/to/outputfiles/ /path/to/image1.jpg /path/to/image2.jpg ----- - -The above command will generate a hash of the two images (although there could be _n_ images provided) and save the hash in the output directory with the same name as the original image. For example, the following command will calculate the hash of the image `000000000013.jpg` and save it in a file named `000000000013.jpg` in the `$SCRATCH` directory. 
This file is **not** an image -- it is a _text_ file containing the hash, `7ad591844b88ee711d1eb60c4ee6bb776c4795e9cb4616560cb26d2799493afe`. You can see this by running `cat $SCRATCH/000000000013.jpg`. - -[source,python] ----- -./hash.py hash --output $SCRATCH /anvil/projects/tdm/data/coco/unlabeled2017/000000000013.jpg ----- - -[IMPORTANT] -==== -You'll need to give execute permissions to your `hash.py` script. You can do this with `chmod +x hash.py`. -==== - -[TIP] -==== -https://stackoverflow.com/questions/21668471/bash-script-create-array-of-all-files-in-a-directory[This] stackoverflow post shows how to get a Bash array full of absolute paths to files in a folder. -==== - -[TIP] -==== -To pass many arguments (_n_ arguments) to our Python script, you can `./hash.py hash --output /path/to/outputfiles/ ${my_array[@]}`. -==== - -[TIP] -==== -https://stackoverflow.com/questions/23747612/how-do-you-break-an-array-in-groups-of-n[This] stackoverflow post shows how to break an array of values into groups of _x_. -==== - -[TIP] -==== -Don't forget to clear out the SLURM environment variables in any new terminal session: - -[source,bash] ----- -for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done; ----- -==== - -Create a job script that processes all of the images in the folder, and outputs the hash of each image into a file with the same name as the original image. Output these files into a folder in `$SCRATCH`, so, for example, `$SCRATCH/q1output`. You will likely want to create the `q1output` directory before running your job script. - -[NOTE] -==== -This job took about 3 minutes and 32 seconds to run. Finding the duplicate image took about 36 seconds. -==== - -Once the images are all hashed, in your Jupyter notebook, write Python code that processes all of the hashes (by reading the files you've saved in `$SCRATCH/q1output`) and prints out the name of one of the duplicate images. Display the image in your notebook using the following code. - -[source,python] ----- -from IPython import display -display.Image("/path/to/duplicate_image.jpg") ----- - -To answer this question, submit the functioning job script AND the code in the Jupyter notebook that was used to find (and display) the duplicate image. - -[TIP] -==== -Using sets will help find the duplicate image. One set can store new hashes that haven't yet been seen. The other set can store duplicates, since there is only 1 duplicate you can immediately return the filename when found! - -https://stackoverflow.com/questions/9835762/how-do-i-find-the-duplicates-in-a-list-and-create-another-list-with-them[This] stackoverflow post shares some ideas to manage this. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -In the previous question, you were able to use the sha256 hash to efficiently find the extra image that the trickster Dr. Ward added to our dataset. Dr. Ward, knowing all about hashing algorithms, thinks he has a simple solution to circumventing your work. In the "new" dataset: `/anvil/projects/tdm/data/coco/attempt02`, he has modified the value of a single pixel of his duplicate image. - -Re-run your SLURM job from the previous question on the _new_ dataset, and process the results to try to find the duplicate image. Was Dr. Ward's modification successful? Do your best to explain why or why not. - -[TIP] -==== -I would start by creating a new folder in `$SCRATCH` to store the new hashes. 
- -[source,bash] ----- -mkdir $SCRATCH/q2output ----- - -Next, I would update the job script to output files to the new directory, and change the directory of the input files to the new dataset. -==== - -[NOTE] -==== -If at this point in time you are wondering "why would we do this when we can just use `joblib` and get 128 cores and power through some job?". The answer is because `joblib` will be limited to the number of cpus on the given node you are running your Python script on. SLURM allows us to allocate _well_ over 128 cpus, and has much higher computing potential! In addition to that, it is (arguably) easier to write a single threaded Python job to run on SLURM, than to parallelize your code using `joblib`. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Unfortunately, Dr. Ward was right, and our methodology didn't work. Luckily, there is a cool technique called perceptual hashing that is _almost_ meant just for this! Perceptual hashing is a technique that can be used to know whether or not any two images appear the same, without actually _viewing_ the images. The general idea is this. Given two images that are _essentially_ the same (maybe they have a few different pixels, have been cropped, gone through a filter, etc.), a perceptual hash can give you a very good idea whether the images are the "same" (or close enough). Of course, it is not a perfect tool, but most likely good enough for our purposes. - -To be a little more specific, two images are very likely the same if their perceptual hashes are the same. If two perceptual hashes are the same, their Hamming distance is 0. For example, if your hashes were: `8f373714acfcf4d0` and `8f373714acfcf4d0`, the Hamming distance would be 0, because if you convert the hexadecimal values to binary, at each position in the string of 0s and 1s, the values are identical. If 1 of the 0s and 1s didnt match after converting to binary, this would be a Hamming distance of 1. - -Use the https://github.com/JohannesBuchner/imagehash[`imagehash`] library, and modify your job script from the previous project to use perceptual hashing instead of the sha256 algorithm to produce 1 file for each image where the filename remains the same as the original image, and the contents of the file contains the hash. - -[WARNING] -==== -Make sure to clear out your slurm environment variables before submitting your job to run with `sbatch`. If you are submitting the job from a terminal, run the following. - -[source,bash] ----- -for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done; -sbatch my_job.sh ----- - -If you are in a bash cell in Jupyter Lab, do the same. - -[source,ipython] ----- -%%bash - -for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done; -sbatch my_job.sh ----- -==== - -[IMPORTANT] -==== -In order for the `imagehash` library to work, we need to make sure the dependencies are loaded up. To do this, we will use the container where our environment is stored: - -[source,bash] ----- -#!/bin/bash -#SBATCH --account=datamine -...other SBATCH options... - -srun ... singularity exec /anvil/projects/tdm/apps/containers/images/python:f2022-s2023.sif python3 /path/to/new/hash.py & - -wait ----- -==== - -[TIP] -==== -To help get you going using this package, let me demonstrate using the package. 
- -[source,python] ----- -import imagehash -from PIL import Image - -my_hash = imagehash.phash(Image.open("/anvil/projects/tdm/data/coco/attempt02/000000000008.jpg")) -print(my_hash) # d16c8e9fe1600a9f -my_hash # numpy array of True (1) and False (0) values -my_hex = "d16c8e9fe1600a9f" -imagehash.hex_to_hash(my_hex) # numpy array of True (1) and False (0) ----- -==== - -[IMPORTANT] -==== -Make sure that you pass the hash as a string to the `output_file.write` method. So something like: `output_file.write(str(file_hash))`. -==== - -[IMPORTANT] -==== -Make sure that once you've written your script, `my_script.sh`, that you submit it to SLURM using `sbatch my_script.sh`, _not_ `./my_script.sh`. -==== - -[TIP] -==== -It would be a good idea to make sure you've modified your hash script to work properly with the `imagehash` library. Test out the script by running the following (assuming your Python code is called `hash.py`, and it is in your `$HOME` directory. - -[source,bash] ----- -$HOME/hash.py hash --output $HOME /anvil/projects/tdm/data/coco/attempt02/000000000008.jpg ----- - -This should produce a file, `$HOME/000000000008.jpg`, containing the hash of the image. -==== - -[WARNING] -==== -Make sure your `hash.py` script has execute permissions! - -[source,bash] ----- -chmod +x $HOME/hash.py ----- -==== - -Process the results. Did you find the duplicate image? Explain what you think could have happened. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -What!?! That is pretty cool! You found the "wrong" duplicate image? Well, I guess it is totally fine to find multiple duplicates. Modify the code you used to find the duplicates so it finds all of the duplicates and originals. In total there should be 50. Display 2-5 of the pairs (or triplets or more). Can you see any of the subtle differences? Hopefully you find the results to be pretty cool! If you look, you _will_ find Dr. Wards hidden picture, but you do not have to exhaustively display all 50 images. - -[WARNING] -==== -Please turn in all 3 job scripts (for questions 1-3). Please turn in both `hash.py` files (for questions 2-3). Please turn in your Jupyter Notebook that demonstrates finding the duplicates for questions 1 and 3, and 4. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project07.adoc b/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project07.adoc deleted file mode 100644 index 7991a8494..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project07.adoc +++ /dev/null @@ -1,64 +0,0 @@ -= TDM 30200: Project 7 -- 2023 - -**Motivation:** In this project we will slowly get familiar with SLURM, the job scheduler installed on Anvil. - -**Context:** This is the third in a series of 4 projects focused on parallel computing using SLURM and Python. 
- -**Scope:** SLURM, UNIX, Python - -.Learning Objectives -**** -- Use basic SLURM commands to submit jobs, check job status, kill jobs, and more. -- Understand the differences between srun and sbatch commands. -- Predict the resources (cpus and memory) an srun job will use based on the arguments and context. -- Write and use a job script to solve a problem faster than you would be able to without a high performance computing (HPC) system. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -You are free to use _any_ dataset you would like for this project, even if the data is created or collected by you. - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -You've been exposed to a lot about SLURM in a short period of time. For this last question, we are going to let you go at your own pace. - -Think of a problem that you want to solve that may benefit from parallel computing and SLURM. It could be anything: processing many images in some way (counting pixels, applying filters, analyzing for certain objects, etc.), running many simulations to plot, bootstrapping a model to get point estimates for uncertainty quantification, calculating summary information about a large dataset, trying to guess a 6 character password, calculating the Hamming distance between all of the 123k images in the `coco/hashed02` dataset, etc. - -Solve your problem, or make progress towards solving your problem. The following are the loose requirements. As long as you meet these requirements, you will receive full credit. The idea is to get some experience, and have some fun. - -**Requirements:** - -. You must have an introductory paragraph clearly explaining your problem, and how you think using a cluster and SLURM can help you solve it. -. You must submit any and all code you wrote. It could be in any language you want, just put it in a code block in a Markdown cell. -. You must write and submit a job script to be submitted using `sbatch` on SLURM. This could be copy and pasted into a code block in a markdown cell. -. You must measure the time it takes to run your code on a sample of your data, and make a prediction for how long it will take using SLURM, based on the resources you requested in your job script. Write 1-2 sentences explaining how long the sample took and the math you used to predict how long you think SLURM will take. -. You must write 1-2 sentences explaining how close or far away your prediction was from the actual run time. - -The above requirements should be all kept in a Jupyter notebook. The notebook should take advantage of markdown formatting, and narrate a clear story, with a clear objective, and explanations of any struggles or issues you ran into along the way. - -[IMPORTANT] -==== -To not hammer our resources _too_ much, please don't request more than 32 cores, and if you use more than 10 cores, please make sure your jobs don't take _tons_ of time. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project08.adoc b/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project08.adoc deleted file mode 100644 index 4dfb9226f..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project08.adoc +++ /dev/null @@ -1,328 +0,0 @@ -= TDM 30200: Project 8 -- 2023 -:page-mathjax: true - -**Motivation:** Machine learning and AI are huge buzzwords in industry, and two of the most popular tools surrounding said topics are the `pytorch` and `tensorflow` libraries -- `JAX` is another tool by Google growing in popularity. These tools are libraries used to build and use complex models. If available, they can take advantage of GPUs to speed up parallelizable code by a hundred or even thousand fold. - -**Context:** This is the first in a series of 4-5 projects focused on `pytorch` (and perhaps `JAX`). The purpose of these projects is to give you exposure to these tools, some basic functionality, and to show _why_ they are useful, without needing any special math or statistics background. - -**Scope:** Python, pytorch - -.Learning Objectives -**** -- Demystify a "tensor". -- Utilize the `pytorch` API to create, modify, and operate on tensors. -- Use simple, simulated data to create a multiple linear regression model using closed form solutions. -- Use `pytorch` to calculate a popular uncertainty quantification, the 95% confidence interval. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/sim/train.csv` -- `/anvil/projects/tdm/data/sim/test.csv` - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -While data frames are a great way to work with data, they are not the only way. Many high performance parts of code are written using a library like `numpy` or `pytorch`. These libraries are optimized to be extremely efficient with computation. Multiplying, transposing, inversing matrices can take time, and these libraries can make you code run blazingly fast. - -This project is the first project in a series of projects focused on the `pytorch` library. It is difficult to understand why a library like `pytorch` is useful without introducing _some_ math. This series of projects will involve some math, however, only at a very high level. Some intuition will be presented as notes, but what is really needed is the ability to read some formulas, and perform the appropriate computations. Throughout this series of projects, we will do our best to ensure that math or statistics is not at all a barrier to completing these projects and getting familiar with `pytorch`. If it does end up an issue, please post in Piazza and we will do our best to address any issues as soon as possible. - -This first project will start slowly, and only focus on the `numpy` -like functionality of `pytorch`. We've provided you with a set of 100 observations. 
75 of the observations are in the `train.csv` file, 25 are in the `test.csv` file. We will build a regression model using the data in the `train.csv` file. In addition, we will calculate some other statistics. Finally, we will (optionally) test our model out on new data in the `test.csv` dataset. - -Start by reading the `train.csv` file into a `pytorch` tensor. - -[TIP] -==== -[source,python] ----- -import torch -import pandas as pd - -dat = pd.read_csv('/anvil/projects/tdm/data/sim/train.csv') -x_train = torch.tensor(dat['x'].to_numpy()) -y_train = torch.tensor(dat['y'].to_numpy()) ----- -==== - -[NOTE] -==== -A tensor is just a n-dimensional array. -==== - -Use `matplotlib` or `plotly` to plot the data on a scatterplot -- `x_train` on the x-axis, and `y_train` on the y-axis. After talking to your colleague, you agreed that the data is clearly following a 2nd order polynomial. Something like: - -$y = \beta_0 + \beta_1x + \beta_2x^2$ - -Our goal will be to estimate the values of $\beta_0$, $\beta_1$, and $\beta_2$ using the data in `x_train` and `y_train`. Then, we will have a model that could look something like: - -$y = 1.2 + .4x + 2.2x^2$ - -Then, for any given value of x, we can use our model to predict the value of y. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -In order to build our model, we need to estimate our parameters: $\beta_0$, $\beta_1$, and $\beta_2$. Luckily, linear regression has a closed form solution, so we can calculate these values directly with the following equation. - -$\hat{\beta} = (X^{T} X)^{-1} X^{T} y$ - -What do these symbols mean? X is a matrix (or tensor), where each column is a term in the polynomial, and each row is an observation. So, for our polynomial, if our X data was simply: 1, 2, 3, 4, the X matrix (or design matrix) would be the following: - -.X ----- -1, 1, 1 -1, 2, 4 -1, 3, 9 -1, 4, 16 ----- - -Here, the first column is the constant term, the second column is the term of x, the third column is the term of $x^2$, and so on. - -When we raise the matrix to the "T" this means to transpose the matrix. The transpose of X, for example, would look like: - -.X^T ----- -1, 1, 1, 1 -1, 2, 3, 4 -1, 3, 9, 16 ----- - -When we raise the matrix to the "-1" this means to invert the matrix. - -Finally, placing these matrices next to each other means we need to perform matrix multiplication. - -`pytorch` has built in functions to do all of these operations: `torch.mm`, `mat.T`, and `torch.inverse`. - -Lastly, `y` is the tensor containing the observations in `y_train`. - -[IMPORTANT] -==== -Tensors must be the correct dimensions before they can be multiplied together using `torch.mm`. By default, `x_train` and `y_train` will be a single row and 75 columns. In order to change this to be a single column and 75 rows, we would need to use the `reshape` method: `x_train.reshape(75,1)`. - -When doing matrix multiplication, it is important that the tensors are aligned properly. A 4x1 matrix would be a matrix that has 4 rows and 1 column (the first number always represents the number of rows, the second always represents the number of columns). - -In order to multiply 2 matrices together, the number of columns in the first matrix must equal the number of rows in the second matrix. The resulting matrix would then have the number of rows as the first matrix, and the number of columns of the second matrix. So, if we multiplied a 4x3 matrix with a 3x5 matrix, the result would be a 4x5 matrix. 
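For example, here is a quick, hypothetical sanity check of these shape rules using `pytorch` (the tensor values are just placeholder 1's; only the shapes matter here):

[source,python]
----
import torch

a = torch.ones(4, 3)  # 4 rows, 3 columns
b = torch.ones(3, 5)  # 3 rows, 5 columns

# the columns of a (3) match the rows of b (3), so this works,
# and the result has the rows of a (4) and the columns of b (5)
result = torch.mm(a, b)
print(result.shape)   # torch.Size([4, 5])

# torch.mm(b, a) would raise a RuntimeError, because the
# columns of b (5) do not match the rows of a (4)
----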
- -These rules are important, because the tensors must be the correct shape (correct number of rows and columns) before we perform matrix multiplication, otherwise we will get an error. - -The `reshape` method allows you to specify the number of rows and columns in the tensor, for example, `x_train.reshape(75,1)`, would result in a matrix with 75 rows and a single column. You will need to be careful to make sure your tensors are the correct shape before multiplication. -==== - -Start by creating a new tensor called `x_mat` that is 75 rows and 3 columns. The first column should be filled with 1's (using `torch.ones(x_train.shape[0]).reshape(75,1)`), the second column should be the values in `x_train`, the third column should be the values in `x_train` squared. Use `torch.cat` to combine the 75x1 tensors into a single 75x3 tensor (`x_mat`). - -[IMPORTANT] -==== -Make sure you reshape all of your tensors to be 75x1 _before_ you use `torch.cat` to combine them into a 75x3 tensor. -==== - -[TIP] -==== -Operations like addition and subtraction are vectorized. For example, the following would result in a 75x1 tensor of 2's. - -[source,python] ----- -x = torch.ones(75,1) -x*2 ----- - -The following would result in a 1x75 tensor of .5's. - -[source,python] ----- -x = torch.ones(1,75) -x/2 ----- -==== - -[TIP] -==== -Remember, in Python, you can use: - -[source,python] ----- -** ----- - -to raise a number to a power. For example $2^3$ would be - -[source,python] ----- -2**3 ----- -==== - -[TIP] -==== -To get the transpose of a tensor 2 dimension tensor in `pytorch` you could use `x_mat.T`, or `torch.transpose(x_mat, 0, 1)`, where 0 is the first dimension to transpose and 1 is the second dimension to transpose. -==== - -Calculate our estimates for $\beta_0$, $\beta_1$, and $\beta_2$, and save the values in a tensor called `betas`. The following should be the successful result. - -.results ----- -tensor([[ 4.3677], - [-1.7885], - [ 0.4840]], dtype=torch.float64) ----- - -Now that you know the values for $\beta_0$, $\beta_1$, and $\beta_2$, what is our model (as an equation)? It should be: - -$y = 4.3677-1.7885x+.4840x^2$ - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -That is pretty cool, and very fast. Now, for any given value of x, we can predict a value of y. Of course, we _could_ write a predict function that accepts a value x, and returns our prediction y, and apply that function to each of the x values in our `x_train` tensor, however, this can be accomplished even faster and more flexibly using matrix multiplication -- simply use the following formula: - -$\hat{y} = X\hat{\beta}$ - -Where X is the `x_mat` tensor from earlier, and $\hat{\beta}$ is the `betas` tensor from question (2). Use `torch.mm` to multiply the two matrices together. Save the resulting tensor to a variable called `y_predictions`. Finally, create side by side scatterplots. In the first scatterplot, put the values in `x_train` on the x-axis and the values of `y_train` on the y-axis. In the second scatterplot put the values of `x_train` on the x-axis, and your predictions (`y_predictions`) on the y-axis. - -Very cool! Your model should be killing it (after all, we generated this data to follow a known distribution). - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 4 - -To better understand our model, let us create one of the most common forms of uncertainty quantification, confidence intervals. Confidence intervals (95% confidence intervals) show you the range of values (for each x) where we are 95% confident that the average value y for a given x is within the range. - -The formula is the following: - -$\hat{y_h} \pm t_{(\alpha/2, n-p)} * \sqrt{MSE * diag(x_h(X^{T} X)^{-1} x_h^{T})}$ - -$MSE = \frac{1}{n-p}\sum_{i=1}^{n}(Y_i - \hat{Y_i})^2$ - -Since we are calculating the 95% confidence interval for the values of x in our `x_train` tensor, we can simplify this to: - -$\hat{Y} \pm 1.993464 * \sqrt{MSE * diag(X(X^{T} X)^{-1} X^{T})}$ - -$\frac{1}{72}\sum_{i=1}^{n}(Y_i - \hat{Y_i})^2$ - -Where: - -- $\hat{Y}$ is our `y_predictions` tensor from question (3). -- $Y_i$ is the value of y for the ith value of `y_train`. -- $\hat{Y_i}$ is the value of y for the ith value of `y_predictions`. -+ -[TIP] -==== -You could simply sum the results of subtracting the `y_predictions` tensor from the `y_train` tensor, squared. You don't need any loop. -==== -+ -- p is the number of parameters in our model (3, the constant, the x, and the $x^2$). -- n is the number of observations in our data set (75). - -[TIP] -==== -The "diag" part of the formula indicates that we want the _diagonal_ of the resulting matrix. The diagonal of a given nxn matrix is the value at location (1,1), (2,2), (3,3), ..., (n,n). So, for instance, the diagonal of the following matrix is: 1, 5, 9 - -.matrix ----- -1,2,3 -4,5,6 -7,8,9 ----- - -In `pytorch`, you can get this using `torch.diag(x)`, where x is the matrix you want the diagonal of. - -[source,python] ----- -test = torch.tensor([1,2,3,4,5,6,7,8,9]).reshape(3,3) -torch.diag(test) ----- -==== - -[TIP] -==== -You can use `torch.sum` to sum up the values in a tensor. -==== - -[TIP] -==== -The value for MSE should be 135.5434. - -The first 5 values of the `upper` confidence interval are: - -.upper ----- -tensor([[171.3263], - [ 91.9131], - [ 83.3474], - [ 63.8171], - [ 63.0524]], dtype=torch.float64) ----- - -The first 5 values of the `lower` confidence interval are: - -.lower ----- -tensor([[140.6660], - [ 76.2350], - [ 69.1461], - [ 52.7601], - [ 52.1101]], dtype=torch.float64) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Create a scatterplot of `x_train` on the x-axis, and `y_predictions` on the y-axis. Add the confidence intervals to the plot. - -Great! It is unsurprising that our model is a great fit. - -[TIP] -==== -See https://matplotlib.org/3.5.1/api/_as_gen/matplotlib.pyplot.fill_between.html[here] for the documentation on `fill_between`. This function can be used to shade from the lower to upper confidence bounds. Use this function after you've https://matplotlib.org/3.5.1/api/_as_gen/matplotlib.pyplot.plot.html[plotted] your values of x (`x_mat[:, 1]`) on the x-axis and values of `y_predictions` on your y-axis. -==== - -[NOTE] -==== -In this project, we explored a well known model using simulated data from a known distribution. It is pretty boring, but boring can also make things a bit easier to understand. - -To give a bit of perspective, this project focused on tensor operations so you could get used to `pytorch`. The power of `pytorch` starts to really show itself when the problems do not have a closed form solution. 
In the _next_ project, we will use an algorithm called gradient descent to estimate our parameters (instead of using the closed form solutions). Since gradient descent, and algorithms like it are used frequently, it will give you a good sense on _why_ `pytorch` is useful. In addition, because we solved this problem using the closed form solutions, we will be able to easily verify that our work in the _next_ project is working as intended! - -Lastly, in more complex situations, you may not have formulas to calculate confidence intervals and other uncertaintly quantification measures. We will use SLURM in combination with `pytorch` to resample our data and calculate point estimates, which can then be used to understand the variability. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project09.adoc b/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project09.adoc deleted file mode 100644 index b4df05d60..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project09.adoc +++ /dev/null @@ -1,369 +0,0 @@ -= TDM 30200: Project 9 -- 2023 -:page-mathjax: true - -**Motivation:** Machine learning and AI are huge buzzwords in industry, and two of the most popular tools surrounding said topics are the `pytorch` and `tensorflow` libraries — `JAX` is another tool by Google growing in popularity. These tools are libraries used to build and use complex models. If available, they can take advantage of GPUs to speed up parallelizable code by a hundred or even thousand fold. - -**Context:** This is the second in a series of 4-5 projects focused on pytorch (and perhaps JAX). The purpose of these projects is to give you exposure to these tools, some basic functionality, and to show why they are useful, without needing any special math or statistics background. - -**Scope:** Python, pytorch - -.Learning Objectives -**** -- Demystify a "tensor". -- Utilize the `pytorch` API to create, modify, and operate on tensors. -- Use simple, simulated data to create a multiple linear regression model using closed form solutions. -- Use `pytorch` to calculate a popular uncertainty quantification, the 95% confidence interval. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/sim/train.csv` -- `/anvil/projects/tdm/data/sim/test.csv` - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -[WARNING] -==== -If you did not attempt the previous project, some of the novelty of `pytorch` may be lost. The following is a note found at the end of question 5 from the previous project. 
- -In this project, we explored a well known model using simulated data from a known distribution. It was pretty boring, but boring can also make things a bit easier to understand. - -To give a bit of perspective, the previous project focused on tensor operations so you could get used to `pytorch`. The power of `pytorch` really starts to show itself when the problem you are facing does not have a closed form solution. In _this_ project, we will use an algorithm called gradient descent to estimate our parameters (instead of using the closed form solutions). Since gradient descent is an algorithm and not a technique that offers a simple closed form solutions, and algorithms like gradient descent are used frequently, this project will _hopefully_ give you a good sense on _why_ `pytorch` is useful. In addition, since we fit a regression model using a closed form solution in the previous project, we will be able to easily verify that our work in _this_ project is working as intended! - -Lastly, in more complex situations, you may not have formulas to calculate confidence intervals and other uncertainty quantification measures. In the _next_ project, we will use SLURM in combination with `pytorch` to re-sample our data and calculate point estimates, which can then be used to understand the variability. -==== - -[NOTE] -==== -This project will _show_ more calculus than you need to know or understand for this course. It is included for those who are interested, and so the reader can see "oh my, that is a lot of work we are avoiding!". Don't worry _at all_, is is not necessary to understand for this course. -==== - -Start by reading in your `train.csv` data into tensors called `x_train` and `y_train`. - -[source,python] ----- -import pandas as pd -import torch - -dat = pd.read_csv("/anvil/projects/tdm/data/sim/train.csv") -x_train = torch.tensor(dat['x'].to_numpy()) -y_train = torch.tensor(dat['y'].to_numpy()) ----- - -In the previous project, we estimated the parameters of our regression model using a closed form solution. What does this do? At the heart of the regression model, we are _minimizing_ our _loss_. Typically, this _loss_ is the mean squared error (MSE). The formula for MSE is: - -$MSE = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \hat{Y_i})^2$ - -[NOTE] -==== -You can think of MSE as the difference between the actual y values (from the training data) and the y values our model predicts, squared, summed, and then divided by $n$, or the number of observations. Larger differences, say a difference of 10, is given a stronger penalty (100) than say, a difference of 5 (25). In this way, MSE as the loss function, tries to make the _overall_ predictions good. -==== - -Using our closed form solution formulas, we can calculate the parameters such that the MSE is minimized over the entirety of our training data. This time, we will use gradient descent to iteratively calculate our parameter estimates! - -By plotting our data, we can see that our data is parabolic and follows the general form: - -$y = \beta_{0} + \beta_{1} x + \beta_{2} x^{2}$ - -If we substitute this into our formula for MSE, we get: - -$MSE = \frac{1}{n} \sum_{i=1}^{n} ( Y_{i} - ( \beta_{0} + \beta_{1} x_{i} + \beta_{2} x_{i}^{2} ) )^{2} = \frac{1}{n} \sum_{i=1}^{n} ( Y_{i} - \beta_{0} - \beta_{1} x_{i} - \beta_{2} x_{i}^{2} )^{2}$ - -The first step in gradient descent is to calculate the partial derivatives with respect to each of our parameters: $\beta_0$, $\beta_1$, and $\beta_2$. 
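[NOTE]
====
Before working through that calculus, it may help to see the loss computed directly. The following is a minimal, optional sketch -- the parameter guesses are arbitrary -- that evaluates the MSE for one set of guessed parameters using the `x_train` and `y_train` tensors created above.

[source,python]
----
# arbitrary guesses for the three parameters -- any starting values work here
beta0, beta1, beta2 = 5.0, 4.0, 3.0

# model predictions for every x in the training data
y_predictions = beta0 + beta1*x_train + beta2*x_train**2

# mean squared error for these guesses
mse = ((y_train - y_predictions)**2).mean()
print(mse)
----
====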
These partial derivatives will let us know the _slope_ of the tangent line for a given parameter at its current value. We can then _use_ this slope to adjust the parameter, and eventually reach a parameter value that minimizes our _loss_ function. Here is the calculus. Writing the model as $\hat{y_i} = \beta_0 + \beta_1 x_i + \beta_2 x_i^2$, the chain rule gives, for each parameter $\beta_j$:

$\frac{\partial MSE}{\partial \beta_j} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial (y_i - \hat{y_i})^2}{\partial \hat{y_i}} * \frac{\partial \hat{y_i}}{\partial \beta_j}$

$\frac{\partial (y_i - \hat{y_i})^2}{\partial \hat{y_i}} = -2(y_i - \hat{y_i}) = -2(y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2)$

$\frac{\partial \hat{y_i}}{\partial \beta_0} = 1, \quad \frac{\partial \hat{y_i}}{\partial \beta_1} = x_i, \quad \frac{\partial \hat{y_i}}{\partial \beta_2} = x_i^2$

If we clean things up a bit, we can see that the partial derivatives are:

$\frac{\partial MSE}{\partial \beta_0} = \frac{-2}{n}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2) = \frac{-2}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})$

$\frac{\partial MSE}{\partial \beta_1} = \frac{-2}{n}\sum_{i=1}^{n}x_i(y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2) = \frac{-2}{n}\sum_{i=1}^{n}x_i(y_i - \hat{y_i})$

$\frac{\partial MSE}{\partial \beta_2} = \frac{-2}{n}\sum_{i=1}^{n}x_i^2(y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2) = \frac{-2}{n}\sum_{i=1}^{n}x_i^2(y_i - \hat{y_i})$

Pick 3 random values -- 1 for each parameter, $\beta_0$, $\beta_1$, and $\beta_2$. For consistency, let's try 5, 4, and 3, respectively. These values will be our random "guess" at the actual values of our parameters. Using those starting values, calculate the partial derivative for each parameter.

[TIP]
====
Start by calculating `y_predictions` using the formula: $\beta_0 + \beta_1x + \beta_2x^2$, where $x$ is your `x_train` tensor!
====

[TIP]
====
You should now have tensors `x_train`, `y_train`, and `y_predictions`. You can create another new tensor called `error` by subtracting `y_predictions` from `y_train`.
====

[TIP]
====
You can use your tensors and the `mean` method to (help) calculate each of these partial derivatives! Note that these values could vary from person to person depending on the random starting values you gave each of your parameters.
====

Okay, once you have your 3 partial derivatives, we can _update_ our 3 parameters using those values! Remember, each of those values is the _slope_ of the tangent line for the corresponding parameter at its current value. If _increasing_ a parameter value _increases_ our MSE, then we want to _decrease_ that parameter value, as this will _decrease_ our MSE. If _increasing_ a parameter value _decreases_ our MSE, then we want to _increase_ that parameter value, as this will _decrease_ our MSE. This can be represented, for example, by the following:

$\beta_0 = \beta_0 - \frac{\partial MSE}{\partial \beta_0}$

This will, however, potentially result in too big of a "jump" in our parameter value -- we may skip over the value of $\beta_0$ for which our MSE is minimized (this is no good). In order to "fix" this, we introduce a "learning rate", often shown as $\eta$.
This learning rate can be tweaked to either ensure we don't make too big of a "jump" by setting it to be small, or by making it a bit larger, increasing the speed at which we _converge_ to a value of $\beta_0$ for which our MSE is minimized, at the risk of having the issue of over jumping. - -$\beta_0 = \beta_0 - \eta \frac{\partial MSE}{\partial \beta_0}$ - -Update your 3 parameters (once) using a learning rate of $\eta = 0.0003$. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Woohoo! That was a _lot_ of work for what ended up being some pretty straightforward calculations. The previous question represented a single _epoch_. You can define the number of epochs yourself, the idea is that _hopefully_ after all of your epochs, the parameters will have converged, leaving your with the parameter estimates you can use to calculate predictions! - -Write code that runs 10000 epochs, updating your parameters as it goes. In addition, include code in your loops that prints out the MSE every 100th epoch. Remember, we are trying to _minimize_ our MSE -- so we would expect that the MSE _decreases_ each epoch. - -Print the final values of your parameters -- are the values close to the values you estimated in the previous project? - -In addition, approximately how many epochs did it take for the MSE to stop decreasing by a significant amount? Based on that result, do you think we could have run fewer epochs? - -[NOTE] -==== -Mess around with the starting values of your parameters, and the learning rate. You will quickly notice that bad starting values can result in final results that are not very good. A learning rate that is too large will diverge, resulting in `nan`. A learning rate that is too small won't learn fast enough resulting in parameter values that aren't accurate. - -The learning rate is a hyperparameter -- a parameter that is chosen _before_ the training process begins. The number of epochs is also a hyperparameter. Choosing good hyperparameters can be critical, and there are a variety of methods to help "tune" hyperparameters. For this project, we know that these values work well. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -You may be wondering think at this point that `pytorch` has been pretty worthless, and it still doesn't make any sense how this simplifies anything. There was too much math, and we still performed a bunch of vector/tensor/matrix operations -- what gives? Well, while this is all true, we haven't utilized `pytorch` quite yet, but we are going to here soon. - -First, let's cover some common terminology you may run across. In each epoch, when we calculate the newest predictions for our most up-to-date parameter values, we are performing the _forward pass_. - -There is a similarly named _backward pass_ that refers (roughly) to the step where the partial derivatives are calculated! Great. - -`pytorch` can perform the _backward pass_ for you, automatically, from our MSE. For example, see the following. - -[source,python] ----- -mse = (error**2).mean() -mse.backward() ----- - -Try it yourself! - -[TIP] -==== -If you get an error: - -.error ----- -RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn ----- - -This is likely due to the fact that your starting values aren't tensors! Instead, use tensors. 
- -[source,python] ----- -beta0 = torch.tensor(5) -beta1 = torch.tensor(4) -beta2 = torch.tensor(3) ----- - -What? We _still_ get that error. In order for the `backward` method to work, and _automatically_ (yay!) calculate our partial derivatives, we need to make sure that our starting value tensors are set to be able to store the partial derivatives. We can do this very easily by setting the `requires_grad=True` option when creating the tensors. - -[source,python] ----- -beta0 = torch.tensor(5, requires_grad=True) -beta1 = torch.tensor(4, requires_grad=True) -beta2 = torch.tensor(3, requires_grad=True) ----- - -You probably got the following error now. - -.error ----- -RuntimeError: Only Tensors of floating point and complex dtype can require gradients ----- - -Well, let's set the dtype to be `torch.float` and see if that does the trick, then. - -[source,python] ----- -beta0 = torch.tensor(5, requires_grad=True, dtype=torch.float) -beta1 = torch.tensor(4, requires_grad=True, dtype=torch.float) -beta2 = torch.tensor(3, requires_grad=True, dtype=torch.float) ----- - -Great! Unfortunately, after you try to run your epochs, you will likely get the following error. - -.error ----- -TypeError: unsupported operand type(s) for *: 'float' and 'NoneType' ----- - -This is because your `beta0.grad`, `beta1.grad` are None -- why? The partial derivatives (or gradients) are stored in the `beta0`, `beta1`, and `beta2` tensors. If you performed a parameter update as follows. - -[source,python] ----- -beta0 = beta0 - learning_rate * beta0.grad ----- - -The _new_ `beta0` object will have _lost_ the partial derivative information, and the `beta0.grad` will be `None`, causing the error. How do we get around this? We can use a Python _inplace_ operation. An _inplace_ operation will actually _update_ our _original_ `beta0` (_with_ the gradients already saved), instead of creating a brand new `beta0` that loses the gradient. You've probably already seen examples of this in the wild. - -[source,python] ----- -# these are equivalent -a = a - b -a -= b - -# or -a = a * b -a *= b - -# or -a = a + b -a += b - -# etc... ----- - -At this point in time, you are probably _once again_ getting the following error. - -.error ----- -RuntimeError: a leaf Variable that requires grad is being used in an in-place operation. ----- - -This too is an easy fix, simply wrap your update lines in a `with torch.no_grad():` block. - -[source,python] ----- -with torch.no_grad(): - beta0 -= ... - beta1 -= ... - beta2 -= ... ----- - -Woohoo! Finally! But... you may notice (if you are printing your MSE) that the MSE is all over the place and not decreasing like we would expect. This is because the gradients are summed up each iteration unless your clear the gradient out! For example, if during the first epoch the gradient is 603, and the next epoch it is -773. If you do _not_ zero out the gradient, your new gradient after the second epoch will be -169, when we really want -773. To fix _this_, use the `zero_` method from the `grad` attribute. Zero out _all_ of your gradients at the end of each epoch and try again. - -[source,python] ----- -beta0.grad.zero_() ----- - -Finally! It should all be looking good right now. Okay, so `pytorch` is quite particular, _but_ the power of the automatic differentiation can't be overstated. -==== - -[IMPORTANT] -==== -Make sure and make a post on Piazza if you'd like some extra help or think there is a question that could use more attention. -==== - -.Items to submit -==== -- Code used to solve this problem. 
-- Output from running the code. -==== - -=== Question 4 - -Whoa! That is crazy powerful! That _greatly_ reduces the amount of work we need to do. We didn't use our partial derivative formulas anywhere, how cool! - -But wait, there's more! You know that step where we update our parameters at the end of each epoch? Think about a scenario where, instead of simply 3 parameters, we had 1000 parameters to update. That would involve a linear increase in the number of lines of code we would need to write -- instead of just 3 lines of code to update our 3 parameters, we would need 1000! Not something most folks are interested in doing. `pytorch` to the rescue. - -We can use an _optimizer_ to perform the parameter updates, all at once! Update your code to utilize an optimizer to perform the parameter updates. - -There are https://pytorch.org/docs/stable/optim.html[a variety] of different optimizers available. For this project, let's use the `SGD` optimizer. You can see the following example, directly from the linked webpage. - -[source,python] ----- -optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9) -optimizer.zero_grad() -loss_fn(model(input), target).backward() -optimizer.step() ----- - -Here, you can just focus on the following lines. - -[source,python] ----- -optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9) -optimizer.step() ----- - -The first line is the initialization of the optimizer. Here, you really just need to pass our initialized paramters (the betas) as a list to the first argument to `optim.SGD`. The second argument, `lr`, should just be our learning rate (`0.0003`). - -Then, the second line replaces the code where the three parameters are updated. - -[NOTE] -==== -You will no longer need the `with torch.no_grad()` block at all! This completely replaces that code. -==== - -[TIP] -==== -In addition, you can use the optimizer to clear out the gradients as well! Replace the `zero_` methods with the `zero_grad` method of the optimizer. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -You are probably starting to notice how `pytorch` can _really_ simplify things. But wait, there's more! - -In each epoch, you are still calculating the loss manually. Not a huge deal, but it could be a lot of work, and MSE is not the _only_ type of loss function. Use `pytorch` to create your MSE loss function, and use it instead of your manual calculation. - -You can find `torch.nn.MSELoss` documentation https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss[here]. Use the option `reduction='mean'` to get the mean MSE loss. Once you've created your loss function, simply pass your `y_train` as the first argument and your `y_predictions` as the second argument. Very cool! This has been a lot to work on -- the main takeaways here should be that `pytorch` has the capability of greatly simplifying code (and calculus!) like the code used for the gradient descent algorithm. At the same time, `pytorch` is particular, the error messages aren't extremely clear, and it definitely involves a learning curve. - -We've barely scraped the surface of `pytorch` -- there is (always) a _lot_ more to learn! In the next project, we will provide you with the opportunity to utilize a GPU to speed up calculations, and SLURM to parallelize some costly calculations. 
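[TIP]
====
If it helps to see how the pieces fit together, here is a rough sketch -- one possible arrangement, not the only one -- of a training loop that uses both the `SGD` optimizer from question 4 and the built-in loss function from this question. It assumes the `x_train` and `y_train` tensors from question 1.

[source,python]
----
import torch

# float tensors that track gradients, as in question 3
beta0 = torch.tensor(5, requires_grad=True, dtype=torch.float)
beta1 = torch.tensor(4, requires_grad=True, dtype=torch.float)
beta2 = torch.tensor(3, requires_grad=True, dtype=torch.float)

learning_rate = 0.0003
num_epochs = 10000

# the optimizer performs the parameter updates (question 4)
optimizer = torch.optim.SGD([beta0, beta1, beta2], lr=learning_rate)

# the built-in loss function replaces the manual MSE calculation (question 5)
mseloss = torch.nn.MSELoss(reduction='mean')

for epoch in range(num_epochs):
    # forward pass -- predictions for the current parameter values
    y_predictions = beta0 + beta1*x_train + beta2*x_train**2

    # loss for this epoch
    mse = mseloss(y_train, y_predictions)

    if epoch % 100 == 0:
        print(f"MSE: {mse}")

    # backward pass -- fills in the .grad attribute of each parameter
    mse.backward()

    # update all parameters at once, then clear the gradients for the next epoch
    optimizer.step()
    optimizer.zero_grad()
----
====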
- -[NOTE] -==== -In the next project we will use `pytorch` to build a model to simplify our code even more, in addition, we will incorporate SLURM and use a GPU to train our model. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project10.adoc b/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project10.adoc deleted file mode 100644 index 6d5c9b9d1..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project10.adoc +++ /dev/null @@ -1,383 +0,0 @@ -= TDM 30200: Project 10 -- 2023 - -**Motivation:** In this project, we will utilize SLURM for a couple of purposes. The first is to have the chance to utilize a GPU on the cluster for some `pytorch` work, and the second is to use resampling to get point estimates. We can then use those point estimates to make a confidence interval and gain a better understand of the variability of our model. - -**Context:** This is the fourth of a series of 4 projects focused on using SLURM. This project is also an interlude to a series of projects on `pytorch` and `JAX`. We will use `pytorch` for our calculations. - -**Scope:** SLURM, unix, bash, `pytorch`, Python - -.Learning Objectives -**** -- Demystify a "tensor". -- Utilize the pytorch API to create, modify, and operate on tensors. -- Use simple, simulated data to create a multiple linear regression model using closed form solutions. -- Use pytorch to calculate a popular uncertainty quantification, the 95% confidence interval. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/sim/train.csv` -- `/anvil/projects/tdm/data/sim/test.csv` -- `/anvil/projects/tdm/data/sim/train100k.csv` -- `/anvil/projects/tdm/data/sim/train10m.csv` - -== Questions - -[WARNING] -==== -You do not want to wait until the end of the week to do part 1 of this project. Part 1 is pretty straightforward, and basically just requires running code that you've already written a variety of times. There is limited GPU access, so this is the constraint and reason you should attempt to run through part 1 earlier, rather than later. -==== - -[NOTE] -==== -This project is broken into two parts. In part 1, we will use `pytorch` and build our model using cpus and gpus, and draw comparisons. Models will be built using datasets of differing sizes. The goal of part 1 is to see how a GPU _can_ make a large impact on training time. Note that these datasets are synthetic data and don't really represent a realistic scenario, but they _do_ work well to illustrate how powerful GPUs are. - -Part 2 is a continuation from the previous project. In the previous project, you used `pytorch` to perform a gradient descent and build a model for our small, simulated dataset. 
While it is certainly possible to use other methods to get some form of uncertainty quantification (in our case, we are specifically looking at a 95% confidence interval for our predictions), it is not always easy to do so, or possible. One of the most common methods to calculate these things, in these difficult situations is bootstrapping. In fact, Dr. Andrew Gelman, a world-class statistician, had this as his second item in his https://arxiv.org/pdf/2012.00174.pdf[list of the top 50 influential statistical ideas in the past 50 years]. We will use SLURM to perform this computationally intensive, but relatively simple method. -==== - -=== Part 1 - -[WARNING] -==== -You should all have been granted access to our GPU allocation. If you try to use the GPU allocation and run into issues, please create a post in Piazza and make sure you include your Anvil username. To find your Anvil username, you can run the following in a terminal inside your Jupyter Notebook: - -[source,bash] ----- -echo $USER ----- -==== - -[IMPORTANT] -==== -This question should be completed our GPU allocation, since our regular allocation does not have access to GPUs. - -To launch the Jupyter Lab instance using our GPU allocation, use the typical Jupyter Notebook option at https://ondemand.anvil.rcac.purdue.edu. However, instead of using the default options, use the following: - -- Allocation: cis220051-gpu -- Queue: gpu -- Time in Hours: 1 -- Cores: 4 -- Use Jupyter Lab instead of Jupyter Notebook (checked) - -To confirm you have access to the GPU you can use the following code. Note that you only really need one of these, but I am showing them all because they _may_ be interesting to you. - -[source,python] ----- -import torch - -# see if cuda is available -torch.cuda.is_available() - -# see the current device -torch.cuda.current_device() - -# see the number of devices available -torch.cuda.device_count() - -# get the name of a device -torch.cuda.get_device_name(0) ----- -==== - -For this question you will use `pytorch` with cpus (like in the previous project) to build a model for `train.csv`, `train100k.csv`, and `train10m.csv`. Use the `%%time` Jupyter magic to time the calculation for each dataset. - -[TIP] -==== -The following is the code from the previous project that you can use to get started. 
- -[source,python] ----- -import torch -import pandas as pd - -dat = pd.read_csv("/anvil/projects/tdm/data/sim/train.csv") -x_train = torch.tensor(dat['x'].to_numpy()) -y_train = torch.tensor(dat['y'].to_numpy()) - -beta0 = torch.tensor(5, requires_grad=True, dtype=torch.float) -beta1 = torch.tensor(4, requires_grad=True, dtype=torch.float) -beta2 = torch.tensor(3, requires_grad=True, dtype=torch.float) -learning_rate = .0003 - -num_epochs = 10000 -optimizer = torch.optim.SGD([beta0, beta1, beta2], lr=learning_rate) -mseloss = torch.nn.MSELoss(reduction='mean') - -for idx in range(num_epochs): - # calculate the predictions / forward pass - y_predictions = beta0 + beta1*x_train + beta2*x_train**2 - - # calculate the MSE - mse = mseloss(y_train, y_predictions) - - if idx % 100 == 0: - print(f"MSE: {mse}") - - # calculate the partial derivatives / backwards step - mse.backward() - - # update our parameters - optimizer.step() - - # zero out the gradients - optimizer.zero_grad() - -print(f"beta0: {beta0}") -print(f"beta1: {beta1}") -print(f"beta2: {beta2}") ----- -==== - -[IMPORTANT] -==== -For `train10m.csv`, instead of running the entire 10k epochs, just perform 100 epochs, and estimate the amount of time it would take to complete 10k epochs. We _try_ not to be _that_ mean, although, if you _do_ want to wait and see, that is perfectly fine. -==== - -Modify your code to use a gpu instead of cpus, and time the time it takes to train the model using `train.csv`, `train100k.csv`, and `train10m.csv`. What percentage faster is the GPU calculations for each dataset? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Time it took to build the model for the `train.csv` and `train100k.csv` using cpus. In addition, the estimated time it would take to build the model for `train10m.csv`, again, using cpus. -- Time it took to build the model for the `train.csv`, `train100k.csv`, and `train10m.csv`, using gpus. -- What percentage faster (or slower) the GPU version is vs the CPU version for each dataset. -==== - -=== Part 2 - -[IMPORTANT] -==== -You can now save your notebook, and switch back to using the regular `cis220051` allocation -- don't forget to also change the queue to "shared". **Be careful not to overwrite your output from part 1.** -==== - -We've provided you with a Python script called `bootstrap_samples.py` that accepts a single value, for example 10, and runs the code you wrote in the previous project 10 times. This code should have a few modifications. One major, but simple modification is that rather than using our training data to build the model, instead, sample the same number of values in our `x_train` tensor _from_ our `x_train` tensor, _with_ replacement. What this means is if our `x_train` contained 1,2,3, we could produce any of the following samples 1,2,3 or 1,1,2 or 1,2,2 or 3,3,3 etc. We called these resampled values `xr_train`. Then proceed as normal, building your model using `xr_train` instead of `x_train`. - -In addition at the end of the script, we used your model to get predictions for all of the values in `x_test`. Save these predictions to a parquet file, for example, `0cd68e5e-134d-4575-a31d-2060644f4caa.parquet`, in a safe location, for example `$SCRATCH/p10output/`. Each file will each contain a single set of point estimates for our predictions. 
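[NOTE]
====
If "sampling with replacement" is unfamiliar, the short sketch below illustrates the idea on a tiny, made-up tensor -- repeated indices are allowed, so some values can show up more than once while others are left out entirely. The provided script below does the same thing with `random.choices` on the real `x_train`.

[source,python]
----
import random
import torch

# tiny stand-in for x_train, just for illustration
x_train = torch.tensor([1.0, 2.0, 3.0])

# draw len(x_train) indices *with* replacement -- repeats are allowed
idxs = random.choices(range(len(x_train)), k=len(x_train))

# e.g. idxs could be [0, 0, 2], giving xr_train = tensor([1., 1., 3.])
xr_train = x_train[idxs]
print(xr_train)
----
====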
- -.bootstrap_samples.py -[source,python] ----- -import sys -import argparse -import pandas as pd -import random -import torch -from pathlib import Path -import uuid - - -class Regression(torch.nn.Module): - def __init__(self): - super().__init__() - self.beta0 = torch.nn.Parameter(torch.tensor(5, requires_grad=True, dtype=torch.float)) - self.beta1 = torch.nn.Parameter(torch.tensor(4, requires_grad=True, dtype=torch.float)) - self.beta2 = torch.nn.Parameter(torch.tensor(3, requires_grad=True, dtype=torch.float)) - - def forward(self, x): - return self.beta0 + self.beta1*x + self.beta2*x**2 - - -def get_point_estimates(x_train, y_train, x_test): - - model = Regression() - learning_rate = .0003 - - num_epochs = 10000 - optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate) - mseloss = torch.nn.MSELoss(reduction='mean') - - # resample data - resampled_idxs = random.choices(range(75), k=75) - xr_train = torch.tensor(x_train[resampled_idxs], requires_grad=True, dtype=torch.float).reshape(75) - - for _ in range(num_epochs): - # set to training mode -- note this does not _train_ anything - model.train() - - # calculate the predictions / forward pass - y_predictions = model(xr_train) - - # calculate the MSE - mse = mseloss(y_train[resampled_idxs], y_predictions) - - # calculate the partial derivatives / backwards step - mse.backward() - - # update our parameters - optimizer.step() - - # zero out the gradients - optimizer.zero_grad() - - # get predictions - predictions = pd.DataFrame(data={"predictions": model(x_test).detach().numpy()}) - - return(predictions) - - -def main(): - parser = argparse.ArgumentParser() - subparsers = parser.add_subparsers(help="possible commands", dest="command") - bootstrap_parser = subparsers.add_parser("bootstrap", help="") - bootstrap_parser.add_argument("n", type=int, help="number of set of point estimates for predictions to output") - bootstrap_parser.add_argument("-o", "--output", help="directory to output file(s) to") - - if len(sys.argv) == 1: - parser.print_help() - sys.exit(1) - - args = parser.parse_args() - - if args.command == "bootstrap": - - dat = pd.read_csv("/anvil/projects/tdm/data/sim/train.csv") - x_train = torch.tensor(dat['x'].to_numpy(), dtype=torch.float) - y_train = torch.tensor(dat['y'].to_numpy(), dtype=torch.float) - - dat = pd.read_csv("/anvil/projects/tdm/data/sim/test.csv") - x_test = torch.tensor(dat['x'].to_numpy(), dtype=torch.float) - - for _ in range(args.n): - estimates = get_point_estimates(x_train, y_train, x_test) - estimates.to_parquet(f"{Path(args.output) / str(uuid.uuid4())}.parquet") - -if __name__ == "__main__": - main() ----- - -[IMPORTANT] -==== -Make sure your `p10output` directory exists! - -[source,bash] ----- -mkdir -p $SCRATCH/p10output ----- -==== - -[TIP] -==== -You can use the script like the following, in order to create 10 sets of point estimates: - -[source,bash] ----- -singularity exec /anvil/projects/tdm/apps/containers/images/python:f2022-s2023.sif python3 /path/to/bootstrap_samples.py bootstrap 10 --output /anvil/scratch/USERNAME/p10output/ ----- - -Make sure the `p10output` directory exists first! Also, replace `USERNAME` with your Anvil username. -==== - -Next, create your job script. Let's call this `p10_job.sh`. You can use the following code. We would highly recommend using 10 cores to generate a total of 2000 sets of point estimates. The total runtime will vary but should be anywhere from 5 to 15 minutes. 
- -.p10_job.sh -[source,bash] ----- -#!/bin/bash -#SBATCH --account=cis220051 # Queue -#SBATCH --partition=shared -#SBATCH --job-name=kevinsjob # Job name -#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL) -#SBATCH --mail-user=kamstut@purdue.edu # Where to send mail -#SBATCH --time=00:30:00 -#SBATCH --ntasks=10 # Number of tasks (total) -#SBATCH -o /dev/null # Output to dev null -#SBATCH -e /dev/null # Error to dev null - -for((i=0; i < 10; i+=1)) -do - srun -A cis220051 -p shared --exact -n 1 -c 1 singularity exec /anvil/projects/tdm/apps/containers/images/python:f2022-s2023.sif python3 $HOME/bootstrap_samples.py bootstrap 200 --output $SCRATCH/p10output/ & -done - -wait ----- - -[TIP] -==== -You won't need any of that array stuff anymore since we don't have to keep track of the files we're working with. -==== - -[IMPORTANT] -==== -Make sure both `bootstrap_samples.py` and `p10_job.sh` have execute permissions. - -[source,bash] ----- -chmod +x /path/to/bootstrap_samples.py -chmod +x /path/to/p10_job.sh ----- -==== - -Submit your job using `sbatch p10_job.sh`. - -[WARNING] -==== -Make sure to clear out the SLURM environment variables if you choose to run the `sbatch` command from within a bash cell in your notebook. - -[source,bash] ----- -for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done; ----- -==== - -Great! Now you have a directory `$SCRATCH/p10output/` that contains 2000 sets of point estimates. Your job is now to process this data to create a graphic showing: - -. The _actual_ `y_test` values (in blue) as a set of points (using `plt.scatter`). -. The predictions as a line. -. The confidence intervals as a shaded region. (You can use `plt.fill_between`). - -The 95% confidence interval is simply the 97.5th percentile of each prediction's point estimates (upper) and the 2.5th percentile of each prediction's point estimates (lower). - -[TIP] -==== -You can import via: - -[source,python] ----- -import matplotlib.pyplot as plt ----- -==== - -[IMPORTANT] -==== -You will need to run the algorithm to get your predictions using the non-resampled training data -- otherwise you won't have the predictions to plot! -==== - -[TIP] -==== -You will notice that some of your point estimates will be NaN. Resampling can cause your model to no longer converge unless you change the learning rate. Remove the NaN values, you should be left with around 1500 sets of point estimates that you can use. -==== - -[TIP] -==== -You can loop through the output files by doing something like: - -[source,python] ----- -from pathlib import Path - -for file in Path("/anvil/scratch/USERNAME/p10output/").glob("*.parquet"): - pass ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- 2-3 sentences explaining the "other" changes in the provided script. -- 1-2 sentences describing your opinion of the changes. -- `p10_job.sh`. -- Your resulting graphic -- make sure it renders properly when viewed in Gradescope. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project11.adoc b/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project11.adoc deleted file mode 100644 index 5abea5782..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project11.adoc +++ /dev/null @@ -1,409 +0,0 @@ -= TDM 30200: Project 11 -- 2023 - -**Motivation:** Machine learning and AI are huge buzzwords in industry, and two of the most popular tools surrounding said topics are the `pytorch` and `tensorflow` libraries — `JAX` is another tool by Google growing in popularity. These tools are libraries used to build and use complex models. If available, they can take advantage of GPUs to speed up parallelizable code by a hundred or even thousand fold. - -**Context:** This is the third of a series of 4 projects focused on using `pytorch` and `JAX` to solve numeric problems. - -**Scope:** Python, JAX - -.Learning Objectives -**** -- Compare and contrast `pytorch` and `JAX`. -- Differentiate functions using `JAX`. -- Understand what "JIT" is and why it is useful. -- Understand when a value or operation should be static vs. traced. -- Vectorize functions using the `vmap` function from `JAX`. -- How do random number generators work in `JAX`? -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/sim/train.csv` - -== Questions - -=== Question 1 - -`JAX` is a library for high performance computing. It falls into the same category as other popular packages like: `numpy`, `pytorch`, and `tensorflow`. `JAX` is a product of Google / Deepmind that takes a completely different approach than their other product, `tensorflow`. - -Like the the other popular libraries, `JAX` can utilize GPUs/TPUs to greatly speed up computation. Let's take a look. - -Here is a snippet of code from previous projects that uses `pytorch` and calculates predictions 10000 times. - -[NOTE] -==== -Of course, this is the same calculation since our betas aren't being updated yet, but just bear with me. -==== - -[source,python] ----- -import pandas as pd -import torch -import jax -import jax.numpy as jnp - -dat = pd.read_csv("/anvil/projects/tdm/data/sim/train.csv") ----- - -[source,python] ----- -%%time - -x_train = torch.tensor(dat['x'].to_numpy()) -y_train = torch.tensor(dat['y'].to_numpy()) - -beta0 = torch.tensor(5, requires_grad=True, dtype=torch.float) -beta1 = torch.tensor(4, requires_grad=True, dtype=torch.float) -beta2 = torch.tensor(3, requires_grad=True, dtype=torch.float) - -num_epochs = 10000 - -for idx in range(num_epochs): - - y_predictions = beta0 + beta1*x_train + beta2*x_train**2 ----- - -Approximately how much time does it take to run this second chunk of code (after we have already read in our data)? - -Here is the equivalent `JAX` code: - -[source,python] ----- -%%time - -x_train = jnp.array(dat['x'].to_numpy()) -y_train = jnp.array(dat['y'].to_numpy()) - -beta0 = 5 -beta1 = 4 -beta2 = 3 - -num_epochs = 10000 - -for idx in range(num_epochs): - - y_predictions = beta0 + beta1*x_train + beta2*x_train**2 ----- - -How much time does this take? - -At this point in time you may be questioning how `JAX` could possibly be worth it. 
At first glance, the new code _does_ look a bit cleaner, but not clean enough to use code that is around 3 times slower. - -This is where `JAX` first trick, or _transformation_ comes in to play. When we refer to _transformation_, think of it as an operation on some function that produces another function as an output. - -The first _transformation_ we will talk about is `jax.jit`. "JIT" stands for "Just In Time" and refers to a "Just in time" compiler. Essentially, just in time compilation is a trick that can be used to _greatly_ speed up the execution of _some_ code by compiling the code. In a nutshell, the compiled version of the code has a wide variety of optimizations that speed your code up. - -Lots of our computation time is spent inside our loop, specifically when we are calculating our `y_predictions`. Let's see if we can use the jit transformation to speed up our `JAX` code with little to no extra effort. - -Write a function called `model` that accepts two arguments. The first argument is a tuple containing our parameters: `beta0`, `beta1`, and `beta2`. The second is our _input_ to our function (our x values) called `x`. `model` should then _unpack_ our tuple of parameters into `beta0`, `beta1`, and `beta2`, and then return predictions (the same formula shown above, twice). Replace the code as follows. - -[source,python] ----- -# replace this line -y_predictions = beta0 + beta1*x_train + beta2*x_train**2 - -# with -y_predictions = model((beta0, beta1, beta2), x_train) ----- - -Run and time the code again. No difference? Well, we didn't use our jit transformation yet! Using the transformation is easy. `JAX` provides two equivalent ways. You can either _decorate_ your `model` function with the `@jax.jit` https://realpython.com/primer-on-python-decorators/[decorator], or simply apply the transformation to your function and save the new, jit compiled function and use _it_ instead. - -[source,python] ----- -def my_func(x): - return x**2 - -@jax.jit -def my_func_jit1(x): - return x**2 - -my_func_jit2 = jax.jit(my_func) ----- - -Re-run your code using the JIT transformation. Is it faster now? - -[NOTE] -==== -It is important to note that `pytorch` _does_ have some `jit` functionality, and there is also a package called `numba` which can help with this as well, however, it is not as straightforward to perform the same operation using either as it is using `JAX`. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -At this point in time you may be considering slapping `@jax.jit` on all your functions -- unfortunately it is not quite so simple! First of all, the previous comparison was actually not fair at all. Why? `JAX` has asynchronous dispatch by default. What this means is that, by default, `JAX` will return control to Python as soon as possible, even if it is _before_ the function has been fully evaluated. - -What does this mean? It means that our finished example from question 1 may be returning a not-yet-complete result, greatly throwing off our performance measurements. So how can we _synchronously_ wait for execution to finish? This is easy, simply use the `block_until_ready` method built in to your jit compiled `model` function. 
- -[source,python] ----- -def my_func(x): - return x**2 - -@jax.jit -def my_func_jit1(x): - return x**2 - -my_func_jit2 = jax.jit(my_func) - -my_func_jit1.block_until_ready() - -# or - -my_func_jit2.block_until_ready() ----- - -Re-run your code from before -- you should find that the results are unchanged, it turns out that really _was_ a serious speedup from before. Great. Let's move on from this part of things. Back to our question. Why can't we just slap `@jax.jit` on any function and expect a speedup? - -Take the following function. - -[source,python] ----- -def train(params, x, y, epochs): - def _model(params, x): - beta0, beta1, beta2 = params - return beta0 + beta1*x + beta2*x**2 - - mses = [] - for _ in range(epochs): - y_predictions = _model(params, x_train) - mse = jnp.sum((y_predictions - y)**2) - -fast_train = jax.jit(train) - -fast_train((beta0, beta1, beta2), x_train, y_train, 10000) ----- - -If you try running it you will get an error saying something along the lines of "TracerIntegerConversionError". The problem with this function, and why it cannot be jit compiled, is the `epochs` argument. By default, `JAX` tries to "trace" the parameters to determine its effect on inputs of a specific shape and type. Control flow cannot depend on traced values -- in this case, `epochs` is relied on in order to determine how many times to loop. In addition, the _shapes_ of all input and output values of a function must be able to be determined ahead of time. - -How do we fix this? Well, it is not always possible, however, we _can_ choose to select parameters to be _static_ or not traced. If a parameter is marked as static, or not traced, it can be JIT compiled. The catch is that any time a call to the function is made and the value of the static parameter is changed, the function will have to be recompiled with that new static value. So, this is only useful if you will only occasionally change the parameter. This sounds like our case! We only want to occasionally change the number of epochs, so perfect. - -You can mark a parameter as static by specifying the argument position using the `static_argnums` argument to `jax.jit`, or by specifying the argument _name_ using the `static_argnames` argument to `jax.jit`. - -Force the `epochs` argument to be static, and use the `jax.jit` decorator to compile the function. Test out the function, in order using the following code cells. - -[source,ipython] ----- -%%time - -fast_train((beta0, beta1, beta2), x_train, y_train, 10000) ----- - -[source,ipython] ----- -%%time - -fast_train((beta0, beta1, beta2), x_train, y_train, 10000) ----- - -[source,ipython] ----- -%%time - -fast_train((beta0, beta1, beta2), x_train, y_train, 9999) ----- - -Do your best to explain why the last code cell was once again slower. - -[TIP] -==== -If you aren't sure why, reread the question text -- we hint at the "catch" in the text. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -We learned that one of the coolest parts of the `pytorch` package was the automatic differentiation feature. It saves a _lot_ of time doing some calculus and coding up resulting equations. Recall that in `pytorch` this differentiation was baked into the `backward` method of our MSE. This is quite different from the way we think about the equations when looking at the math, and is quite confusing. - -`JAX` has the same functionality, but it is _much_ cleaner and easier to use. 
We will provide you with a simple example, and explain the math as we go along. - -Let's say our function is $f(x) = 2x^2$. We can start by writing a function. - -[source,python] ----- -def two_x_squared(x): - return 2*x**2 ----- - -Fantastic, so far pretty easy. - -The derivative w.r.t. `x` is $4x$. Doing this in `JAX` is as easy as applying the `jax.grad` _transformation_ to the function. - -[source,python] ----- -deriv = jax.grad(two_x_squared) ----- - -Okay, test out both functions as follows. - -[source,python] ----- -my_array = jnp.array([1.0, 2.0, 3.0]) - -two_x_squared(4.0) # 32.0 -two_x_squared(my_array) # [2.0, 8.0, 18.0] -deriv(4.0) # 16.0 -deriv(my_array) # uh oh! Something went wrong! ----- - -[IMPORTANT] -==== -A very perceptive student pointed out that we originally passed array values that were ints to `jax.grad`. This will fail. You can read more about why https://jax.readthedocs.io/en/latest/notebooks/Common_Gotchas_in_JAX.html#non-array-inputs-numpy-vs-jax[here]. -==== - -On the last line, you probably received a message or error saying something along the lines of "Gradient only defined for scalar-ouput functions. What this means is that the resulting derivative function is not _vectorized_. As you may have guessed, this is easily fixed. Another key _transformation_ that `JAX` provides is called `vmap`. `vmap` takes a function and creates a vectorized version of the function. See the following. - -[source,python] ----- -vectorized_deriv_squared = jax.vmap(deriv) -vectorized_deriv_squared(my_array) # [4.0, 8.0, 12.0] ----- - -Heck yes! That is pretty cool, and very powerful. It is _so_ much more understandable than the magic happening in the `pytorch` world too! - -Dig back into your memory about any equation you may have had in the past where you needed to find a derivative. Create a Python function, find the derivative, and test it out on both a single value, like `4.0` as well as an array, like `jnp.array([1.0,2.0,3.0])`. Don't hesitate to make it extra fun and include some functions like `jnp.cos`, `jnp.sin`, etc. Did everything work as expected? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Okay, great, but that was a straight-forward example. What if we have multiple parameters we'd like to take partial derivatives with respect to? `jax.grad` can handle that too! - -Read https://jax.readthedocs.io/en/latest/jax-101/01-jax-basics.html#jax-first-transformation-grad[this] excellent example in the official JAX documentation. - -[NOTE] -==== -The JAX documentation is pretty excellent! If you are interested, I would recommend reading through it, it is very well written. -==== - -Given the following (should be familiar) model, create a function called `get_partials` that accepts an argument `params` (a tuple containing beta0, beta1, and beta2, in order) and an argument `x`, that can be either a single value (a scalar), or a `jnp.array` with multiple values. This function should return a single value for each of the 3 partial derivatives, where `x` is plugged into each of the 3 partial derivatives to calculate each value, OR, 3 arrays of results where there are 3 values for each value in the input array. 
- -[source,python] ----- -@jax.jit -def model(params, x): - beta0, beta1, beta2 = params - return beta0 + beta1*x + beta2*x**2 ----- - -.example using it -[source,python] ----- -model((1.0, 2.0, 3.0), 4.0) # 57 -model((1.0, 2.0, 3.0), jnp.array((4.0, 5.0, 6.0))) # [57, 86, 121] ----- - -Since we have 3 parameters, we will have 3 partial derivatives, and our new function should output a value for each of our 3 partial derivatives, for each value passed as `x`. To be explicit and allow you to check your work, the results should be the same as the following. - -[source,python] ----- -params = (5.0, 4.0, 3.0) -get_partials(params, x_train) ----- - -.output ----- -((DeviceArray([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., - 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., - 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., - 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., - 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], dtype=float32, weak_type=True), - DeviceArray([-15.94824 , -11.117526 , -10.4780855 , -8.867778 , - -8.799367 , -8.140428 , -7.8744955 , -7.72306 , - -6.9281745 , -6.2731333 , -6.2275624 , -5.7271757 , - -5.1857414 , -5.150156 , -4.8792663 , -4.663747 , - -4.58701 , -4.1310377 , -4.0215836 , -4.019455 , - -3.5578184 , -3.4748363 , -3.4004524 , -3.1221437 , - -3.0421085 , -2.941131 , -2.8603644 , -2.8294718 , - -2.7050996 , -1.9493109 , -1.7873074 , -1.2773769 , - -1.1804487 , -1.1161369 , -1.1154363 , -0.8590109 , - -0.81457555, -0.7386795 , -0.57577926, -0.5536533 , - -0.51964295, -0.12334588, 0.11549235, 0.14650635, - 0.24305418, 0.2876291 , 0.3942046 , 0.6342466 , - 0.8256681 , 1.2047065 , 1.9168468 , 1.9493027 , - 1.9587051 , 2.3490443 , 2.7015095 , 2.8161156 , - 2.8648841 , 2.946292 , 3.1312609 , 3.1810293 , - 4.503682 , 5.114829 , 5.1591663 , 5.205859 , - 5.622392 , 5.852435 , 6.21313 , 6.4066596 , - 6.655888 , 6.781989 , 7.1651325 , 7.957219 , - 8.349893 , 11.266327 , 13.733376 ], dtype=float32, weak_type=True), - DeviceArray([2.54346375e+02, 1.23599388e+02, 1.09790276e+02, - 7.86374817e+01, 7.74288559e+01, 6.62665634e+01, - 6.20076790e+01, 5.96456566e+01, 4.79996033e+01, - 3.93521996e+01, 3.87825356e+01, 3.28005409e+01, - 2.68919144e+01, 2.65241070e+01, 2.38072395e+01, - 2.17505341e+01, 2.10406590e+01, 1.70654716e+01, - 1.61731339e+01, 1.61560173e+01, 1.26580715e+01, - 1.20744877e+01, 1.15630760e+01, 9.74778175e+00, - 9.25442410e+00, 8.65025234e+00, 8.18168449e+00, - 8.00591087e+00, 7.31756353e+00, 3.79981303e+00, - 3.19446778e+00, 1.63169169e+00, 1.39345896e+00, - 1.24576163e+00, 1.24419820e+00, 7.37899661e-01, - 6.63533330e-01, 5.45647442e-01, 3.31521749e-01, - 3.06531966e-01, 2.70028800e-01, 1.52142067e-02, - 1.33384829e-02, 2.14641113e-02, 5.90753369e-02, - 8.27304944e-02, 1.55397251e-01, 4.02268738e-01, - 6.81727827e-01, 1.45131791e+00, 3.67430139e+00, - 3.79978085e+00, 3.83652544e+00, 5.51800919e+00, - 7.29815340e+00, 7.93050718e+00, 8.20756149e+00, - 8.68063641e+00, 9.80479431e+00, 1.01189480e+01, - 2.02831535e+01, 2.61614761e+01, 2.66169968e+01, - 2.71009693e+01, 3.16112938e+01, 3.42509956e+01, - 3.86029854e+01, 4.10452881e+01, 4.43008461e+01, - 4.59953766e+01, 5.13391228e+01, 6.33173370e+01, - 6.97207031e+01, 1.26930122e+02, 1.88605606e+02], dtype=float32, weak_type=True)),) ----- - -[source,python] ----- -get_partials((1.0,2.0,3.0), jnp.array((4.0,))) ----- - -.output ----- -((DeviceArray([1.], dtype=float32, weak_type=True), - DeviceArray([4.], dtype=float32, weak_type=True), - DeviceArray([16.], 
dtype=float32, weak_type=True)),) ----- - -[TIP] -==== -To specify which arguments to take the partial derivative with respect to, use the `argnums` argument to `jax.grad`. In our case, our first argument is really 3 parameters all at once, so if you did `argnums=(0,)` it would take 3 partial derivatives. If you specified `argnums=(0,1)` it would take 4 -- that last one being with respect to x. -==== - -[TIP] -==== -To vectorize your resulting function, use `jax.vmap`. This time, since we have many possible arguments, we will need to specify the `in_axes` argument to `jax.vmap`. `in_axes` will accept a tuple of values -- one value per parameter to our function. Since our function has 2 arguments: `params` and `x`, this tuple should have 2 values. We should put `None` for arguments that we don't want to vectorize over (in this case, `params` stays the same for each call, so the associated `in_axes` value for `params` should be `None`). Our second argument, `x`, should be able to be a vector, so you should put `0` for the associated `in_axes` value for `x`. - -This is confusing! However, considering how powerful and all that is baked into the `get_partials` function, it is probably acceptable to have to sit an think a bit to figure this out. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project12.adoc b/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project12.adoc deleted file mode 100644 index 830ade4dd..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project12.adoc +++ /dev/null @@ -1,236 +0,0 @@ -= TDM 30200: Project 12 -- 2023 - -**Motivation:** Machine learning and AI are huge buzzwords in industry, and two of the most popular tools surrounding said topics are the pytorch and tensorflow libraries — JAX is another tool by Google growing in popularity. These tools are libraries used to build and use complex models. If available, they can take advantage of GPUs to speed up parallelizable code by a hundred or even thousand fold. - -**Context:** This is the last of a series of 4 projects focused on using pytorch and JAX to solve numeric problems. - -**Scope:** Python, JAX - -.Learning Objectives -**** -- Compare and contrast pytorch and JAX. -- Differentiate functions using JAX. -- Understand what "JIT" is and why it is useful. -- Understand when a value or operation should be static vs. traced. -- Vectorize functions using the vmap function from JAX. -- How do random number generators work in JAX? -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -Last weeks project was a bit fast paced, so we will slow things down considerablyto try and compensate, and give you a chance to digest and explore more. 
We will: - -- Learn how `JAX` handles generating random numbers differently than most other packages. -- Write a function in `numpy` to calculate the Hamming distance between a given image hash and the remaining (around 123k) image hashes. -- Play around with the hash data and do some sanity checks. - -Let's start by taking a look at the documentation for https://jax.readthedocs.io/en/latest/jax-101/05-random-numbers.html[random number generation]. Carefully read the page -- it's not that long. - -The documentation gives the following example. - -[source,python] ----- -import numpy as np - -np.random.seed(0) - -def bar(): return np.random.uniform() -def baz(): return np.random.uniform() - -def foo(): return bar() + 2 * baz() - -print(foo()) ----- - -It then goes on to say that `JAX` may try to parallelize the `bar` and `baz` functions. As a result, we would not know which would run first, `bar` or `baz`. This would change the results of `foo`. Below, we've modified the code to emulate this. - -[source,python] ----- -import numpy as np -import random - -def bar(): return np.random.uniform() -def baz(): return np.random.uniform() - -def foo1(): return bar() + 2 * baz() - -def foo2(): return 2*baz() + bar() - -def foo(*funcs): - functions = list(funcs) - random.shuffle(functions) - return functions[0]() ----- - -[source,python] ----- -np.random.seed(0) -foo(foo1, foo2) ----- - -.output ----- -# sometimes this -1.9791922366721637 - -# sometimes this -1.812816374227069 ----- - -`JAX` has a much different way of dealing with this. While the solution is clean and effective, and allows such code to be parallelized, it _can_ be a bit more cumbersome managing and passing around keys. Create a modified version of this code using `JAX`, and passing around keys. Fill in the `?` parts. - -[source,python] ----- -import numpy as np - -key = jax.random.PRNGKey(0) -key, *subkeys = jax.random.split(key, num=?) - -def bar(key): - return ? - -def baz(key): - return ? - -def foo1(key1, key2): - return bar(key1) + 2 * baz(key2) - -def foo2(key1, key2): - return 2*baz(key2) + bar(key1) - -def foo(funcs, keys): - functions = list(funcs) - random.shuffle(functions) - return ? ----- - -[source,python] ----- -key = jax.random.PRNGKey(0) -key, *subkeys = jax.random.split(key, num=3) -print(foo((foo1, foo2), (subkeys[0], subkeys[1]))) ----- - -.output ----- -# always -2.3250647 ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Write a function called `get_distances_np` that accepts a filename (as a string) (`fm_hash`), and a path (as a string) (`path`). - -`get_distances_np` should return a numpy array of the distances between the hash for `fm_hash` and every other image hash in `path`. - -For this question, use the dataset of hashed images found in `/anvil/projects/tdm/data/coco/hashed02/`. An example of a call to `get_distances_np` would look like the following. - -[source,python] ----- -from pathlib import Path -import imagehash -import numpy as np ----- - -[source,python] ----- -%%time - -hshs = get_distances_np("000000000008.jpg", "/anvil/projects/tdm/data/coco/hashed02/") -hshs.shape # (123387, 1) ----- - -How long does it take to run this function? - -Make plots and/or summary statistics to check out the distribution of the distances. How does it look? - -[TIP] -==== -The distance we want you to calculate is the https://en.wikipedia.org/wiki/Hamming_distance[Hamming distance]. 
We've provided you with a function that accepts two numpy arrays and returns the Hamming distance between them. - -[source,python] ----- -def _hamming_distance(hash1, hash2): - return sum(~(hash1 == hash2)) ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -What do you think about the design of the `get_distances_np` function, considering that we are interested in pairwise Hamming distances? - -At its core, we essentially have a vector of 123k values. If we were to get the pairwise distances, the resulting distances would fill the upper triangle of a 123k by 123k matrix. This would be a _very large_ amount of data, considering it is just numeric data -- more than can easily fit in memory. - -In addition, the part of the function from question 2 that takes the majority of the run time is _not_ the numeric computations, but rather the opening and reading of the 123k hashes. Approximately 55 of the 65-70 seconds. With this in mind, let's back up, and break this problem down further. - -Write a code cell containing code that will read in all of the hashes into a `numpy` array of size (123387, 64). - -This array contains the hashes for each of the 123k images. Each row is the hash of an image. Let's call the resulting (123387, 64) array `hashes`. - -Given what we know, the following is a very fast function that will find the Hamming distances between a single image and all of the other images. - -[source,python] ----- -def hamming_distance(hash1, hash2): - return np.sum(~(hash1 == hash2), axis=1) ----- - -[source,python] ----- -%%time - -hamming_distance(hashes[0], hashes) ----- - -This runs in approximately 16 ms. This would be about 32 minutes if we calculated the distance for every pair. - -Convert your `numpy` array into a `JAX` array, and create an equivalent function. How fast does this function run? What would the approximate runtime be for the total calculation? - -[IMPORTANT] -==== -Remember to use `jax.jit` to speed up the function. Also recall that the first run of the compiled function will be _slow_ since it needs to be compiled. After that, future uses of the function will be faster. -==== - -Make sure to take into consideration the slower first run. What would the approximate total runtime be using the `JAX` function? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Don't worry, I'm not going to make you run these calculations. Instead, answer one of the following two questions. - -. Pick 2 images / image hashes and get the closest 3 images by Hamming distance for each. Note the distances and display the images. At those distances, can you perceive any sort of "closeness" in image? -. Randomly sample (using `JAX` methods) _n_ (more than 4, please) images and calculate all of the pairwise distances. Create a set of plots showing the distributions of distances. Explore the distances, and the dataset, and write 1-2 sentences about any interesting observations you are able to make, or 1-2 sentences on how you could use the information to do something cool. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project13.adoc b/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project13.adoc deleted file mode 100644 index 14ac17774..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project13.adoc +++ /dev/null @@ -1,145 +0,0 @@ -= TDM 30200: Project 13 -- 2023 - -**Motivation:** This year you've been exposed to a _lot_ of powerful (and maybe new for you) tools and concepts. It would be really impressive if you were able to retain all of it, but realistically that probably didn't happen. It takes lots of practice for these skills to develop. One common term you may hear thrown around is ETL. It stands for Extract, Transform, Load. You may or may not ever have to work with an ETL pipeline, however, it is a worthwhile exercise to plan one out. - -**Context:** This is the first of the final two projects where you will map out an ETL pipeline, and the remaining typical tasks of a full data science project, and execute. It wouldn't be practical to make this exhaustive, but the idea is to _think_ about and _plan out_ the various steps in a project and execute it the best you can given time and resource constraints. - -**Scope:** Python - -.Learning Objectives -**** -- Describe and plan out an ETL pipeline to solve a problem of interest. -- Create a flowchart mapping out the steps in the project. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -Create a problem statement for the project. What question are you interested in answering? What theory do you have that you'd like to show to maybe be true? This could be _anything_. Some examples could be: - -- Should you draft running backs before wide receivers in fantasy football? -- Are news articles more "positive" or "negative" on nytimes.com vs. washingtonpost.com? -- Are the number of stars of an Amazon review important to tell if the review is fake or not? -- Are flight delays more likely to happen in the summer or winter? - -The question you want to answer can be as simple or complex as you want it to be. - -[IMPORTANT] -==== -When coming up with the problem statement, please take into consideration that in this project, and the next, we will ask you to utilize skills you were exposed to this year. Things like: SLURM, `joblib`, `pytorch`, `JAX`, docker/singularity, `fastapi`, sync/async, `pdoc`, `pytest`, etc. It is likely that you will want to use other skills from previous years as well. Things like: web scraping, writing scripts, data wrangling, SQL, etc. - -Try to think of a question that _could_ be solved by utilizing some of these skills. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Read about ETL pipelines https://en.wikipedia.org/wiki/Extract,_transform,_load[here]. Summarize each part of the pipeline (extract, transform, and load) in your own words. 
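If it helps to make the three stages concrete, here is a minimal, purely illustrative sketch of a tiny ETL step in Python (the file name, column name, and table name are made up for the example):

[source,python]
----
import sqlite3

import pandas as pd

# extract: pull raw data from some source (a local CSV here; it could be an API or a scrape)
raw = pd.read_csv("flights.csv")

# transform: clean and reshape, e.g. keep only delayed flights and normalize column names
delayed = raw[raw["dep_delay"] > 0].rename(columns=str.lower)

# load: write the cleaned table into your "data warehouse" (a sqlite file here)
with sqlite3.connect("warehouse.db") as conn:
    delayed.to_sql("delayed_flights", conn, if_exists="replace", index=False)
----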
Follow this up by looking at the image at the top of https://r4ds.had.co.nz/introduction.html[this] section of "R for Data Science". Where do you think the ETL pipeline could be added to this workflow? Read about Dr. Wickhams definition of https://r4ds.had.co.nz/tidy-data.html[tidy data]. After reading about his definition, do you think the "Tidy" step in the chart is potentially different than the "transform" step in the ETL pipeline? - -[NOTE] -==== -There are no correct answer to this question. Just think about the question and describe what you think. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Flowcharts are an incredibly useful tool that can help you visualize and plan a project from start to end. Flowcharts can help you realize what parts of the project you are not clear on, which could save a lot of work during implementation. Read about the various flowchart shapes https://www.rff.com/flowchart_shapes.php[here], and plan out your ETL pipeline and the remaining project workflow using https://www.draw.io/index.html[this] free online tool. xref:book:projects:templates.adoc#including-an-image-in-your-notebook[Include the image] of your flowchart in your notebook. - -[NOTE] -==== -You are not required to follow this flow chart exactly. You will have an opportunity to point out any changes you ended up making to your project flow later on. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -There will more or less be a few "major" steps in your project: - -- **Extract:** scrape, database queries, find and download data files, etc. -- **Transform:** data wrangling using `pandas`, `tidyverse`, `JAX`, `numpy`, etc. -- **Load:** load data into a database or a file that represents your "data warehouse". -- **Import/tidy:** Grab data from your "data warehouse" and tidy it if necessary. -- **Iterate:** Modify/visualize/model your data. -- **Communicate:** Share your deliverable(s). - -[NOTE] -==== -Of course, you don't _need_ to include all of these steps. Any well-planned approach will receive full credit. -==== - -This can be further boiled down to just a few steps: - -- Data collection/cleaning. -- Analysis/modeling/visualization. -- Report. - -Implement your data collection/cleaning step. Be sure to submit any relevant files and code (e.g. python script(s), R script(s), simply some code cells in a Jupyter Notebook, etc.) in your submission. - -To get full credit, simply choose at least 2 of the following skills to incorporate into this step (or these steps): - -- https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html[Google style docstrings], or https://style.tidyverse.org/documentation.html[tidyverse style comments] if utilizing R. -- Singularity/docker (if, for example, you wanted to use a container image to run your code repeatably). -- sync/async code (if, for example, you wanted to speed up code that has a lot of I/O). -- `joblib` (if, for example, you wanted to speed up the scraping of many files). -- `SLURM` (if, for example, you wanted to speed up the scraping of many files). -- `requests`/`selenium` (if, for example, you need to scrape data as a part of your collection process). -- If you choose to use `sqlite` as your intermediate "data warehouse" (instead of something easier like a csv or parquet file), this will count as a skill. 
-- If you use `argparse` and build a functioning Python script, this will count as a skill. -- If you write `pytest` tests for your code, this will count as a skill. - -[IMPORTANT] -==== -Make sure to include a screenshot or two actually _using_ your deliverable(s) in your notebook (for example, if it was a script, show some screenshots of your terminal running the code). In addition, make sure to clearly indicate which of the "skills" you chose to use for this step. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -If you read about ETL pipelines, you are probably not exactly sure what a "data warehouse" is. Browse the internet and read about data warehouses. In your own words, summarize what a data warehouse is, and the typical components. - -Here are some common data warehouse products: - -- Snowflake -- Google BigQuery -- Amazon Redshift -- Apache Hive -- Databricks Lakehouse Platform - -Choose a product to read about and describe 2-3 things that it looks like the product can do, and explain why (or when) you think that functionality would be useful. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project14.adoc b/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project14.adoc deleted file mode 100644 index dc1a7e32d..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-project14.adoc +++ /dev/null @@ -1,112 +0,0 @@ -= TDM 30200: Project 14 -- 2023 - -**Motivation:** This year you've been exposed to a _lot_ of powerful (and maybe new for you) tools and concepts. It would be really impressive if you were able to retain all of it, but realistically that probably didn't happen. It takes lots of practice for these skills to develop. One common term you may hear thrown around is ETL. It stands for Extract, Transform, Load. You may or may not ever have to work with an ETL pipeline, however, it is a worthwhile exercise to plan one out. - -**Context:** This is the second of the final two projects where you will map out an ETL pipeline, and the remaining typical tasks of a full data science project, and execute. It wouldn't be practical to make this exhaustive, but the idea is to _think_ about and _plan out_ the various steps in a project and execute it the best you can given time and resource constraints. - -**Scope:** Python - -.Learning Objectives -**** -- Describe and plan out an ETL pipeline to solve a problem of interest. -- Create a flowchart mapping out the steps in the project. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -[WARNING] -==== -If you skipped project 13, please go back and complete project 13 and submit it as your project 14 submission.
Please make a bold note at the top of your submission "This is my project 14, but is really project 13", so graders know what to expect. Thanks! -==== - -In the previous project, you _probably_ spent most of the time reading about ETL, flowcharts, data warehouses, and planning out your project. The more you have things planned out the less amount of time it will likely take to implement. Your project probably looks something like this now. - -* [x] Data collection/cleaning. -* [ ] Analysis/modeling/visualization. -* [ ] Report. - -In this project, you will complete those last two steps. - -Import data from your "data warehouse" and perform an analysis to answer the problem statement you created in the previous project. Your analysis should contain: - -- 1 or more data visualizations. -- 1 or more sets of summary data (think `.describe()` from `pandas` or `summary`/`prop.table` from R). - -[NOTE] -==== -Feel free to utilize the `transformers` package and the wide variety of pre-built models provided at https://huggingface.co/models[huggingface]. -==== - -Alternatively, you can build an API and/or dashboard using `fastapi` (or any other framework like `django`, `flask`, `shiny`, etc.). Simply make sure to include your code and screenshots of you utilizing the API or using the dashboard. - -For _either_ of the options above (summary data/visualizations or API/dashboard), in order to get full credit, simply choose at least 2 of the following skills to incorporate inot this step (or these steps): - -- https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html[Google style docstrings], or https://style.tidyverse.org/documentation.html[tidyverse style comments] if utilizing R. -- Singularity/docker (if, for example, you wanted to use a container image to run your code repeatably, or run your API/dashboard). -- sync/async code (if, for example, you wanted to speed up code that has a lot of I/O). -- `joblib` (if, for example, you wanted to speed up a parallelizable task or computation). -- `SLURM` (if, for example, you wanted to speed up a parallelizable task or computation). -- If you use `argparse` and build a functioning Python script, this will count as a skill. -- If you write `pytest` tests for your code, this will count as a skill. -- Use `JAX` (for example `jax.jit`) or `pytorch` for some numeric computation. - -[IMPORTANT] -==== -Make sure to include a screenshot or two actually _using_ your deliverable(s) in your notebook (for example, if it was a script, show some screenshots of your terminal running the code). In addition, make sure to clearly indicate which of the "skills" you chose to use for this step. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -The final task, to create your deliverable to communicate your results, is probably the most important part of a typical project. It is important that people understand what you did, why it is important for answering your question, and why it provides value. Learning how to make a good slide deck is a really useful skill! - -In our case, it makes more sense to have a Jupyter Notebook, since those are easy to read in Gradescope, and get the point across. - -After your question 1 results are entered in your Jupyter Notebook, under the "Question 2" heading, create your deliverable. Use markdown cells to beautifully format the information you want to present. 
Include everything starting with your problem statement, leading all the way up to your conclusions (even if just anecdotal conclusions). Include code, graphics, and screenshots that are important to the story. Of course, you don't need to include code from scripts (in the notebook -- we _do_ want all scripts from question 1 (if any) included in your submission), but you can mention that you had a script called `my_script.py` that did X, Y, and Z. - -The goal of this deliverable is that an outsider could read your notebook (starting from question 2) and understand what question you had, what you did (and why), and what were the results. Any good effort will receive full credit. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -In the previous project, you were asked to create a flow chart to describe the steps in your system/project. As you began implementing things, you may or may not have changed your original plan. If you did, update your flowchart and include it in your notebook. Otherwise, include your old flow chart and explain that you didn't change anything. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -It has been a fun year. We hope that you learned something new! - -- Write 3 (or more) of your least favorite topics and/or projects from this past year (for TDM 30100/30200). -- Write 3 (or more) of your most favorite projects/topics, and/or 3 topics you wish you were able to learn _more_ about. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-projects.adoc b/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-projects.adoc deleted file mode 100644 index 13bc2faa8..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/30200/30200-2023-projects.adoc +++ /dev/null @@ -1,46 +0,0 @@ -= TDM 30200 - -== Project links - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$30200-2023-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:55pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. 
Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -[NOTE] -==== -Piazza links remain the same from Fall 2022 to Spring 2023. -==== - -=== Sign up - -https://piazza.com/purdue/fall2022/tdm30100[https://piazza.com/purdue/fall2022/tdm30100] - -=== Link - -https://piazza.com/purdue/fall2022/tdm30100/home[https://piazza.com/purdue/fall2022/tdm30100/home] - -== Syllabus - -Navigate to the xref:spring2023/logistics/syllabus.adoc[syllabus]. diff --git a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project01.adoc b/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project01.adoc deleted file mode 100644 index 4af495005..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project01.adoc +++ /dev/null @@ -1,372 +0,0 @@ -= TDM 40200: Project 1 -- 2023 - -**Motivation:** `JAX` is a Python library used for high-performance numerical computing and machine learning research. In the upcoming series of projects, this library will be used to build a small neural network from scratch. - -**Context:** This is the first project of the semester. We are going to start slowly by reviewing the `JAX` library, which we will use in the following series of projects. - -**Scope:** Python, `JAX`, `numpy` - -.Learning Objectives -**** -- Differentiate functions using `JAX`. -- Understand what "JIT" is and why it is useful. -- Understand when a value or operation should be static vs. traced. -- How does random number generation work in `JAX`? -- How to use `JAX` for basic matrix manipulation. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data` - -== Questions - -The following sets of documentation will be useful for this project. - -. https://jax.readthedocs.io/en/latest/index.html[JAX Documentation] -. https://numpy.org/doc/stable/user/absolute_beginners.html#[NumPy Documentation] - -=== Question 1 - -Use `JAX` to perform the following numeric operations. - -. Create and display a 2x3 array called `first` of the values 1 through 6 in row-major order. -. Create and display a 3x2 array called `second` where each element is the value 2. -+ -[TIP] -==== -Use the `JAX` equivalent of https://numpy.org/doc/stable/reference/generated/numpy.full.html#numpy.full[this] function. -==== -+ -. Multiply `first` and `second` together into a resulting `third` and display the result. -. Display the _shape_ of `third`. -. Multiply all values of `third` by 2 and display the result. -. Multiply `third`, element-wise, by the matrix formed from the values 1 through 4 in row-major order and display the result. -+ -[TIP] -==== -Use the `JAX` equivalent of https://numpy.org/doc/stable/reference/generated/numpy.multiply.html#numpy-multiply[this] function. -==== -+ -. Use the `JAX` equivalent of https://numpy.org/doc/stable/reference/generated/numpy.vstack.html#numpy-vstack[this] function to add a row to `second` containing the values 3 and 4. Save the result to `fourth`. -. 
Use the `JAX` equivalent of https://numpy.org/doc/stable/reference/generated/numpy.hstack.html#numpy-hstack[this] function to add a column to `fourth` containing the value 1 repeated 4 times. Save the result to `fifth`. -. Given the `JAX` array created by the following code, change the 1 to 7. -+ -[source,python] ----- -my_array = jnp.array([[2,2,2], [2,1,3]]) ----- -+ -. Read https://jax.readthedocs.io/en/latest/notebooks/Common_Gotchas_in_JAX.html#in-place-updates[this section] of the `JAX` docs and explain (in your own words) why the following code that should solve the previous question does not work. -+ -[source,python] ----- -my_array[1,1] = 7 ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- A markdown cell containing an explanation on why `my_array[1,1] = 7` does not work in `JAX`. -==== - -=== Question 2 - -Read https://jax.readthedocs.io/en/latest/notebooks/Common_Gotchas_in_JAX.html#random-numbers[this section] of the `JAX` documentation to recall the way you can generate random numbers using `JAX`. - -Check out the following code. - -[source,python] ----- -import numpy as np - -np.random.seed(0) - -def bar(): return np.random.uniform() -def baz(): return np.random.uniform() - -def foo(): return bar() + 2 * baz() - -print(foo()) ----- - -If this were written using `JAX`, `JAX` may well attempt to parallelize the `bar` and `baz` functions to run at the same time. As a result, whether `bar` or `baz` ran first would be unclear. This, unfortunately, would change the results of `foo`, leaving the code difficult to replicate reliably. - -To illustrate this, `foo` could be the result of either of the following functions. - -[source,python] ----- -def foo1(): return bar() + 2*baz() - -# or - -def foo2(): return 2*baz() + bar() ----- - -If `bar` executes first (like in `foo1`), you will end up with a different result than if `baz` executes first (like in `foo2`). The following code illustrates this. - -[source,python] ----- -import numpy as np -import random - -def bar(): return np.random.uniform() -def baz(): return np.random.uniform() - -def foo1(): return bar() + 2*baz() - -def foo2(): return 2*baz() + bar() - -def foo(*funcs): - functions = list(funcs) - random.shuffle(functions) - return functions[0]() ----- - -Running the following will sometimes give you the result `1.9791922366721637`, and sometimes `1.812816374227069`. - -[source,python] ----- -np.random.seed(0) -foo(foo1, foo2) ----- - -The way `JAX` generates random values is different, and prevents such issues. At the same time, the way `JAX` generates random values is not as straightforward as `NumPy`. Fill in the `?` parts of the following code. The resulting code should reliably output the same value regardless of whether `bar` or `baz` executes first -- `2.3250647`. - -[source,python] ----- -import jax - -key = jax.random.PRNGKey(0) -key, *subkeys = jax.random.split(key, num=?) - -def bar(key): - return ? - -def baz(key): - return ? - -def foo1(key1, key2): - return bar(key1) + 2*baz(key2) - -def foo2(key1, key2): - return 2*baz(key2) + bar(key1) - -def foo(funcs, keys): - functions = list(funcs) - random.shuffle(functions) - return ? 
----- - -[source,python] ----- -# the following code will always produce 2.3250647, regardless of whether bar or baz executes first -# this means this code is reproducible even in the scenario where the `bar` and `baz` functions are parallelized -key = jax.random.PRNGKey(0) -key, *subkeys = jax.random.split(key, num=3) -print(foo((foo1, foo2), (subkeys[0], subkeys[1]))) ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -At the heart of `JAX` is the automatic differentiation system. This system allows `JAX` to compute gradients of functions, automatically. This is extremely powerful. Write a function called `my_function` that accepts the value `x` and returns the value `14x^3 + 13x`. Test it out given the value of `x=17`. - -Next, use the powerful `JAX` function `grad` to create a new function called `my_gradient` that accepts the value `x` and returns the gradient of `my_function` at `x`. Test it out given the value of `x=17`. What was the result? Does the result match the value when you plug `x=17` into the derivative of `my_function`? - -[IMPORTANT] -==== -`17` is an integer and `17.0` is a float. The `jax.grad` function requires real or complex valued inputs. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Another key utility in `JAX` is the `jit` function. JIT stands for just in time. Just in time compilation is a trick that can be used in some situations to greatly increase the speed or execution time of some code, by compiling it. The compiled version of the code has a myriad of optimizations applied to it that speeds up your code. - -Take the following, arbitrary code, and execute it in your notebook. - -[source,python] ----- -%%time - -key = jax.random.PRNGKey(0) - -def my_model(keys): - return 14*jax.random.normal(keys[0])**2 + 13*jax.random.normal(keys[1]) - -for i in range(100000): - key, *subkeys = jax.random.split(key, 3) - my_value = my_model(subkeys) ----- - -How long did it take? Now, use the `@jax.jit` https://realpython.com/primer-on-python-decorators/[decorator] to apply the `jit` transformation to your `my_model` function to use just in time compilation to speed up the code. Did it work? - -Well, actually, just slapping the `@jax.jit` decorator on the function is not good enough. Why? Because `JAX` has asynchronous dispatch by default. What this means is that, by default, `JAX` will return control to Python as soon as possible, even if it is _before_ the function has been fully evaluated. So while it may _appear_ as if all of the 100000 loops have been executed, in reality, they may not have been. - -To properly test if the JIT trick has sped things up, we need to _synchronously_ wait for our code to finish executing. This can be easily accomplished by using the built in `block_until_ready` method build into all JIT compiled functions. - -For example, the following code will _synchronously_ wait for the `my_func` function to finish executing. - -[source,python] ----- -@jax.jit -def my_func(): - return 1 - -my_func().block_until_ready() ----- - -Repeat the experiment but make sure we are synchronously waiting for the code to finish executing. How long did it take? Did it work? - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -At this point in time you may be thinking -- let's go! 
I can just slap `@jax.jit` on all my functions an make magically fast code! Well, not so fast. There are some caveats to using `JAX` and `jit` that you should be aware of. - -By default, `JAX` will try and "trace" parameters to determine their effect on inputs of a specific shape and type. In `JAX`, control flow _cannot_ depend on these "traced" values. For example, the following code will not work because `num_loops` is relied on in order to determine how many times to loop. - -[source,python] ----- -%%time - -def my_function(x, num_loops): - - for i in range(num_loops): - pass - - -fast_my_function = jax.jit(my_function) - -fast_my_function(14, 1000000) ----- - -What is the solution to this problem? How do we fix it? Well, this is not always possible, however, we _can_ choose to select certain arguments to be _static_ or not "traced". If a parameter is marked as static, or not "traced", it can be JIT compiled. The catch is that any time a call to the function is made and the value of any of the _static_ parameters is changed, the function will have to be recompiled with that new static value. So, this is only useful if you will only occasionally change the parameter. - -It just so happens that the provided snippet of code is a good candidate for this, as the user will only occasionally decide to change the number of "num_loops" when running the code! - -You can mark a parameter as static by specifying the argument position using the `static_argnums` argument to `jax.jit`, or by specifying the argumnet _name_ using the `static_argnames` argument to `jax.jit`. - -Force the `num_loops` argument to be static and use the `jax.jit` decorator to compile the function. Test out the function, in order, using the following code cells. - -[source,python] ----- -%%time - -def my_function(x, num_loops): - - for i in range(num_loops): - pass - -fast_my_function = jax.jit(my_function, static_argnums=(1,)) ----- - -[source,python] ----- -%%time - -fast_my_function(14, 1000000) ----- - -[source,python] ----- -%%time - -fast_my_function(14, 1000000) ----- - -[source,python] ----- -%%time - -fast_my_function(14, 999999) ----- - -Do your best to explain why the last code cell was once again slower. - -In addition, the _shapes_ or dimensions of all inputs and outputs must be able to be determined ahead of time. For example, the following will fail. - -[source,python] ----- -%%time - -def my_function(x, arr_cols): - - my_array = jnp.full((2, arr_cols), 5) - -fast_my_function = jax.jit(my_function) - -fast_my_function(5, 5) ----- - -You can, once again, fix this by specifying the static parameters. - -[source,python] ----- -%%time - -def my_function(x, arr_cols): - - my_array = jnp.full((2, arr_cols), 5) - -fast_my_function = jax.jit(my_function, static_argnums=(1,)) ----- - -And, once again, `JAX` will recompile every time that the static argument changes. - -[source,python] ----- -%%time - -# slow, first time compiling -fast_my_function(5, 5) ----- - -[source,python] ----- -%%time - -# fast, already compiled with static argument of 5 -fast_my_function(5, 5) ----- - -[source,python] ----- -%%time - -# slow, recompiling with static argument of 6 -fast_my_function(5, 6) ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project02.adoc b/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project02.adoc deleted file mode 100644 index cf45284f4..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project02.adoc +++ /dev/null @@ -1,333 +0,0 @@ -= TDM 40200: Project 2 -- 2023 - -**Motivation:** Dashboards are everywhere -- many of our corporate partners' projects are to build dashboards (or dashboard variants)! Dashboards are used to interactively visualize some set of data. Dashboards can be used to display, add, remove, filter, or complete some customized operation to data. Ultimately, a dashboard is really a website focused on displaying data. Dashboards are so popular, there are entire frameworks designed around making them with less effort, faster. Two of the more popular examples of such frameworks are https://shiny.rstudio.com/[`shiny`] (in R) and https://dash.plotly.com/introduction[`dash`] (in Python). While these tools are incredibly useful, it can be very beneficial to take a step back and build a dashboard (or website) from scratch (we are going to utilize many powerfuly packages and tools that make this far from "scratch", but it will still be more from scratch than those dashboard frameworks). - -**Context:** This is the first in a series of projects focused around slowly building a dashboard. Students will have the opportunity to: create a backend (API) using `fastapi`, connect the backend to a database using `aiosql`, use the `jinja2` templating engine to create a frontend, use `htmx` to add "reactivity" to the frontend, create and use forms to insert data into the database, containerize the application so it can be deployed anywhere, and deploy the application to a cloud provider. Each week the project will build on the previous week, however, each week will be self-contained. This means that you can complete the project in any order, and if you miss a week, you can still complete the following project using the provided starting point. - -**Scope:** Python, dashboards - -.Learning Objectives -**** -- Create a development environment to make building a dashboard on Anvil easier. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -[WARNING] -==== -In the previous project, we reviewed the `JAX` library in preparation for a series of projects focused around basic neural networks. At least for now, we are going to switch gears and learn about the various components of a dashboard. By the end of this series, you will have built a dashboard, complete with documentation. Your dashboard will be containerized. We will even deploy the dashboard, so you can click a link and have it appear. This series will try to move slowly, so you have time to get comfortable with each technology that is introduced. -==== - -=== Question 1 - -[WARNING] -==== -If at any time you get stuck, please make a Piazza post and we will help you out! 
-==== - -Dashboards can have many components that need to be setup, configured, and wired together. You have a backend server that runs on some _port_. You could have a database running on some other port. You have a frontend that needs to communicate with the backend. Maybe you have a redis instance running to cache requests. All of these things communicate with each other and sometimes in different ways. As a data scientist, you may need to work with all of these components. Things can get unwieldy very quickly. This is why taking the time to setup a _good_ development environment is critical. - -A development environment is more or less the set of tools you use to develop your project. You need to be able to quickly access and use said environment quickly, so you don't have to spend 30 minutes setting things up every time you want to make a change. It is highly likely that throughout your experience in the data mine, this has mostly looked something like: open a browser, fill out a form to launch Jupyter Lab, use Jupyter Lab for development. This is a perfectly fine solution for many things, however, building a dashboard is more involved, and using the tools we've historically used in the data mine is not ideal and would lead to a longer "code, run, observe, repeat" cycle. Of course, if you are comfortable using a terminals, ssh, and shell text editors, this can be manageable. However, these tools aren't the most accessible, and can be intimidating to new users. - -For this project, we will be doing something a little different in order to make the development experience on Anvil more pleasant. In addition, I imagine many of you will enjoy what we are going to setup and use it for other projects (or maybe even corporate partners projects). - -Typically, when developing a dashboard, you will have a set (or many sets) of code that you will update and modify. To see the results, you will run your server on a certain _port_ (for example 7777), and then interact with the API using a _client_. The most common client is probably a web browser. So if we had an API running on port 7777, we could interact with it by navigating to `http://localhost:7777` in our browser. - -This is not so simple to do on Anvil, or at least not very enjoyable. While there are a variety of ways, the easiest is to use the "Desktop" app on https://ondemand.anvil.rcac.purdue.edu and use the provided editor and browser on the slow and clunky web interface. This is not ideal, and is what we want to avoid. - -Don't just take our word for it, try it out. Navigate to https://ondemand.anvil.rcac.purdue.edu and click on "Desktop" under "Interactive Apps". Choose the following: - -- Allocation: "cis220051" -- Queue: "shared" -- Wall Time in Hours: 1 -- Cores: 1 - -Then, click on the "Launch" button. Wait a minute and click on the "Launch Desktop" button when it appears. - -Now, lets copy over an example API and run it. - -. Click on Applications > Terminal Emulator -. Run the following commands: -+ -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 -cp -a /anvil/projects/tdm/etc/hithere $HOME -cd $HOME/hithere ----- -+ -. Then, find an unused port by running the following: -+ -[source,bash] ----- -find_port # 50087 ----- -+ -. In our example the output was 50087. Now run the API using that port (the port _you_ found). 
-+ -[source,bash] ----- -python3 -m uvicorn imdb.api:app --reload --port 50087 ----- - -Finally, the last step is to open a browser and check out the API. - -. Click on Applications > Web Browser -. First navigate to `localhost:50087` -. Next navigate to `localhost:50087/hithere/yourname` - -From here, your development process would be to modify the Python files, let the API reload with the changes, and interact with the API using the browser. This is all pretty clunky due to the slowness of the desktop-in-browser experience. In the remainder of this project we will setup something better. - -For this question, submit a screenshot of your work environment on https://ondemand.anvil.rcac.purdue.edu using the "Desktop" app. It would be best to include both the browser and terminal in the screenshot. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Before we begin, let's describe the setup we are going to use -- that way, you better understand what we are doing and why. - -In The Data Mine, one of our key goals is to be accessible. This means, we don't want anyone to rely on their own computing resources for anything. Otherwise, it can put students on unequal footing. For this reason, we make the extremely powerful Anvil cluster available to everyone in The Data Mine. However, like we just discussed, we don't currently have a great way to develop on Anvil. We are going to fix that. - -Web browsers are ubiquitous, and pretty much any personal computer you have can run one. In addition, https://code.visualstudio.com/[VS Code] is a free and open source text editor with a vibrant library of extensions. Like web browsers, VS Code can easily run on any personal computer. We are going to use these two tools, in combination with Anvil, to setup this development environment. - -VS Code and a browser (Chrome or Firefox would be best) are the only tools you will need to install on your own computer. We will connect VS Code to Anvil so your code lives on Anvil and even runs on Anvil. VS Code will automatically _forward ports_ to your local computer. This will allow you to use the browser on your local computer to access the server running on Anvil. This is a pretty cool setup, and will make your development experience much better! - -Install https://code.visualstudio.com/[VS Code] on your local machine. - -For this question, submit a screenshot of your local machine with a VS Code window open. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -As mentioned before, we are going to use VS Code on your _local_ machine to develop on Anvil. The answer is we are going to use a tool called `ssh` along with a VS Code extension to make this process seamless. - -Read through https://the-examples-book.com/starter-guides/unix/ssh[this] page in order to gain a cursory knowledge of `ssh` and how to create public/private key pairs. Generate a public/private key pair on your local machine and add your public key to Anvil. For convenience, we've highlighted the steps below for both Mac and Windows. - -**Mac** - -. Open a terminal window on your local machine. If you hold kbd:[Cmd+Space] and type "terminal" you should see the terminal app appear. -. In the terminal window, run the following command to generate a public/private key pair. -+ -[source,bash] ----- -ssh-keygen -a 100 -t ed25519 -f ~/.ssh/id_ed25519 ----- -+ -. 
Click enter twice to _not_ enter a passphrase (for convenience, if you want to follow the other instructions, and use an ssh agent, feel free). -. Display the public key contents, by running the following command. -+ -[source,bash] ----- -cat ~/.ssh/id_ed25519.pub ----- -+ -. Highlight the contents of the public key and copy it to your clipboard. For example, my public key looks like this. -+ ----- -ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPyj5eTyMIDOvlQdScPLn/s4SGLRuM//WXuW7mKYOYa8 ----- -+ -. Navigate to https://ondemand.anvil.rcac.purdue.edu and click on "Clusters" > "Anvil Shell Access". -. Once presented with a terminal, run the following. -+ -[source,bash] ----- -mkdir ~/.ssh -vim ~/.ssh/authorized_keys - -# press "i" (for insert) then paste the contents of your public key on a newline -# then press Ctrl+c, and type ":wq" to save and quit - -# set the permissions -chmod 700 ~/.ssh -chmod 644 ~/.ssh/authorized_keys -chmod 644 ~/.ssh/known_hosts -chmod 644 ~/.ssh/config -chmod 600 ~/.ssh/id_ed25519 -chmod 644 ~/.ssh/id_ed25519.pub ----- -+ -[NOTE] -==== -The `~/.ssh/authorized_keys` file is a special file where a newline-separated list of public keys are stored. If you have an associated private key on your local machine, you can use it to login to the machine _without_ typing a password. -==== -+ -. Now, confirm that it works by opening a terminal on your local machine and type the following. -+ -[source,bash] ----- -ssh username@anvil.rcac.purdue.edu ----- -+ -. Be sure to replace "username" with your _Anvil_ username, for example "x-kamstut". -. Upon success, you should be immediately connected to Anvil _without_ typing a password -- cool! - -**Windows** - -https://learn.microsoft.com/en-us/windows-server/administration/openssh/openssh_keymanagement[This] article may be useful. - -. Open a powershell by right clicking on the powershell app and choosing "Run as administrator". Note that you may have to search for "powershell" in the start menu. -. Run the following command to generate a public/private key pair. -+ -[source,powershell] ----- -ssh-keygen -a 100 -t ed25519 ----- -+ -. Click enter twice to _not_ enter a passphrase (for convenience, if you want to follow the other instructions, and use an ssh agent, feel free). -. We need to make sure the permissions are correct for your `.ssh` directory and the files therein, otherwise `ssh` will not work properly. Run the following commands in a powershell (again, make sure powershell is running as administrator by right clicking and choosing "Run as administrator"). -+ -[source,powershell] ----- -# from inside a powershell -# taken from: https://superuser.com/a/1329702 -New-Variable -Name Key -Value "$env:UserProfile\.ssh\id_ed25519" -Icacls $Key /c /t /Inheritance:d -Icacls $Key /c /t /Grant ${env:UserName}:F -TakeOwn /F $Key -Icacls $Key /c /t /Grant:r ${env:UserName}:F -Icacls $Key /c /t /Remove:g Administrator "Authenticated Users" BUILTIN\Administrators BUILTIN Everyone System Users -# verify -Icacls $Key -Remove-Variable -Name Key ----- -+ -. Display the public key contents by running the following command. -+ -[source,powershell] ----- -type $env:UserProfile\.ssh\id_ed25519.pub ----- -+ -. Highlight the contents of the public key and copy it to your clipboard. For example, my public key looks like this. -+ ----- -ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPyj5eTyMIDOvlQdScPLn/s4SGLRuM//WXuW7mKYOYa8 ----- -+ -. Navigate to https://ondemand.anvil.rcac.purdue.edu and click on "Clusters" > "Anvil Shell Access". -. 
Once presented with a terminal, run the following. -+ -[source,bash] ----- -mkdir ~/.ssh -vim ~/.ssh/authorized_keys - -# press "i" (for insert) then paste the contents of your public key on a newline -# then press Ctrl+c, and type ":wq" to save and quit - -# set the permissions -chmod 700 ~/.ssh -chmod 644 ~/.ssh/authorized_keys -chmod 644 ~/.ssh/known_hosts -chmod 644 ~/.ssh/config -chmod 600 ~/.ssh/id_ed25519 -chmod 644 ~/.ssh/id_ed25519.pub ----- -+ -[NOTE] -==== -The `~/.ssh/authorized_keys` file is a special file where a newline-separated list of public keys are stored. If you have an associated private key on your local machine, you can use it to login to the machine _without_ typing a password. -==== -+ -. Now, confirm that it works by opening a powershell on your local machine and typing the following. -+ -[source,powershell] ----- -ssh username@anvil.rcac.purdue.edu ----- -+ -. Be sure to replace "username" with your _Anvil_ username, for example "x-kamstut". -. Upon success, you should be immediately connected to Anvil _without_ typing a password -- cool! - -For this question, just include a sentence in a markdown cell stating whether or not you were able to get this working. If it is not working, the next question won't work either, so please post in Piazza for someone to help! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Finally, let's install the "Remote Explorer" **and** "Remote SSH" extension in VS Code. These extensions will allow us to connect to Anvil from VS Code and develop on Anvil from our local machine. You can find instructions for browsing and installing extensions https://code.visualstudio.com/docs/editor/extension-marketplace[here]. - -Once installed, you should see an icon on the left-hand side of VS Code that looks like a computer screen. Click on it. - -In the new menu on the left, click the little settings cog. Select the first option, which should be either `/Users/username/.ssh/config` (if on a mac) or `C:\Users\username\.ssh\config` (if on windows). This will open a file in VS Code. Add the following to the file: - -.mac config ----- -Host anvil - HostName anvil.rcac.purdue.edu - User username - IdentityFile ~/.ssh/id_ed25519 ----- - -.windows config ----- -Host anvil - HostName anvil.rcac.purdue.edu - User username - IdentityFile C:\Users\username\.ssh\id_ed25519 ----- - -[IMPORTANT] -==== -On Windows, make sure to replace "username" with your _Anvil_ username, for example "x-kamstut". Do this both for the "User" section and the "IdentityFile" section in the ssh config file. -==== - -Save the file and close out of it. Now, if all is well, you will see an "anvil" option under the "SSH TARGETS" menu. Right click on "anvil" and click "Connect to Host in Current Window". Wow! You will now be connected to Anvil! Try opening a file -- notice how the files are the files you have on Anvil -- that is super cool! - -Open a terminal in VS Code by pressing `Cmd+Shift+P` (or `Ctrl+Shift+P` on Windows) and typing "terminal". You should see a "Terminal: Create new terminal" option appear. Select it and you should notice a terminal opening at the bottom of your vscode window. That terminal is on Anvil too! Way cool! 
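As a quick sanity check, the `Host anvil` entry you added to your ssh config earlier in this question should also work from a regular terminal on your own computer: assuming the key from question 3 is in place, running `ssh anvil` locally should drop you onto an Anvil login node without a password.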
Run the api by running the following in the new terminal: - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 -cd $HOME/hithere -python3 -m uvicorn imdb.api:app --reload --port 50087 ----- - -If you are prompted something about port forwarding allow it. In addition open up a browser on your own computer and test out the following links: `localhost:50087` and `localhost:50087/hithere/bob`. Wow! VS Code even takes care of forwarding ports so you can access the API from the comfort of your own computer and browser! This will be extremely useful for the rest of the semester! - -For this question, submit a couple of screenshots demonstrating opening code on Anvil from VS Code on your local computer, and accessing the API from your local browser. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -There are tons of cool extensions and themes in VS Code. Go ahead and apply a new theme you like and download some extensions. - -For this question, submit a screenshot of your tricked out VS Code setup with some Python code open. Have some fun! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project03.adoc b/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project03.adoc deleted file mode 100644 index a1ba2c52f..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project03.adoc +++ /dev/null @@ -1,285 +0,0 @@ -= TDM 40200: Project 3 -- 2023 - -**Motivation:** Dashboards are everywhere -- many of our corporate partners' projects are to build dashboards (or dashboard variants)! Dashboards are used to interactively visualize some set of data. Dashboards can be used to display, add, remove, filter, or complete some customized operation to data. Ultimately, a dashboard is really a website focused on displaying data. Dashboards are so popular, there are entire frameworks designed around making them with less effort, faster. Two of the more popular examples of such frameworks are https://shiny.rstudio.com/[`shiny`] (in R) and https://dash.plotly.com/introduction[`dash`] (in Python). While these tools are incredibly useful, it can be very beneficial to take a step back and build a dashboard (or website) from scratch (we are going to utilize many powerfuly packages and tools that make this far from "scratch", but it will still be more from scratch than those dashboard frameworks). - -**Context:** This is the second in a series of projects focused around slowly building a dashboard. 
Students will have the opportunity to: create a backend (API) using `fastapi`, connect the backend to a database using `aiosql`, use the `jinja2` templating engine to create a frontend, use `htmx` to add "reactivity" to the frontend, create and use forms to insert data into the database, containerize the application so it can be deployed anywhere, and deploy the application to a cloud provider. Each week the project will build on the previous week, however, each week will be self-contained. This means that you can complete the project in any order, and if you miss a week, you can still complete the following project using the provided starting point. - -**Scope:** Python, dashboards - -.Learning Objectives -**** -- Create a simple backend server using `fastapi`. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -Start this project by opening VS Code and logging onto Anvil just like we did in the previous project. Once you have opened a session, make sure to pop open a terminal on Anvil inside VS Code by pressing kbd:[Cmd+Shift+P] (mac) or kbd:[Ctrl+Shift+P] (windows), and typing "Terminal: Create New Terminal".In the terminal, navigate to your `$HOME` directory, and open a terminal session. In the terminal session, run the following command: - -[source,bash] ----- -cd ----- - -Great. First thing is first. We need to create a new directory, that we will refer to as our "project directory" or "root directory". Call the directory `media_app`. One of our most complete databases is our `imdb` database, which you are all likely familiar with. For this reason, it makes the most sense to make our dashboard a media dashboard. - -[source,bash] ----- -mkdir media_app -cd media_app ----- - -Within our `media_app` directory, we want to organize all of our files and folders. First, create a `backend` directory. This is where we will keep the source code and critical files that we will use to run our backend. We will use the terms "backend", "webserver", "server", and "api" interchangeably. These are all very common terms for the same thing. - -[source,bash] ----- -mkdir backend -cd backend ----- - -We are going to start off slowly, so we can take our time and understand the critical parts pretty well. Create a new Python mondule called `main.py` and drop it in your `backend` directory, and open it up in VS Code. - -Drop the following code into `main.py`, and save it. - -[source,python] ----- -from fastapi import FastAPI <1> - -app = FastAPI() <2> - - -@app.get("/") <3> -async def hello(): <4> - return {"message": "Hello World"} ----- - -This is one of the simplest apis you can write with `fastapi`. Here are some descriptions, line-by-line. - -<1> We are importing the `FastAPI` class from the `fastapi` package. This is the class that we will use to create our api. -<2> We are creating an instance of the `FastAPI` class, called `app`. This is the object that we will use to run our api, and assign functions to endpoints. -<3> We are using the `app` object to assign a function to the `/` endpoint. This means that when a user visits the root of our api, they will be served the output of the function. For example, if our api is running at `http://localhost:8000/`, then when a user visits `http://localhost:8000/`, they will be served the output of the `hello` function. -<4> The `hello` function. 
This function is very simple and returns a dict with _key_ "message" and _value_ "Hello World". This is the output that will be served to the user when they visit the root of our api. `fastapi` will automatically convert the dict to JSON. When displayed in the browser, JSON data will be displayed in a human-readable format, that looks distinctly different than when you'd visit a regular web page like https://datamine.purdue.edu. - -This is all great, but we need to _run_ our api in order to see it in action. To do this, we need to execute our `main.py` module. In order to run the `main.py` module, we need to make sure the `fastapi` package is available in our Python environment, otherwise we will receive an error when we try to import it. Luckily, we already have `fastapi` installed in our `f2022-s2023` environment, we just need to load it up. To load up our `f2022-s2023` environment, please run the following in the terminal. - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 ----- - -Before running the commands above, if we were to run `which python3` in the terminal, it would point to our _system_ Python on Anvil, which we don't have any control over -- `/usr/bin/python3`. After running the commands above, running `which python3` will point to a shell function which executes our Python code in a singularity container. This version of Python is loaded up with _lots_ of packages, ready for you to use, including `fastapi`. - -Normally, to run a Python module, we would just run a command like: - -[source,bash] ----- -python3 main.py ----- - -However, we are not using the _Python_ interpreter to run our module, we are using a Python ASGI web server called https://github.com/encode/uvicorn[uvicorn]. Uvicorn is a very popular web server, and is used by many popular Python web frameworks, including `fastapi`. To run our module with uvicorn, we need to run the following command. - -[source,bash] ----- -python3 -m uvicorn main:app --reload ----- - -When running a server, you must choose a _port_ to run the server on. A _port_ can have a value from 1 to 65535, however, many of those ports are reserved for certain programs. Each port can only have a program utilizing UDP and TCP protocols on them. Only 1 TCP and 1 UDP per port. By default, `fastapi` will choose to run your server on port 8000, using the TCP protocol. If someone else happens to be on the same Anvil node as you, and is using port 8000 (a rather popular port number), you will receive an error -- that port is already in use. - -We've created a script to print out an unused port. Run the following command. - -[source,bash] ----- -find_port ----- - -.potential output ----- -39937 ----- - -Now, we can use extra `uvicorn` arguments to specify the port we want to run out app on. - -[source,bash] ----- -python3 -m uvicorn main:app --reload --port 39937 ----- - -Now, we can visit our api in the browser. Visit `http://localhost:39937/` in your browser. - -[NOTE] -==== -You may have to click "allow" in a VS Code popup asking about forwarding ports. This just makes it so you can go to port 39937 (for example) on your _own_ computer's browser, and you will essentially be on Anvil's port 39937. -==== - -You should see the following output. - -image::figure39.webp[Expected output, width=792, height=500, loading=lazy, title="Expected output"] - -Great! What does each of the parts of the command mean? 
[source,bash]
----
python3 -m uvicorn main:app --reload --port 39937
----

The `python3 -m uvicorn` part is just a way to access and run the installed `uvicorn` program.

The `main:app` part tells `uvicorn` which module and which object to run. In this case, we are telling `uvicorn` to run the `app` object in the `main.py` module. If your current working directory when running the command was `media_app` instead of `backend`, you would need to run a slightly modified command.

[source,bash]
----
python3 -m uvicorn backend.main:app --reload --port 39937
----

Here, `backend.main` refers to the `main.py` module inside the `backend` directory. To reiterate, the `app` object is the object that we are using to run our api, and assign functions to endpoints. For instance, if we modified our code to be the following.

[source,python]
----
from fastapi import FastAPI

my_app = FastAPI()


@my_app.get("/")
async def hello():
    return {"message": "Hello World"}
----

Then, we would have to modify our command to be the following.

[source,bash]
----
python3 -m uvicorn backend.main:my_app --reload --port 39937
----

The `--port` option is more obvious -- it picks which port we want to run the server on.

Finally, the `--reload` flag tells `uvicorn` to reload the server whenever we make a change to our code. This is very useful for development, but should be removed when we are ready to deploy our app. Let's test it out. **While your app is still running**, change the _key_ of the returned dict from "message" to "communique". Save the `main.py` file, and refresh your browser. You should see the following output, and you didn't even need to restart the server!

image::figure40.webp[Expected output, width=792, height=500, loading=lazy, title="Expected output"]

This is useful. It means that you'll typically just need to run the server and keep it running as you develop your api.

To verify that you mostly understand all of this, please provide the command you would use to run the backend if your current working directory was your `$HOME` directory. Put your solution in a markdown cell in a Jupyter Notebook.

.Items to submit
====
- Code used to solve this problem.
- Output from running the code.
====

=== Question 2

In the previous question, we assigned the `hello` function to the `/` endpoint. That way, when you are running the server and navigate to `http://localhost:39937/`, you will see our "Hello World" message. Let's add more endpoints to our backend, so we can better understand how this works.

First, instead of accessing our "Hello World" message by going to `http://localhost:39937/`, let's access it by going to `http://localhost:39937/hello`. Make the required modification and demonstrate that it works. Submit a screenshot of your browser and the response from the server. The screenshot should include the URL and the response -- just like my screenshots in the previous question did. Forget how to include an image in your notebook? See https://the-examples-book.com/projects/current-projects/templates#including-an-image-in-your-notebook[here].

.Items to submit
====
- Code used to solve this problem.
- Output from running the code.
====

=== Question 3

Fantastic. Let's add another endpoint. This time, let's add an endpoint that takes a name and returns a message, just like before, except now instead of "Hello World", the message should be "Hello NAME", where NAME is the name that was passed in.
For example, if we passed in the name "drward" to the endpoint, we would expect the following. - -image::figure41.webp[Expected output, width=792, height=500, loading=lazy, title="Expected output"] - -[TIP] -==== -https://fastapi.tiangolo.com/tutorial/path-params/[This] does an excellent job explaining how to add a path parameter, and what is happening. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -In the previous question, you stripped the _path parameter_, and passed it to your function. For example, if you had the following code. - -[source,python] ----- -@app.get("/some/endpoint/{some_argument}") -async def some_function(some_argument: str): - return {"output": some_argument} ----- - -If you navigated to `http://localhost:39937/some/endpoint/this_is_my_argument`, you would see the following output. - -.output ----- -{"output": "this_is_my_argument"} ----- - -The function receives the value from the `@app.get("/some/endpoint/{some_argument}")` line, and it passes the value, in our example, "this_is_my_argument" to the `some_function` function. This equates to something like the following. - -[source,python] ----- -some_function("this_is_my_argument") ----- - -That function simply returns the dict, which is quickly transformed to JSON and returned as the response. - -Well, not _all_ values need to be passed _from_ the URL _to_ some function through _path parameters_. _Path parameters_ are typically used when the information you want to pass through a path parameter has something to do with the structure of the data. For example, our endpoint `/hello/NAME` doesn't make a whole lot of sense. Names are not unique, and if we had multiple drwards, we couldn't access both of their information from the same endpoint. However, if you had something like `/users/123/hello`, then it would make sense. The `123` could be a unique identifier for a user, and the `hello` endpoint could return a customized hello message for that specific user. - -If you wanted an endpoint to say hello to any old person -- not necessarily to a individual in your database, for instance, then there is another way that makes lots more sense that using a custom endpoint like `/hello/NAME`. - -Instead, you can use a _query parameter_. A _query parameter_ is a parameter that is passed through the URL, but is not part of the path. https://fastapi.tiangolo.com/tutorial/query-params/[This] does an excellent job explaining what a query parameter is, and how to use them in `fastapi`. - -Update your `/hello/` endpoint to accept a query parameter called `name`. The endpoint should still return the same message, but this time it should use the query parameter instead of the path parameter. Demonstrate that it works by submitting a screenshot of your browser and the response from the server. The screenshot should include the URL and the response. Below is an example, passing "drward" -- please choose a different name for your example. - -image::figure42.webp[Expected output, width=792, height=500, loading=lazy, title="Expected output"] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -So, you've typed 'http://localhost:39937/hello?name=drward' into your browser, and you've seen the message "Hello drward". That's great, but it is time to define some concepts. When you type that URL into your browser and hit enter, what is happening? Your browser makes a `GET` _request_. 
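You can make the very same request outside of a browser, which sometimes makes it easier to see what is going on. The sketch below uses the `httpx` package (the same package used later in the semester); the port number is just the example value from above, and the exact JSON you get back depends on how you built your endpoint.

[source,python]
----
import httpx

# the same GET request the browser makes when you visit
# http://localhost:39937/hello?name=drward
resp = httpx.get("http://localhost:39937/hello", params={"name": "drward"})

print(resp.status_code)    # e.g. 200
print(dict(resp.headers))  # the response headers sent back by the server
print(resp.json())         # e.g. {"message": "Hello drward"}
----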
`GET` is one of the https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods[`HTTP` methods]. `GET` is used to retrieve information from a server. In this case, we are retrieving information from our server. This information just happens to be some JSON with a "Hello World" message. - -The "Hello World" message is part of the _response_. The _response_ is the information that is returned from the server. A https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages#http_responses[_response_] has three primary components: a status line, one or more https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers[_headers_], and a https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages#body_2[_body_]. - -Use your `/hello?name=person` endpoint. At the top of your browser, you should see "JSON" (what you are used to seeing), "Raw Data" and "Headers". In a markdown cell, copy and paste the "Raw Data" (your response _body_), and the "Headers" (your response _headers_). - -Next, modify your `hello` endpoint to return an additional response _header_ with _key_ "attitude" and _value_ "sassy". In addition, change the https://developer.mozilla.org/en-US/docs/Web/HTTP/Status[status code] from 200 to 501. Demonstrate that it works by submitting screenshots of your browser and the response from the server. The screenshot should include the URL and the response. Below is an example, passing "drward" -- please choose a different name for your example. - -[TIP] -==== -Dig around in the https://fastapi.tiangolo.com/[offical docs] to figure out how to add a response header, and how to change the status code. -==== - -[TIP] -==== -In order to see the status code from the browser, you will need to open the Inspector and click on "Console". You may need to make the request again (refresh the web page) in order to see the status code. -==== - -image::figure43.webp[Response body example, width=792, height=500, loading=lazy, title="Response body example"] - -image::figure44.webp[Response headers example, width=792, height=500, loading=lazy, title="Response headers example"] - -image::figure45.webp[Status code example, width=792, height=500, loading=lazy, title="Status code example"] - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project04.adoc b/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project04.adoc deleted file mode 100644 index 3b6e754ba..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project04.adoc +++ /dev/null @@ -1,226 +0,0 @@ -= TDM 40200: Project 4 -- 2023 - -**Motivation:** Dashboards are everywhere -- many of our corporate partners' projects are to build dashboards (or dashboard variants)! Dashboards are used to interactively visualize some set of data. Dashboards can be used to display, add, remove, filter, or complete some customized operation to data. Ultimately, a dashboard is really a website focused on displaying data. 
Dashboards are so popular, there are entire frameworks designed around making them with less effort, faster. Two of the more popular examples of such frameworks are https://shiny.rstudio.com/[`shiny`] (in R) and https://dash.plotly.com/introduction[`dash`] (in Python). While these tools are incredibly useful, it can be very beneficial to take a step back and build a dashboard (or website) from scratch (we are going to utilize many powerfuly packages and tools that make this far from "scratch", but it will still be more from scratch than those dashboard frameworks). - -**Context:** This is the third in a series of projects focused around slowly building a dashboard. Students will have the opportunity to: create a backend (API) using `fastapi`, connect the backend to a database using `aiosql`, use the `jinja2` templating engine to create a frontend, use `htmx` to add "reactivity" to the frontend, create and use forms to insert data into the database, containerize the application so it can be deployed anywhere, and deploy the application to a cloud provider. Each week the project will build on the previous week, however, each week will be self-contained. This means that you can complete the project in any order, and if you miss a week, you can still complete the following project using the provided starting point. - -**Scope:** Python, dashboards - -.Learning Objectives -**** -- Continue to develop skills and techniques using `fastapi` to build a backend. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -In the previous project we covered some of the most basic but important parts of a backend. It is time to take a break from `fastapi` -- before continuing on, it is important that we introduce probably the most critical component of a dashboard or web app -- the database. - -In this project we will be using `sqlite3` as our database. `sqlite3` is very simple to use, but still extremely powerful. Typically, however, an instance of `postgresql` or `mysql/mariadb` is more common. However, for this project, we will be using `sqlite3` as it is the easiest to get started with. We _will_ however explain the steps that would be needed if we were using either `postgresql` or `mariadb`. - -Our code is Python code. There are a lot of Python packages that can be used to interact with the databases we've mentioned. We will be using https://nackjicholson.github.io/aiosql/[`aiosql`] as it is a very straightforward package that allows us to write SQL queries in a `.sql` file, and then use those queries in our Python code. This is quite different than most of the other Python tools. Most of the Python tools -- like those used in Django, or `sqlalchemy` or `peewee` -- require us to write our SQL queries in Python code, and use special methods to execute those queries. While this isn't _bad_, and in fact, it can be very very good, however, it can be easier to maintain a project if we separate our SQL queries from our Python code -- this is what `aiosql` _largely_ let's us accomplish. - -In this project, we will learn how to use `aiosql`. - -Get started by opening up VS Code and connecting to Anvil, just as we have in the previous projects. One database you've used many times before is our `imdb` database. 
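If you'd like a quick reminder of what the `titles` table in that database looks like before wiring up `aiosql`, a short peek with the plain `sqlite3` module works fine. This snippet is purely illustrative; the project itself will have you run queries through `aiosql`.

[source,python]
----
import sqlite3

# peek at the course copy of the imdb database (we only read from it here)
conn = sqlite3.connect("/anvil/projects/tdm/data/movies_and_tv/imdb.db")
cur = conn.cursor()
cur.execute("SELECT * FROM titles LIMIT 5")
for row in cur.fetchall():
    print(row)
conn.close()
----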
Please create a copy of this database in your `$SCRATCH` directory. You can do this by running the following command. - -[source,bash] ----- -mkdir $SCRATCH/p4 -cp /anvil/projects/tdm/data/movies_and_tv/imdb.db $SCRATCH/p4 ----- - -In addition, create two additional files. - -[source,bash] ----- -touch $SCRATCH/p4/queries.sql -touch $SCRATCH/p4/project04.py ----- - -Finally, open `project04.py` and add the following code. - -[source,python] ----- -def main(): - print("Hello World!") - -if __name__ == "__main__": - main() ----- - -Make sure things are working by loading up our Python environment and running `project04.py`. - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 - -cd $SCRATCH/p4 -python3 project04.py ----- - -Capture a screenshot of the resulting output and include it in your Jupyter Notebook for submission. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -https://nackjicholson.github.io/aiosql/[Here] are the official docs for `aiosql`. Please reference them as needed. - -The first step, no matter what database you are using, is to load up the queries from our (currently empty) `queries.sql` file. Please glance over https://nackjicholson.github.io/aiosql/database-driver-adapters.html[this] page. - -As you can see, `aiosql` supports a number of different database drivers: `sqlite3`, `apsw`, `psycopg`, `psycopg2`, `pymysql`, etc. Please follow the following steps to load up our queries, and establish a database connection. - -. Import `aiosql` and `sqlite3`. In this case, `sqlite3` is the database driver we are using. If you were instead using `postgresql`, you would likely import `psycopg2` instead. -. Next, you need to make a call to `aiosql` `from_path` method. This method takes two arguments -- the first is a string that describes a path to the file containing our queries, in our case, `queries.sql`. In this case, since our `project04.py` module and `queries.sql` files are in the same directory, this value can simply be "queries.sql". The second argument is the database driver we are using. In our case, this is `sqlite3`. If we were using `postgresql` and `psycopg2`, this would be `psycopg2`. You can name the resulting variable anything you want. For clarity, I tend to prefer `queries`. This resulting `queries` object will contain a _method_ for each and every query we have in our `queries.sql` file. We will explain this more, later. -. Finally, when making a query using `aiosql`, we need to establish a database connection object and pass that object along to each query we call. To establish a database connection, you need to follow the instructions for your database driver. In our case, we are using `sqlite3`. So, I would search the internet for "establish connection sqlite3 python" and find the following results: https://docs.python.org/3/library/sqlite3.html. We can very clearly see, that to establish a connection, we can run the following code. -+ -[source,python] ----- -import sqlite3 -conn = sqlite3.connect("imdb.db") ----- -+ -Of course, we need to make sure that `imdb.db` is in the same directory as our `project04.py` module. If it isn't, we would need to adjust the _path_ of the database accordingly. Here, the resulting `conn` is our _connection object_. We will need to pass this object to every query we make using `aiosql` -- it will always be the first argument. 
-+ -[NOTE] -==== -To create a connection using `psycopg2`, for example, this would look a bit different. - -[source,python] ----- -import psycopg2 -conn = psycopg2.connect(host="my.db.location.example.com", database="mydbname", user="myusername", password="mypassword", port=5432) ----- - -Here, we would have to specify more details as `postgresql` is a client/server database and we need to authenticate. In addition, we have to specify _where_ (the host) the database is hosted and what port it is listening on. -==== - -Finally, its time to put all of this information to use! Carefully read https://nackjicholson.github.io/aiosql/defining-sql-queries.html[this] page. In your `queries.sql` file, write a query called `get-five-titles` that runs a `SELECT` query returning 5 titles. Update your `main` function to load your queries from the `queries.sql` file, establishes a connection to the `imdb.db` `sqlite3` database, and executes the newly created query, printing the results. From the terminal, run the updated `project04.py` module and capture a screenshot of the resulting output and include it in your Jupyter Notebook for submission. - -[TIP] -==== -If all went well you should end up with something like: - -.output ----- -[('tt0000001', 'short', 'Carmencita', 'Carmencita', 0, 1894, None, 1, 'Documentary,Short'), ('tt0000002', 'short', 'Le clown et ses chiens', 'Le clown et ses chiens', 0, 1892, None, 5, 'Animation,Short'), ('tt0000003', 'short', 'Pauvre Pierrot', 'Pauvre Pierrot', 0, 1892, None, 4, 'Animation,Comedy,Romance'), ('tt0000004', 'short', 'Un bon bock', 'Un bon bock', 0, 1892, None, 12, 'Animation,Short'), ('tt0000005', 'short', 'Blacksmith Scene', 'Blacksmith Scene', 0, 1893, None, 1, 'Comedy,Short')] ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Next, write a new query called `get-title-by-id` that takes a single argument, `title_id`, and returns the title (and only the `primary_title`) with the matching `title_id`. Update your `main` function to load your queries from the `queries.sql` file, establishes a connection to the `imdb.db` `sqlite3` database, and executes the newly created query, printing the results. From the terminal, run the updated `project04.py` module and capture a screenshot of the resulting output and include it in your Jupyter Notebook for submission. - -[TIP] -==== -Here are some example queries with expected output. - -[source,python] ----- -results = queries.get_title_by_id(conn, title_id="tt4236770") -print(results) ----- - -.expected output ----- -[('Yellowstone',)] ----- - -[source,python] ----- -results = queries.get_title_by_id(conn, title_id="tt0108778") -print(results) ----- - -.expected output ----- -[('Friends',)] ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Carefully read https://nackjicholson.github.io/aiosql/defining-sql-queries.html#operators[this section] if you haven't already. Now it is time to insert a new title into our `titles` table! - -Write a new query called `insert-title` that takes the following arguments: `title_id`, `type`, `primary_title`, `original_title`, `is_adult`, `premiered`, `ended`, `runtime_minutes`, and `genres`. The query should insert a new row into the `titles` table with the provided values. - -Use your new query to insert the following title into the `titles` table: https://www.imdb.com/title/tt3581920/. 
Make sure `title_id` is `tt3581920`, however, if you can't find any of the other pieces of data, feel free to make them up. - -Test out your new query from within your `main` function. From the terminal, run the updated `project04.py` module. Be sure to use the `get_title_by_id` method to fetch and print the newly added title to confirm your `INSERT` worked properly. Capture a screenshot of the resulting output and include it in your Jupyter Notebook for submission. - -[TIP] -==== -Example output. - -[source,python] ----- -result = queries.insert_title(conn, title_id="tt3581920", ...) -print(result) ----- - -.expected output ----- -1 ----- - -[source,python] ----- -result = queries.get_title_by_id(conn, title_id="tt3581920") -print(result) ----- - -.expected output ----- -[('The Last of Us',)] ----- -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Great job! I hope you will start to see the advantages of having all your queries in a single place. Write an additional query using a different https://nackjicholson.github.io/aiosql/defining-sql-queries.html#operators[operator] than you've used so far. Demonstrate that your query functions as it should by executing it from within your `main` function. From the terminal, run the updated `project04.py` module and capture a screenshot of the resulting output and include it in your Jupyter Notebook for submission. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project05.adoc b/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project05.adoc deleted file mode 100644 index 877cb36d8..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project05.adoc +++ /dev/null @@ -1,230 +0,0 @@ -= TDM 40200: Project 5 -- 2023 - -**Motivation:** Dashboards are everywhere -- many of our corporate partners' projects are to build dashboards (or dashboard variants)! Dashboards are used to interactively visualize some set of data. Dashboards can be used to display, add, remove, filter, or complete some customized operation to data. Ultimately, a dashboard is really a website focused on displaying data. Dashboards are so popular, there are entire frameworks designed around making them with less effort, faster. Two of the more popular examples of such frameworks are https://shiny.rstudio.com/[`shiny`] (in R) and https://dash.plotly.com/introduction[`dash`] (in Python). While these tools are incredibly useful, it can be very beneficial to take a step back and build a dashboard (or website) from scratch (we are going to utilize many powerfuly packages and tools that make this far from "scratch", but it will still be more from scratch than those dashboard frameworks). - -**Context:** This is the fourth in a series of projects focused around slowly building a dashboard. 
Students will have the opportunity to: create a backend (API) using `fastapi`, connect the backend to a database using `aiosql`, use the `jinja2` templating engine to create a frontend, use `htmx` to add "reactivity" to the frontend, create and use forms to insert data into the database, containerize the application so it can be deployed anywhere, and deploy the application to a cloud provider. Each week the project will build on the previous week, however, each week will be self-contained. This means that you can complete the project in any order, and if you miss a week, you can still complete the following project using the provided starting point. - -**Scope:** Python, dashboards - -.Learning Objectives -**** -- Continue to develop skills and techniques using `fastapi` to build a backend. -- Learn how to use `pydantic` for data validation and type hints. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -In the previous projects, our functions would typically return a `dict`, which, if we paired it with a `JSONResponse`, would return a clean JSON object, displayed in the browser. - -However, crafting every response in this manner is not a great idea. API's need to be consistent and predictable. It is very easy to make a mistake and return a value that is the wrong type, or a value that is not expected. This is where `pydantic` can greatly help us out. `fastapi` is structured specifically to work with `pydantic` models. In addition, one of the _most_ critical parts of any application is the data model. One should take a lot of time considering how data is structured and flows through your application. Working with `pydantic` and `fastapi` will help you to do this. - -[TIP] -==== -https://docs.pydantic.dev/[Here] are the official docs for `pydantic`. -https://fastapi.tiangolo.com/python-types/#python-types-intro[Here] is the section of the `fastapi` docs that discusses Python types as well as `pydantic` models (towards the bottom of the page). -==== - -[NOTE] -==== -In both the `pydantic` and `fastapi` docs, you will sometimes have the choice of choosing which version of Python you are using. Please choose "Python 3.10 and above", however, due to the version of `pydantic` we are currently using, you may have to choose "Python 3.7 and above". For example, in "Python 3.10 and above" you should be able to have something like this. - -[source,python] ----- -class User(BaseModel): - id: int | str ----- - -Here, the `id` field can be either an `int` or a `str`. However, due to the version of `pydantic` we are using, this behavior isn't supported. However, in "Python 3.7 and above", you will have to do something like this. - -[source,python] ----- -from typing import Union - -class User(BaseModel): - id: Union[int, str] ----- - -This would work using our `f2022-s2023` Python. The point here is, if there is an error saying something about "some type not supported" -- please try an "older" method of doing things. -==== - -Use the code below as a starting point. Create a `pydantic` model to handle titles like we would have in our `titles` table from the previous project. Unpack the following set of data into the `pydantic` model. What happens when you try to load it into a `Title` object? 
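As a quick aside before you build your own model, here is the `Union` idea from the note above in action. It uses the toy `User` model, not the `Title` model you are asked to create, so it does not give anything away.

[source,python]
----
from typing import Union

from pydantic import BaseModel, ValidationError


class User(BaseModel):
    id: Union[int, str]


print(User(id=5))         # an int is accepted
print(User(id="abc123"))  # a str is accepted too

try:
    User(id=[1, 2, 3])    # neither an int nor a str, so validation fails
except ValidationError as err:
    print(err)
----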
Modify your `pydantic` `Title` model to accept the data. - -For this project, you can use Jupyter Lab. No need to use our VS Code setup. Please make sure to run all cells so the results are displayed. - -[source,python] ----- -# create pydantic model for titles here. - -def main(): - # load data into pydantic model here. - -if __name__ == "__main__": - main() ----- - -.first set of data ----- -first = {"title_id": "tt3581920", "type": "tvseries", "primary_title": "The Last of Us", "original_title": "The Last of Us", "is_adult": False, "premiered": 2023, "ended": None, "runtime_minutes": 60, "genres": "Action,Adventure,Drama"} ----- - -Great! There are a lot of ways you can craft your `pydantic` models. You can make certain fields "optional" where the value can either be _some type_ or `None`. You can use https://docs.pydantic.dev/usage/types/#unions[Unions] to specify multiple valid types. You can even specify good default values! - -[TIP] -==== -Hint hint: https://docs.pydantic.dev/usage/types/#unions[Here] is a link to the docs for `Unions`, which will be useful for loading up the first set of data. -==== - -Try loading the following set of data into your `Title` type. Pay close attention to the `is_adult` field _before_ and _after_ you load the data into the `Title` type. Same for the `premiered` field. Do your best to explain what is happening. - -.second set of data ----- -second = {"title_id": "tt3581920", "type": "tvseries", "primary_title": "The Last of Us", "original_title": "The Last of Us", "is_adult": 0, "premiered": "2023", "ended": None, "runtime_minutes": 60, "genres": "Action,Adventure,Drama"} ----- - -Finally, `pydantic` models _validate_ your data -- this means that you'll get a very nice description of _why_ your data is incorrect, if it is incorrect. Try loading the following set of data into your `Title` type. Does it give you an easy to understand error message? - -.third set of data ----- -third = {"title_id": "tt3581920", "type": "tvseries", "primary_title": "The Last of Us", "original_title": "The Last of Us", "is_adult": 0, "premiered": "2023", "ended": None, "runtime_minutes": "60 minutes", "genres": "Action,Adventure,Drama"} ----- - -[TIP] -==== -The very first code example https://docs.pydantic.dev/[here] will demonstrate how to take a `dict` and load it into a `pydantic` model. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -As you may have gathered after experimenting with `pydantic` in the previous question, `pydantic` will _try_ to convert to the desired, correct type, if possible. Otherwise you will "fail fast" and receive a nice, detailed error message. If you didn't use a tool like `pydantic`, a customer using your API may receive some _very_ unexpected behavior. For example, if your API would normally return an integer, but for some reason it returned a string instead, your customer's code, which could be written in a completely different programming language, could break. This is why it is important to validate your data. - -Take the following set of data containing the title info for "The Last of Us". - -[source,python] ----- -first = {"title_id": "tt3581920", "type": "tvseries", "primary_title": "The Last of Us", "original_title": "The Last of Us", "is_adult": False, "premiered": 2023, "ended": None, "runtime_minutes": 60, "genres": "Action,Adventure,Drama"} ----- - -While you built a `pydantic` model to handle this data, your model is _likely_ not ideal, yet. 
Take a look at the `genres`. In our example it is: "Action,Adventure,Drama". However, the way our data is stored it could also be "Drama,Adventure,Action" or "Action,Romance", or any combination of a variety of different genres. `genres` is really a _list_, not a string. Why don't we build up our data model to handle this? - -Modify your `Title` model so that `genres` is a _list_ of `str`. Take the `first` `dict` above, and make any modifications that are needed so the data is loaded into the `Title` model correctly. Once you have done this, print out the `Title` object to show that it is working correctly. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -So far so good. While this project may be underwhelming in terms of a "wow" factor -- we are just messing around with data and types -- it is very important, and a good habit to practice. Using tools that validate your data will save you a lot of time and headaches in the future. - -Well, our plan is to utilize `pydantic` as a part of our backend, right? Well, where will our data come from? Our database! What are we using to get data from our database? `aiosql`! The next task is to use `aiosql` to load data from our database, and then use `pydantic` to convert that data into a `Title` object. - -Start by establishing a connection to the database, and making a query. - -[source,ipython] ----- -%%bash - -cp /anvil/projects/tdm/data/movies_and_tv/imdb.db $SCRATCH ----- - -.queries.sql ----- --- name: get-title-by-id --- Given a title id, return the matching title. -SELECT * FROM titles WHERE title_id=:title_id; ----- - -[source,python] ----- -import aiosql -import sqlite3 - -queries = aiosql.from_path("queries.sql", "sqlite3") -conn = sqlite3.connect("/anvil/scratch/x-kamstut/imdb.db") # replace x-kamstut with your username - -results = queries.get_title_by_id(conn, title_id="tt0108778") ----- - -Now, take `results` and convert it to a `Title` `pydantic` model. Print out the `Title` object to show that it is working correctly. - -[TIP] -==== -First, you will want to end up creating a `dict` where the keys are the same as the keys in the `Title` model. The follow code is a way to access the keys of the `Title` model. - -[source,python] ----- -Title.__fields__.keys() ----- -==== - -[TIP] -==== -Don't forget to convert the `genres` field to a list of strings. You can use the `split` method to do this. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -`pydantic` makes it easy to export your data to a variety of useful formats. Take your resulting `Title` object from the previous question, and demonstrate converting the model to a `dict`, a `json` string, and finally, demonstrate saving the model using the `pickle` package. Be sure to print out the results of each conversion. - -[TIP] -==== -There is a https://docs.pydantic.dev/usage/exporting_models/[whole page] about this functionality in the documentation. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Finally, one other useful feature of `pydantic`, is the ability to write _custom_ validators for your data. For example, if you wanted to make sure that the `premiered` date was before the `ended` date, you could write a custom validator to do this. In fact, this is exactly what we are going to do! 
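To give you a feel for the mechanics before you read the docs, here is the general shape of a custom validator, shown on a made-up constraint (a positive runtime) rather than the `sane_dates` check you are asked to write. The `validator` decorator style below should match the version of `pydantic` in our `f2022-s2023` environment.

[source,python]
----
from pydantic import BaseModel, ValidationError, validator


class Movie(BaseModel):
    primary_title: str
    runtime_minutes: int

    @validator("runtime_minutes")
    def runtime_must_be_positive(cls, value):
        # a validator receives the parsed value and either returns it
        # (possibly modified) or raises an error to reject the data
        if value <= 0:
            raise ValueError("runtime_minutes must be positive")
        return value


print(Movie(primary_title="Carmencita", runtime_minutes=1))

try:
    Movie(primary_title="Broken", runtime_minutes=-10)
except ValidationError as err:
    print(err)
----

A validator can also receive the values of previously validated fields, which is what you will want when comparing `premiered` and `ended`; the page linked below shows how.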
- -Read https://docs.pydantic.dev/usage/validators/[this page] in the documentation. Update your `Title` model to include a custom validator called `sane_dates` that will check that the `premiered` date is before the `ended` date. Test out your validator by attempting to load the following two sets of data into a `Title` object. The first one should fail with a clear message, and the last one should succeed. Be sure to include the output in your notebook cells. - -.failure data ----- -failure = {"title_id": "tt3581920", "type": "tvseries", "primary_title": "The Last of Us", "original_title": "The Last of Us", "is_adult": False, "premiered": 2023, "ended": 2000, "runtime_minutes": 60, "genres": "Action,Adventure,Drama".split(",")} ----- - -.success data ----- -success = {"title_id": "tt3581920", "type": "tvseries", "primary_title": "The Last of Us", "original_title": "The Last of Us", "is_adult": False, "premiered": 2023, "ended": 2030, "runtime_minutes": 60, "genres": "Action,Adventure,Drama".split(",")} ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project06.adoc b/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project06.adoc deleted file mode 100644 index f780ffa17..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project06.adoc +++ /dev/null @@ -1,163 +0,0 @@ -= TDM 40200: Project 6 -- 2023 - -**Motivation:** Dashboards are everywhere -- many of our corporate partners' projects are to build dashboards (or dashboard variants)! Dashboards are used to interactively visualize some set of data. Dashboards can be used to display, add, remove, filter, or complete some customized operation to data. Ultimately, a dashboard is really a website focused on displaying data. Dashboards are so popular, there are entire frameworks designed around making them with less effort, faster. Two of the more popular examples of such frameworks are https://shiny.rstudio.com/[`shiny`] (in R) and https://dash.plotly.com/introduction[`dash`] (in Python). While these tools are incredibly useful, it can be very beneficial to take a step back and build a dashboard (or website) from scratch (we are going to utilize many powerfuly packages and tools that make this far from "scratch", but it will still be more from scratch than those dashboard frameworks). - -**Context:** This is the fifth in a series of projects focused around slowly building a dashboard. Students will have the opportunity to: create a backend (API) using `fastapi`, connect the backend to a database using `aiosql`, use the `jinja2` templating engine to create a frontend, use `htmx` to add "reactivity" to the frontend, create and use forms to insert data into the database, containerize the application so it can be deployed anywhere, and deploy the application to a cloud provider. Each week the project will build on the previous week, however, each week will be self-contained. 
This means that you can complete the project in any order, and if you miss a week, you can still complete the following project using the provided starting point. - -**Scope:** Python, dashboards - -.Learning Objectives -**** -- Continue to develop skills and techniques using `fastapi` to build a backend. -- Learn how to use `pydantic` for data validation and type hints. -- Learn how `fastapi` and `pydantic` work together to create endpoints that validate data and return typed responses. -- Create a directory structure to assemble a `fastapi` backend. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -Create a directory in your `$SCRATCH` directory called `imdb`. This will be the "monorepo" that holds your source code for your dashboard -- the frontend, the backend, and any other assets we may end up with. Create all the empty files and folders so your directory structure looks like the following. - -.project directory structure ----- -imdb -├── backend -│   ├── api -│   │   ├── api.py -│   │   ├── database.py -│   │   ├── imdb.db -│   │   ├── pydantic_models.py -│   │   └── queries.sql -│   ├── pyproject.toml -│   ├── README.md -│   └── templates -├── frontend -└── README.md - -4 directories, 8 files ----- - -Once complete, demonstrate you've created everything properly by running the following in Jupyter Notebook cell. - -[source,ipython] ----- -%%bash - -tree $SCRATCH/imdb ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Okay! You are all set now. You have a sane project structure, and you (hopefully) learned all you needed to perform this next task. - -First, we will go ahead and give you the contents of `database.py`. - -[source,python] ----- -import os -import aiosql -import sqlite3 -from dotenv import load_dotenv -from pathlib import Path - -load_dotenv() - -database_path = Path(__file__).parents[0] / "imdb.db" -queries = aiosql.from_path(Path(__file__).parents[0] / "queries.sql", "sqlite3") ----- - -That way, from another Python module, you can import and use `database_path` and `queries` to interact with the database. - -[source,python] ----- -from .database import database_path, queries ----- - -Implement you backend and create the following endpoints in your `api.py` file: - -. `GET /titles/{title_id}` -- returns the title information as a `Title` object for the `Title` with the given `title_id`. -. `GET /cast/{title_id}` -- returns the cast information as a `Cast` object for the `Title` with the given `title_id`. -. `GET /people/{person_id}` -- returns the person information as a `Person` object for the `Person` with the given `person_id`. - -Fill in the `pydantic_models.py` file with the following models: `Title`, `Cast`, `CastMember`, `Work`, and `Person`. - -Fill in `queries.sql` with your `aiosql` queries. While there are multiple ways to do this, I used the following 4 queries: `get_title`, `get_cast`, `get_person`, and `get_work`. The former 2 accept a `title_id` as a parameter, and the latter 2 accept a `person_id` as a parameter. 
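To make the wiring between `fastapi`, `aiosql`, and `pydantic` concrete, here is a minimal sketch of what one of these endpoints _could_ look like. It assumes a `Title` model in `pydantic_models.py` and a `get_title` query that does a `SELECT *` on the `titles` table, both of which are yours to design, so treat it as the general shape rather than a finished solution.

[source,python]
----
# api.py -- illustrative sketch only
import sqlite3

from fastapi import FastAPI, HTTPException

from .database import database_path, queries
from .pydantic_models import Title

app = FastAPI()


@app.get("/titles/{title_id}", response_model=Title)
async def get_title(title_id: str):
    conn = sqlite3.connect(str(database_path))
    try:
        results = queries.get_title(conn, title_id=title_id)
    finally:
        conn.close()

    if not results:
        raise HTTPException(status_code=404, detail="title not found")

    # the unpacking below assumes a SELECT * query and a Title model whose
    # fields mirror the titles table, with genres stored as a list of strings
    row = results[0]
    return Title(
        title_id=row[0],
        type=row[1],
        primary_title=row[2],
        original_title=row[3],
        is_adult=row[4],
        premiered=row[5],
        ended=row[6],
        runtime_minutes=row[7],
        genres=row[8].split(",") if row[8] else [],
    )
----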
- -The following are screenshots from calling the following endpoints. - -image::figure46.webp[Call to /titles/tt2194499, width=792, height=500, loading=lazy, title="Call to /titles/tt2194499"] - -image::figure47.webp[Call to /cast/tt2194499, width=792, height=500, loading=lazy, title="Call to /cast/tt2194499"] - -image::figure48.webp[Call to /people/nm1046097, width=792, height=500, loading=lazy, title="Call to /people/nm1046097"] - -Please replicate each of these screenshots with your _own_ screenshot for the following **different** endpoints. - -- `GET /titles/tt1754656` -- `GET /cast/tt1754656` -- `GET /people/nm0748620` - -[TIP] -==== -Don't forget you can run the following to load up our environment. - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 ----- - -In addition, you can run the following to find an unused port. - -[source,bash] ----- -find_port # for example, 7777 ----- - -Then, to run your backend from your `$SCRATCH/imdb` directory, you can run the following. - -[source,bash] ----- -python3 -m uvicorn backend.api.api:app --reload --port 7777 ----- -==== - -[TIP] -==== -You can _nest_ `pydantic` models. For example, in my `Person` model, I have a `list[Work]` field. This is a list of `Work` objects. -==== - -.Items to submit -==== -- The entire directory, with all files and folders in the `imdb` directory. -- A jupyter notebook containing the screenshots demonstrating the working endpoints. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project07.adoc b/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project07.adoc deleted file mode 100644 index 2dd8161b2..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project07.adoc +++ /dev/null @@ -1,250 +0,0 @@ -= TDM 40200: Project 7 -- 2023 - -**Motivation:** Dashboards are everywhere -- many of our corporate partners' projects are to build dashboards (or dashboard variants)! Dashboards are used to interactively visualize some set of data. Dashboards can be used to display, add, remove, filter, or complete some customized operation to data. Ultimately, a dashboard is really a website focused on displaying data. Dashboards are so popular, there are entire frameworks designed around making them with less effort, faster. Two of the more popular examples of such frameworks are https://shiny.rstudio.com/[`shiny`] (in R) and https://dash.plotly.com/introduction[`dash`] (in Python). While these tools are incredibly useful, it can be very beneficial to take a step back and build a dashboard (or website) from scratch (we are going to utilize many powerfuly packages and tools that make this far from "scratch", but it will still be more from scratch than those dashboard frameworks). - -**Context:** This is the sixth in a series of projects focused around slowly building a dashboard. 
Students will have the opportunity to: create a backend (API) using `fastapi`, connect the backend to a database using `aiosql`, use the `jinja2` templating engine to create a frontend, use `htmx` to add "reactivity" to the frontend, create and use forms to insert data into the database, containerize the application so it can be deployed anywhere, and deploy the application to a cloud provider. Each week the project will build on the previous week, however, each week will be self-contained. This means that you can complete the project in any order, and if you miss a week, you can still complete the following project using the provided starting point. - -**Scope:** Python, dashboards - -.Learning Objectives -**** -- Continue to develop skills and techniques using `fastapi` to build a backend. -- Learn how to use `pydantic` for data validation and type hints. -- Learn how `fastapi` and `pydantic` work together to create endpoints that validate data and return typed responses. -- Create a directory structure to assemble a `fastapi` backend. -- Use `fastapi` and `jinja2` to build a frontend. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -[IMPORTANT] -==== -This project assumes you have a working backend from the previous project. Need a clean slate? You can copy over a working API starting Saturday, February 25th. - -[source,bash] ----- -mv $SCRATCH/imdb $SCRATCH/imdb.bak -mkdir $SCRATCH/imdb -cp -a /anvil/projects/tdm/etc/project06/* $SCRATCH/imdb -cp /anvil/projects/tdm/data/movies_and_tv/imdb.db $SCRATCH/imdb/backend/api/ ----- - -Then, to run the API, first load up our Python environment. - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 ----- - -Next, find an unused port to run the API on. - -[source,bash] ----- -find_port # 7777, for example ----- - -Then, run the API using the port from the previous step, in our case, 7777. - -[source,bash] ----- -cd $SCRATCH/imdb -python3 -m uvicorn backend.api.api:app --reload --port 7777 ----- -==== - -In this project, we are going to build a simple _frontend_ for our API using `fastapi` and `jinja2`. Typically, when using `fastapi` and `jinja2`, you would simply create a `templates` directory in your project, and "wire" it all up in the same location. However, it is extremely common for frontends to use different technologies like `reactjs`, `vuejs`, or `svelte`. Sometimes, the frontend and backend teams are even different! - -A clean separation of frontend and backend makes it easier to work on the frontend and backend independently. In order to try and emulate this behavior, we are going to create a frontend that is completely separate from our backend. Endpoints in our frontend will use the `httpx` library to make requests to our backend. This is a bit clunky, but worth the effort to try and emulate a real-world scenario. - -Let's get started by setting up our directory structure for the frontend. Create the following directory structure in your project. 
- -.directory structure ----- -imdb -├── backend -│ ├── api -│ │ ├── api.py -│ │ ├── database.py -│ │ ├── imdb.db -│ │ ├── pydantic_models.py -│ │ └── queries.sql -│ ├── pyproject.toml -│ └── README.md -├── frontend -│ ├── endpoints.py -│ └── templates -│ └── titles.html -└── README.md - -4 directories, 10 files ----- - -Next, let's fill in our `titles.html` and `endpoints.py` files, with a basic example to get started. - -.endpoints.py ----- -from fastapi import FastAPI, Request -from fastapi.responses import HTMLResponse, JSONResponse, PlainTextResponse -from fastapi.templating import Jinja2Templates -import httpx - -app = FastAPI() -templates = Jinja2Templates(directory='frontend/templates') - -port = 7777 - -@app.get("/titles/{title_id}", response_class=HTMLResponse) -async def get_title(request: Request, title_id: str): - async with httpx.AsyncClient() as client: - resp = await client.get(f'http://localhost:{port}/titles/{title_id}') - - return templates.TemplateResponse("titles.html", {"request": request, "object": resp.json()}) ----- - -.titles.html ----- - - - {{ object.title_id }} - - -
  <body>
    <h1>Test</h1>
- {{ object }} - - ----- - -Finally, a few notes. - -. You can change the value of `port` to whatever port you are using for your backend. Remember, you can use `find_port` to find an unused port. -+ -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 - -find_port # 7777, for example ----- -+ -. You will _also_ need to run your frontend on some `port` (for this project), you can use `find_port` to find an unused port as well. -. We use the `httpx` package to make requests to our backend, retrieve the response, and then pass it to our template. - -Now, in 1 terminal, run your backend on some `port`. Open another terminal, and run your frontend on some `port`. You can run the frontend using the following command. - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 - -cd $SCRATCH/imdb -python3 -m uvicorn frontend.endpoints:app --reload --port 8888 # replace 8888 with your port for your frontend ----- - -Finally, open a browser, and navigate to `http://localhost:8888/titles/tt0241527`. You should see something like the following. - -image::figure49.webp[Expected result, width=792, height=500, loading=lazy, title="Expected result"] - -For this question, include a screenshot in your Jupyter notebook showing the output of your frontend when you navigate to `http://localhost:8888/titles/tt1197624`. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Screenshot in your Jupyter notebook showing the output of your frontend when you navigate to `http://localhost:8888/titles/tt1197624` -==== - -=== Question 2 - -Okay great! At this point in time, you should have a working frontend and backend. The goal of this project is to learn about and use `jinja2` to build a frontend. Here are some resources to help you. - -- https://fastapi.tiangolo.com/advanced/templates/?h=template -- https://jinja.palletsprojects.com/en/3.1.x/templates/#synopsis - -Each of the following questions will introduce a new requirement for your frontend. You will need to add functions to your `endpoints.py` file, add new HTML templates to your `templates` directory, and add stylistic elements to your HTML templates using a CSS framework. - -Add new functions for each of the other two endpoints in your backend. In addition, create new HTML templates for each of the other two endpoints as well, that return a very basic HTML page. Basically, duplicate what we did for `titles` for `cast` and `people`. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Screenshot in your Jupyter notebook showing the output of your frontend when you navigate to `http://localhost:8888/cast/tt1197624`. -- Screenshot in your Jupyter notebook showing the output of your frontend when you navigate to `http://localhost:8888/people/nm1046097`. -==== - -=== Question 3 - -Update your 3 templates to use HTML elements that make sense. For example, items that are a part of a list, perhaps you should use the `ul` and `li` tags. Titles, should maybe be in an `h1` or `h2` tag, etc. - -Finally, make sure to use a https://jinja.palletsprojects.com/en/3.1.x/templates/#list-of-control-structures[for loop] at least 1 time in at least 1 of your templates. - -Add new screenshots of your updated webpages to your Jupyter notebook. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-- Screenshot in your Jupyter notebook showing the output of your frontend when you navigate to `http://localhost:8888/titles/tt1197624`. -- Screenshot in your Jupyter notebook showing the output of your frontend when you navigate to `http://localhost:8888/cast/tt1197624`. -- Screenshot in your Jupyter notebook showing the output of your frontend when you navigate to `http://localhost:8888/people/nm1046097`. -==== - -=== Question 4 - -Update your `cast.html` template so that each member of the cast has a link to their `people` page. - -Include a couple screenshots demonstrating your updated page's functionality. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Finally, HTML is not very pretty, and doesn't give you a lot of room for expression. Let's add some CSS using the bootstrap framework. Add the following tag to the `head` of each of your HTML templates. - -[source,html] ----- - ----- - -Now, check out the bootstrap docs https://getbootstrap.com/docs/5.3/getting-started/introduction/[here], and the examples https://getbootstrap.com/docs/5.3/examples/[here]. - -Use the `class` attributes to add some styling to _all_ of your templates. Once you feel satisfied with your styling, add a screenshot of each of your updated pages to your Jupyter notebook. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -- Screenshot in your Jupyter notebook showing the output of your frontend when you navigate to `http://localhost:8888/titles/tt1197624`. -- Screenshot in your Jupyter notebook showing the output of your frontend when you navigate to `http://localhost:8888/cast/tt1197624`. -- Screenshot in your Jupyter notebook showing the output of your frontend when you navigate to `http://localhost:8888/people/nm1046097`. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project08.adoc b/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project08.adoc deleted file mode 100644 index 559625a48..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project08.adoc +++ /dev/null @@ -1,223 +0,0 @@ -= TDM 40200: Project 8 -- 2023 - -**Motivation:** Dashboards are everywhere -- many of our corporate partners' projects are to build dashboards (or dashboard variants)! Dashboards are used to interactively visualize some set of data. Dashboards can be used to display, add, remove, filter, or complete some customized operation to data. Ultimately, a dashboard is really a website focused on displaying data. Dashboards are so popular, there are entire frameworks designed around making them with less effort, faster. Two of the more popular examples of such frameworks are https://shiny.rstudio.com/[`shiny`] (in R) and https://dash.plotly.com/introduction[`dash`] (in Python). 
While these tools are incredibly useful, it can be very beneficial to take a step back and build a dashboard (or website) from scratch (we are going to utilize many powerfuly packages and tools that make this far from "scratch", but it will still be more from scratch than those dashboard frameworks). - -**Context:** This is the seventh in a series of projects focused around slowly building a dashboard. Students will have the opportunity to: create a backend (API) using `fastapi`, connect the backend to a database using `aiosql`, use the `jinja2` templating engine to create a frontend, use `htmx` to add "reactivity" to the frontend, create and use forms to insert data into the database, containerize the application so it can be deployed anywhere, and deploy the application to a cloud provider. Each week the project will build on the previous week, however, each week will be self-contained. This means that you can complete the project in any order, and if you miss a week, you can still complete the following project using the provided starting point. - -**Scope:** Python, dashboards - -.Learning Objectives -**** -- Continue to develop skills and techniques using `fastapi` to build a backend. -- Learn how to use `pydantic` for data validation and type hints. -- Learn how `fastapi` and `pydantic` work together to create endpoints that validate data and return typed responses. -- Create a directory structure to assemble a `fastapi` backend. -- Use `fastapi` and `jinja2` to build a frontend. -- Use `fastapi` and `html` to create forms to submit data to the database. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` - -== Questions - -[WARNING] -==== -Interested in being a TA? Please apply: https://purdue.ca1.qualtrics.com/jfe/form/SV_08IIpwh19umLvbE -==== - -=== Question 1 - -[IMPORTANT] -==== -This project assumes you have a working backend and frontend from the previous project. Need a clean slate? You can copy over a working API starting Saturday, March 4th. - -[source,bash] ----- -mv $SCRATCH/imdb $SCRATCH/imdb.bak2 -mkdir $SCRATCH/imdb -cp -a /anvil/projects/tdm/etc/project07/* $SCRATCH/imdb -cp /anvil/projects/tdm/data/movies_and_tv/imdb.db $SCRATCH/imdb/backend/api/ ----- - -Then, to run the API, first load up our Python environment. - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 ----- - -Next, find an unused port to run the API on. - -[source,bash] ----- -find_port # 7777, for example ----- - -Then, run the API using the port from the previous step, in our case, 7777. - -[source,bash] ----- -cd $SCRATCH/imdb -python3 -m uvicorn backend.api.api:app --reload --port 7777 ----- - -In addition, open another terminal to run the frontend using another port, in our case, 8888. - -[source,bash] ----- -cd $SCRATCH/imdb -python3 -m uvicorn frontend.endpoints:app --reload --port 8888 ----- - -You can visit the following links to see the barebones pages. - -- http://localhost:8888/titles/tt1825683 -- http://localhost:8888/people/nm0430107 -- http://localhost:8888/cast/tt1825683 -==== - -The goal of this project is to create functioning _forms_ that allow a user to add a new person and title to the database. We will break down the steps for you, as forms _can_ be a bit confusing. 
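Before building anything, it can help to see how the pieces will eventually fit together. Below is a minimal, self-contained sketch -- not the project's actual code -- showing the two halves involved: a frontend route that simply renders a form template, and a backend route that receives the submitted fields via `fastapi`'s `Form`. The template name and the person fields match the tips later in this question; the endpoint paths, function names, and directory layout are assumptions for illustration only, and both halves are shown in one app purely for brevity (in the project they live in separate frontend and backend apps).

[source,python]
----
# Minimal sketch of a create-person flow. Template name and field names follow
# this project's description; paths and function names are assumptions.
from typing import Union

from fastapi import FastAPI, Form, Request
from fastapi.responses import HTMLResponse
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="frontend/templates")

# Frontend half: just render the template containing the <form>.
@app.get("/people/create", response_class=HTMLResponse)
async def show_create_person_form(request: Request):
    return templates.TemplateResponse("create_person.html", {"request": request})

# Backend half: the form's `action` should point at an endpoint like this one.
# Each parameter mirrors the `name` attribute of an <input> in the form.
# (Parsing Form data requires the python-multipart package.)
@app.post("/api/people")  # illustrative path -- use your backend's own route
async def create_person(
    person_id: str = Form(...),
    name: str = Form(...),
    born: int = Form(...),
    died: Union[int, None] = Form(None),  # optional field
):
    # ...insert the row into the database here...
    return {"person_id": person_id, "name": name, "born": born, "died": died}
----

The questions below walk through building each of these pieces properly (templates first, then the frontend routes, then the backend endpoints). One practical note: FastAPI matches routes in the order they are declared, so if your frontend also defines a `/people/{person_id}` route, declare the `/people/create` route before it so that `create` is not captured as a `person_id`.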
- -Let's start by creating templates for our 2 new forms: `create_person.html` and `create_title.html`. These templates will live in our _frontend_. Use the following resource to put together the `<form>
` elements for each of these pages. - -- https://developer.mozilla.org/en-US/docs/Web/HTML/Element/form - -[TIP] -==== -Pay close attention to the `action` and `method` attributes of the `` element. Since we are uploading new data, the `method` will be `post`, and the `action` will be the URL of the endpoint that will handle the form submission. -==== - -[TIP] -==== -You can use this as a starting point for your templates. - -[source,html] ----- - - - Create thing - - - - - ----- -==== - -[TIP] -==== -Don't forget your "submit" button! This will be responsible for making the request to the given endpoint in your `action` attribute with the http method specified in your `method` attribute. For example, if you have a form with the `action` attribute `http://localhost:1234/api/thing` and the `method` attribute `put`, then when the user clicks the submit button, the browser will make a `PUT` request to `http://localhost:1234/api/thing`, with the content of the form fields. -==== - -[TIP] -==== -For this project, our people request will have the following fields: `person_id`, `name`, `born`, and `died` only. -For this project, our titles request will have the following fields: `title_id`, `type`, `primary_title`, `original_title`, `runtime_minutes`, `premiered`, and `ended` only. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Next, its time to create endpoints that will show our forms! Since these endpoints will only be responsible for displaying our form, they should be part of our _frontend_. We will create 2 endpoints: - -- `GET /people/create` -- `GET /titles/create` - -These endpoints should simply display the forms we created in the previous step. - -For this question, include a screenshot showing each of your forms in the browser. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -At this stage, you should be able to pop open a browser and visit `http://localhost:8888/people/create/` and `http://localhost:8888/titles/create/` and see your forms. However, if you try to submit the form, nothing will happen -- after all, we haven't created the api endpoints that will handle the form submissions yet! - -Let's start that process now. - -First, create two new queries in your `queries.sql` file: `create_person` and `create_title`. These queries should insert a new row into the `people` and `titles` tables, respectively. - -For this question, paste the queries (the complete additions to the `queries.sql` file) in a jupyter notebook markdown cell. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Finally, create two new api endpoints (in your backend). These endpoints should be straightforward and do the following. - -. Establish a connection to the database. -. Insert the data. -. Return a `dict` with values as a form of a success message. - -[TIP] -==== -https://fastapi.tiangolo.com/tutorial/request-forms/#define-form-parameters[This] and https://github.com/tiangolo/fastapi/issues/854[this] will likely be helpful. -==== - -[TIP] -==== -If you want a field to be optional, you'll want to do something like: - -[source,python] ----- -from typing import Union -from fastapi import Form - -async def some_func(some_field: Union[str, None] = Form(None)): - pass ----- -==== - -[TIP] -==== -These will need to be `POST` requests, since we are adding new data to the database. 
-==== - -For this question, go ahead and test it out! Please _use_ your new forms to create a new person and new title. Include screenshots of the forms right _before_ clicking "submit". Then, include screenshots of the forms right _after_ clicking "submit". - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Last but certainly not least, lets go ahead and view our new title and person using our frontend. Navigate to the following pages and include screenshots of the pages in your notebook. - -- http://localhost:8888/titles/{your_new_title_id} -- http://localhost:8888/people/{your_new_person_id} - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project09.adoc b/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project09.adoc deleted file mode 100644 index 8d1b88718..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project09.adoc +++ /dev/null @@ -1,198 +0,0 @@ -= TDM 40200: Project 9 -- 2023 - -**Motivation:** Dashboards are everywhere -- many of our corporate partners' projects are to build dashboards (or dashboard variants)! Dashboards are used to interactively visualize some set of data. Dashboards can be used to display, add, remove, filter, or complete some customized operation to data. Ultimately, a dashboard is really a website focused on displaying data. Dashboards are so popular, there are entire frameworks designed around making them with less effort, faster. Two of the more popular examples of such frameworks are https://shiny.rstudio.com/[`shiny`] (in R) and https://dash.plotly.com/introduction[`dash`] (in Python). While these tools are incredibly useful, it can be very beneficial to take a step back and build a dashboard (or website) from scratch (we are going to utilize many powerfuly packages and tools that make this far from "scratch", but it will still be more from scratch than those dashboard frameworks). - -**Context:** This is the eighth in a series of projects focused around slowly building a dashboard. Students will have the opportunity to: create a backend (API) using `fastapi`, connect the backend to a database using `aiosql`, use the `jinja2` templating engine to create a frontend, use `htmx` to add "reactivity" to the frontend, create and use forms to insert data into the database, containerize the application so it can be deployed anywhere, and deploy the application to a cloud provider. Each week the project will build on the previous week, however, each week will be self-contained. This means that you can complete the project in any order, and if you miss a week, you can still complete the following project using the provided starting point. - -**Scope:** Python, dashboards - -.Learning Objectives -**** -- Continue to develop skills and techniques using `fastapi` to build a backend. 
-- Learn how to use `pydantic` for data validation and type hints. -- Learn how `fastapi` and `pydantic` work together to create endpoints that validate data and return typed responses. -- Create a directory structure to assemble a `fastapi` backend. -- Use `fastapi` and `jinja2` to build a frontend. -- Use `fastapi` and `html` to create forms to submit data to the database. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` - -== Questions - -=== Question 1 - -Let's start this project with a fresh slate: fresh frontend and backend to work on. - -[source,bash] ----- -mv $SCRATCH/imdb $SCRATCH/imdb.bak3 -mkdir $SCRATCH/imdb -cp -a /anvil/projects/tdm/etc/project08/* $SCRATCH/imdb -cp /anvil/projects/tdm/data/movies_and_tv/imdb.db $SCRATCH/imdb/backend/api/ ----- - -Then, to run the API, first load up our Python environment. - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 ----- - -Next, find an unused port to run the API on. - -[source,bash] ----- -find_port # 7777, for example ----- - -Then, run the API using the port from the previous step, in our case, 7777. - -[source,bash] ----- -cd $SCRATCH/imdb -python3 -m uvicorn backend.api.api:app --reload --port 7777 ----- - -In addition, open another terminal to run the frontend using another port, in our case, 8888. - -[source,bash] ----- -cd $SCRATCH/imdb -python3 -m uvicorn frontend.endpoints:app --reload --port 8888 ----- - -You can visit the following links to see the barebones pages. - -- http://localhost:8888/titles/tt1825683 -- http://localhost:8888/people/nm0430107 -- http://localhost:8888/cast/tt1825683 - -This is all you need to do for this question -- just make sure those pages load up correctly. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -You may notice a few changes to the frontend and backend. They are largely just changes so that you can add a title, select the genres for the title, and add the number of votes and rating as well. If you peek in the backend, you will see that this is accomplished using 2 database inserts. One insert handles inserting the data that normally lives in the `titles` table, and the other handles inserting the `rating`, `votes`, and `title_id` in the `ratings` table. Its not any harder than the previous project, however, it added unnecessary complexity that we didn't want you to worry about for your very first project dealing with forms. - -For this project, we are going to implement two new pages: - -- http://localhost:8888/people/{person_id}/update -- http://localhost:8888/titles/{title_id}/update - -Ultimately, these two pages will display forms with the current data for the given person with `person_id` or title with `title_id`, and allow the user to replace any information with new _updated_ information. Upon clicking the buttons, the data will be updated in the database. Pretty straightforward! This is all common behavior, and the _update_ part of CRUD (create, read, update, and delete). - -First thing is first, start by creating two new templates: `update_person.html` and `update_title.html`. 
For now, these can be copy/pasted from the `create_person.html` and `create_title.html` templates -- we will make progressive modifications to these templates as we go. - -Next, create two new endpoints in `endpoints.py`: 1 to handle displaying the `update_person.html` template (using a function called `update_person`), and 1 to handle displaying the `update_title.html` template (using a function called `update_title`). - -Finish this question by taking two screenshots of the two new pages, and including them in your jupyter notebook. - -- http://localhost:8888/people/nm0430107/update -- http://localhost:8888/titles/tt1825683/update - -Please make sure that the URL is included in the screenshots. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Well, these pages are pretty boring, and so far, not really helpful at all! We can't really _see_ the data we are wanting to modify! Who is person `nm0430107`? What is title `tt1825683`? It is important to _both_ have a form to update data and to be able to _see_ the data we are wanting to update! - -The way to accomplish this is by using the `value` attribute of the various `input` tags in the `update_title.html` and `update_person.html` templates. For example, given the following `input` tag: - -[source,html] ----- - ----- - -The resulting form will look like a regular "text" `input` field, but it will already have the text "John Smith" inside of it! - -Update the `update_person` and `update_title` functions in `endpoints.py` to make a request (using the `httpx` library, just like we do in the `get_title` function in `endpoints.py`) and get the current information for the person or title of interest. Pass this data to the `update_person.html` or `update_title.html` templates, and use the `value` attribute to display the current data in the form. - -Finish this question by taking two screenshots of the two new pages, and including them in your jupyter notebook. - -- http://localhost:8888/people/nm0430107/update -- http://localhost:8888/titles/tt1825683/update - -Please make sure that the URL is included in the screenshots. - -[TIP] -==== -A tip for handling the checkboxes. In your `endpoints.py` `update_title` function, edit the response before returning it, as follows. - -[source,python] ----- -response = resp.json() - - genres = response.get("genres") - for genre in genres: - response[genre.lower().replace("-", "_")] = True ----- - -This will do two primary things. Let you access each checkbox and check it in the template by doing something like: - -[source,html] ----- - ----- - -In addition, it will convert the "-" in "sci-fi" to an underscore, so it can be accessed in the template. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Okay, the final step is to actually update the data in the database upon a form submission. In order to make this work, you must update your two templates so that the `method` attribute is `post`, and the `action` attribute is `/people/{person_id}/update` or `/titles/{title_id}/update` endpoints of your backend. - -Next, you must update your backend, so those endpoints take the form data and actually _update_ the data in the database. To do this, you will need to create two new endpoints in `api.py` that take the form data and update the values in the database. Please note, you will also need to create two new queries in `queries.sql`. 
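If it helps to picture the backend side before wiring everything up, here is a rough sketch of what one such update endpoint _could_ look like. The query name, database path, and endpoint path below are assumptions for illustration; your actual `api.py`, `database.py`, and `queries.sql` already define their own structure, so adapt accordingly.

[source,python]
----
# Illustrative sketch only -- names and paths are assumptions, not the
# project's actual code. It assumes queries.sql gained a named query like:
#   -- name: update_person!
#   UPDATE people SET name = :name, born = :born, died = :died
#   WHERE person_id = :person_id;
import sqlite3
from typing import Union

import aiosql
from fastapi import FastAPI, Form

app = FastAPI()
queries = aiosql.from_path("backend/api/queries.sql", "sqlite3")  # assumed path

@app.post("/people/{person_id}/update")  # assumed route
async def update_person(
    person_id: str,
    name: str = Form(...),
    born: int = Form(...),
    died: Union[int, None] = Form(None),
):
    conn = sqlite3.connect("backend/api/imdb.db")  # assumed database location
    try:
        # aiosql exposes each named query as a method; "!" queries run
        # INSERT/UPDATE/DELETE statements.
        queries.update_person(conn, name=name, born=born, died=died, person_id=person_id)
        conn.commit()
    finally:
        conn.close()
    return {"person_id": person_id, "updated": True}
----

An analogous endpoint and named query would handle titles; only the table, the fields, and the form parameters differ.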
- -Use both of your new "update" forms to update a known title and actor. Take two screenshots of the pages that appear _after_ submitting the forms. Include these screenshots in your jupyter notebook. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Finally, prove that your updates were successful by taking two screenshots of the following pages, now that you've updated the data in the database. - -- http://localhost:8888/people/nm0430107 -- http://localhost:8888/titles/tt1825683 - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project10.adoc b/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project10.adoc deleted file mode 100644 index dff708cd6..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project10.adoc +++ /dev/null @@ -1,159 +0,0 @@ -= TDM 40200: Project 10 -- 2023 - -**Motivation:** Dashboards are everywhere -- many of our corporate partners' projects are to build dashboards (or dashboard variants)! Dashboards are used to interactively visualize some set of data. Dashboards can be used to display, add, remove, filter, or complete some customized operation to data. Ultimately, a dashboard is really a website focused on displaying data. Dashboards are so popular, there are entire frameworks designed around making them with less effort, faster. Two of the more popular examples of such frameworks are https://shiny.rstudio.com/[`shiny`] (in R) and https://dash.plotly.com/introduction[`dash`] (in Python). While these tools are incredibly useful, it can be very beneficial to take a step back and build a dashboard (or website) from scratch (we are going to utilize many powerfuly packages and tools that make this far from "scratch", but it will still be more from scratch than those dashboard frameworks). - -**Context:** This is the ninth in a series of projects focused around slowly building a dashboard. Students will have the opportunity to: create a backend (API) using `fastapi`, connect the backend to a database using `aiosql`, use the `jinja2` templating engine to create a frontend, use `htmx` to add "reactivity" to the frontend, create and use forms to insert data into the database, containerize the application so it can be deployed anywhere, and deploy the application to a cloud provider. Each week the project will build on the previous week, however, each week will be self-contained. This means that you can complete the project in any order, and if you miss a week, you can still complete the following project using the provided starting point. - -**Scope:** Python, dashboards - -.Learning Objectives -**** -- Continue to develop skills and techniques using `fastapi` to build a backend. -- Learn how to use `pydantic` for data validation and type hints. 
-- Learn how `fastapi` and `pydantic` work together to create endpoints that validate data and return typed responses. -- Create a directory structure to assemble a `fastapi` backend. -- Use `fastapi` and `jinja2` to build a frontend. -- Use `fastapi` and `html` to create forms to submit data to the database. -- Use `htmx` to add "reactivity" to the frontend. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/movies_and_tv/imdb.db` - -== Questions - -=== Part 1 - -Let's start this project with a fresh slate: fresh frontend and backend to work on. - -[source,bash] ----- -mv $SCRATCH/imdb $SCRATCH/imdb.bak4 -mkdir $SCRATCH/imdb -cp -a /anvil/projects/tdm/etc/project09/* $SCRATCH/imdb -cp /anvil/projects/tdm/data/movies_and_tv/imdb.db $SCRATCH/imdb/backend/api/ ----- - -Then, to run the API, first load up our Python environment. - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -module load python/f2022-s2023 ----- - -Next, find an unused port to run the API on. - -[source,bash] ----- -find_port # 7777, for example ----- - -Then, run the API using the port from the previous step, in our case, 7777. - -[source,bash] ----- -cd $SCRATCH/imdb -python3 -m uvicorn backend.api.api:app --reload --port 7777 ----- - -In addition, open another terminal to run the frontend using another port, in our case, 8888. - -[source,bash] ----- -cd $SCRATCH/imdb -python3 -m uvicorn frontend.endpoints:app --reload --port 8888 ----- - -You can visit the following links to see the barebones pages. - -- http://localhost:8888/titles/tt1825683 -- http://localhost:8888/people/nm0430107 -- http://localhost:8888/cast/tt1825683 -- http://localhost:8888/titles/tt1825683/update -- http://localhost:8888/people/nm0430107/update - -This is all you need to do for this question -- just make sure those pages load up correctly. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Part 2 - -The goal of this project is straightforward. Use `htmx` and make the modifications needed in order to update the following page so that it behaves like https://htmx.org/examples/click-to-edit/[this] `htmx` example. - -- http://localhost:8888/people/nm0430107/update - -Make the modifications needed so you can see the following. Of course, if you added your own custom CSS, that is perfectly fine, the _behavior_ is what is critical. - -image::figure50.gif[Goal results, width=792, height=500, loading=lazy, title="Goal results"] - -[TIP] -==== -These are links that you may find helpful: - -- https://www.python-httpx.org/quickstart/#sending-form-encoded-data -- https://htmx.org/examples/click-to-edit/ -- https://github.com/renceInbox/fastapi-todo -- https://htmx.org/docs/ -==== - -[TIP] -==== -To make a request and pass along the form data, you can use the following code: - -[source,python] ----- -async with httpx.AsyncClient() as client: - resp = await client.post(f'{URL}', data=form_data) ----- - -Where `form_data` is a dict of key/value pairs. -==== - -[TIP] -==== -This effectively transforms this part of the web app into a SPA (single page app) -- you will notice that in the example, the URL does not change. -==== - -[TIP] -==== -You don't need to modify the _backend_ at all -- this project is all about the frontend. 
-==== - -[TIP] -==== -Make sure that if you update an actor that is not yet dead, that you don't leave "None" when updating the death date -- this will throw an error since "None" is not a valid number. Just put in a number like 2050 or 2100. -==== - -[TIP] -==== -You'll ultimately just need to modify: `people.html`, `update_person.html`, and `endpoints.py`. -==== - -[TIP] -==== -You'll ultimately just need to add a single endpoint to `endpoints.py`. -==== - -.Items to submit -==== -- Code used to solve this problem: the templates you updated, and `endpoints.py`. -- GIF or video demonstrating the behavior of the web app -- just like the example gif, but using a different actor. Be sure to include the entire screen, including the URL bar. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project11.adoc b/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project11.adoc deleted file mode 100644 index fb72d9285..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project11.adoc +++ /dev/null @@ -1,505 +0,0 @@ -= TDM 40200: Project 11 -- 2023 - -**Motivation:** Containers are everywhere and a very popular method of packaging an application with all of the requisite dependencies. In the previous series of projects you've built a web application. While right now it may be easy to share and run your application with another individual, as time goes on and packages are updated, this is less and less likely to be the case. Containerizing your application ensures that the application will have the proper versions of the proper packages available in the proper location to run. - -**Context:** This is a first of a series of projects focused on containers. The end goal of this series is to solidify the concept of a container, and enable you to "containerize" the application you've spent the semester building. You will even get the opportunity to deploy your containerized application! - -**Scope:** Python, containers, UNIX - -.Learning Objectives -**** -- Improve your mental model of what a container is and why it is useful. -- Use UNIX tools to effectively create a container. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -The most popular containerization tool at the time of writing is likely Docker. Unfortunately, Docker is not available on Anvil, as it currently does not enable _rootless_ container creation. In addition, for this first project, we want to mess around with some UNIX tools to _essentially_ create a container -- these tools _also_ require superuser permissions. Therefore, this project will be completed completely from within a shell, using shell tools, on a virtual machine which you will launch. - -We will essentially be running a container on a virtual machine from within a SLURM job on Anvil. 
Sounds a bit crazy, and it is, but it will provide you with the ability to work fearlessly and break things. Of course, if you _do_ break things, you can _easily_ reset! - -First thing is first. Open up a terminal on Anvil. This could be from within Jupyter Lab, or via VS Code, or just from an `ssh` session from within your own terminal. - -Next, to ensure that SLURM environment variables don't alter or effect our SLURM job, run the following. - -[source,bash] ----- -for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done; ----- - -Next, let's make a copy of a pre-made operating system image. This image has Alpine Linux and a few basic tools installed, including: nano, vim, emacs, and Docker. - -[source,bash] ----- -cp /anvil/projects/tdm/apps/qemu/images/builder.qcow2 $SCRATCH ----- - -Next, we want to acquire enough resources (CPU and memory) to not have to worry about something not working. To do this we will use SLURM to launch a job with 4 cores and about 8GB of memory. - -[source,bash] ----- -salloc -A cis220051 -p shared -n 4 -c 1 -t 04:00:00 ----- - -Next, we need to make `qemu` available to our shell. - -[source,bash] ----- -module load qemu ----- - -Next, let's launch our virtual machine with about 8GB of memory and 4 cores. - -[source,bash] ----- -qemu-system-x86_64 -vnc none,ipv4 -hda $SCRATCH/builder.qcow2 -m 8G -smp 4 -enable-kvm -net nic -net user,hostfwd=tcp::2200-:22 & ----- - -[IMPORTANT] -==== -If for some reason you get an error or message saying that port 2200 is being used, no problem! Just change the number in the previous command from 2200 to the output of the following command. - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -find_port # this will print a port number ----- - -Then, when you run the `ssh` command below, use the port number that was printed by `find_port`, instead of 2200. -==== - -Next, its time to connect to our virtual machine. We will use `ssh` to do this. - -[source,bash] ----- -ssh -p 2200 tdm@localhost -o StrictHostKeyChecking=no ----- - -If the command fails, try waiting a minute and rerunning the command -- it may take a minute for the virtual machine to boot up. - -When prompted for a password, enter `purdue`. Your username is `tdm` and password is `purdue`. - -Finally, now that you have a shell in your virtual machine, you can do anything you want! You have superuser permissions within your virtual machine! To run a command as the super user prepend `doas` to the command. For example, to list the files in the `/root` directory, you would run: `doas ls /root` -- it may prompt you for a password, which is `purdue`. - -If at any time you break something and don't know how to fix it, you can "reset" everything by simply killing the virtual machine, removing `$SCRATCH/builder.qcow2`, and rerunning the commands above. - -To kill the virtual machine. - -[source,bash] ----- -# exit the virtual machine by typing "exit" then, on Anvil, run: -fg %1 # or fg 1 -- this will bring the process to the foreground -# finally, press CTRL+C to kill the process ----- - -For this question, submit a screenshot showing the output of `hostname` from within your virtual machine! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -[TIP] -==== -I would highly recommend watching https://www.youtube.com/watch?v=8fi7uSYlOdc. It is an excellent 40 minute video where the author essentially creates a container using golang. 
While you may not understand golang, she does a great job of explaining, and it will give you a good idea of what is going on. -==== - -[TIP] -==== -This link: https://ericchiang.github.io/post/containers-from-scratch/ - -Is the inspiration for this project. We are just translating it over using our own tools and resources. -==== - -First thing is first. Let's get a root filesystem that will be the "base" of our container. Since our virtual machine is running Alpine Linux, it could be cool to have our container be based on a different operating system -- let's use Ubuntu. - -From within your virtual machine, run the following. - -[NOTE] -==== -From this point forward, when we ask you to run any command, please assume we mean from inside your virtual machine unless otherwise specified. -==== - -[source,bash] ----- -wget https://releases.ubuntu.com/20.04.6/ubuntu-20.04.6-live-server-amd64.iso ----- - -This will download the `.iso` file from Ubuntu. Next, we need to mount the `.iso` file so that we can access the files within it. - -[source,bash] ----- -# create a directory to mount the iso file on -mkdir /home/tdm/ubuntu_mounted - -# mount the iso file -doas modprobe loop -doas mount -t iso9660 ubuntu-20.04.6-live-server-amd64.iso ubuntu_mounted ----- - -Now, if you run `ls -la /home/tdm/ubuntu_mounted`, you should see a bunch of files and directories. - -.ls -la /home/tdm/ubuntu_mounted ----- -total 83 -dr-xr-xr-x 1 root root 2048 Mar 14 18:01 . -drwxr-sr-x 4 tdm tdm 4096 Mar 30 11:11 .. -dr-xr-xr-x 1 root root 2048 Mar 14 18:01 .disk -dr-xr-xr-x 1 root root 2048 Mar 14 18:01 EFI -dr-xr-xr-x 1 root root 2048 Mar 14 18:01 boot -dr-xr-xr-x 1 root root 2048 Mar 14 18:02 casper -dr-xr-xr-x 1 root root 2048 Mar 14 18:01 dists -dr-xr-xr-x 1 root root 2048 Mar 14 18:01 install -dr-xr-xr-x 1 root root 34816 Mar 14 18:01 isolinux --r--r--r-- 1 root root 27491 Mar 14 18:02 md5sum.txt -dr-xr-xr-x 1 root root 2048 Mar 14 18:01 pool -dr-xr-xr-x 1 root root 2048 Mar 14 18:01 preseed -lr-xr-xr-x 1 root root 1 Mar 14 18:01 ubuntu -> . ----- - -We want the _filesystem_ from this iso. The filesystem is inside the following file: `/home/tdm/ubuntu_mounted/casper/filesystem.squashfs`. We have to unarchive that file, but before we can do that we need to install a package. - -[source,bash] ----- -doas apk add squashfs-tools ----- - -Now, we can unarchive the file. - -[source,bash] ----- -mkdir /home/tdm/ubuntu_fs -cp /home/tdm/ubuntu_mounted/casper/filesystem.squashfs /home/tdm/ubuntu_fs -cd /home/tdm/ubuntu_fs -doas unsquashfs filesystem.squashfs -cd -doas mv /home/tdm/ubuntu_fs/squashfs-root /home/tdm/ -rm -rf /home/tdm/ubuntu_fs/* -doas cp -r /home/tdm/squashfs-root/* /home/tdm/ubuntu_fs/ - -# cleanup -doas umount ubuntu_mounted -rmdir /home/tdm/ubuntu_mounted -doas rm ubuntu-20.04.6-live-server-amd64.iso -doas rm -rf /home/tdm/squashfs-root ----- - -Finally, inside `/home/tdm/ubuntu_fs`, you should see the root filesystem for Ubuntu. - -.ls -la /home/tdm/ubuntu_fs ----- -total 72 -drwxr-sr-x 18 tdm tdm 4096 Mar 30 11:29 . -drwxr-sr-x 4 tdm tdm 4096 Mar 30 11:32 .. 
-lrwxrwxrwx 1 tdm tdm 7 Mar 30 11:29 bin -> usr/bin -drwxr-xr-x 2 tdm tdm 4096 Mar 30 11:29 boot -drwxr-xr-x 5 tdm tdm 4096 Mar 30 11:29 dev -drwxr-xr-x 95 tdm tdm 4096 Mar 30 11:29 etc -drwxr-xr-x 2 tdm tdm 4096 Mar 30 11:29 home -lrwxrwxrwx 1 tdm tdm 7 Mar 30 11:29 lib -> usr/lib -lrwxrwxrwx 1 tdm tdm 9 Mar 30 11:29 lib32 -> usr/lib32 -lrwxrwxrwx 1 tdm tdm 9 Mar 30 11:29 lib64 -> usr/lib64 -lrwxrwxrwx 1 tdm tdm 10 Mar 30 11:29 libx32 -> usr/libx32 -drwxr-xr-x 2 tdm tdm 4096 Mar 30 11:29 media -drwxr-xr-x 2 tdm tdm 4096 Mar 30 11:29 mnt -drwxr-xr-x 2 tdm tdm 4096 Mar 30 11:29 opt -drwxr-xr-x 2 tdm tdm 4096 Mar 30 11:29 proc -drwx------ 2 tdm tdm 4096 Mar 30 11:29 root -drwxr-xr-x 11 tdm tdm 4096 Mar 30 11:29 run -lrwxrwxrwx 1 tdm tdm 8 Mar 30 11:29 sbin -> usr/sbin -drwxr-xr-x 6 tdm tdm 4096 Mar 30 11:29 snap -drwxr-xr-x 2 tdm tdm 4096 Mar 30 11:29 srv -drwxr-xr-x 2 tdm tdm 4096 Mar 30 11:29 sys -drwxr-xr-t 2 tdm tdm 4096 Mar 30 11:29 tmp -drwxr-xr-x 14 tdm tdm 4096 Mar 30 11:29 usr -drwxr-xr-x 13 tdm tdm 4096 Mar 30 11:29 var ----- - -Awesome! We are going to use this later! - -For this question, please include a screenshot of the final "product" -- the output of the `ls -la` command on the `/home/tdm/ubuntu_fs` directory. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -As mentioned before, we are going to follow very closely to https://ericchiang.github.io/post/containers-from-scratch/[this excellent post]. Therefore, the first tool we will be using is `chroot` (think "change root"). `chroot` is a command that allows you to change the root directory of the current process and its children. - -Currently, our root filesystem (in Alpine Linux _of_ Alpine Linux) is the following: - -.ls -la / ----- -total 85 -drwxr-xr-x 22 root root 4096 Feb 8 09:06 . -drwxr-xr-x 22 root root 4096 Feb 8 09:06 .. -drwxr-xr-x 2 root root 4096 Mar 30 11:22 bin -drwxr-xr-x 3 root root 1024 Feb 8 09:14 boot -drwxr-xr-x 13 root root 3120 Mar 30 10:56 dev -drwxr-xr-x 35 root root 4096 Mar 30 10:56 etc -drwxr-xr-x 4 root root 4096 Mar 30 10:10 home -drwxr-xr-x 10 root root 4096 Feb 8 09:14 lib -drwx------ 2 root root 16384 Feb 8 08:59 lost+found -drwxr-xr-x 5 root root 4096 Feb 8 08:59 media -drwxr-xr-x 2 root root 4096 Feb 8 08:59 mnt -drwxr-xr-x 3 root root 4096 Feb 8 09:19 opt -dr-xr-xr-x 149 root root 0 Mar 30 10:56 proc -drwx------ 2 root root 4096 Feb 8 09:09 root -drwxr-xr-x 8 root root 440 Mar 30 11:11 run -drwxr-xr-x 2 root root 12288 Feb 8 09:16 sbin -drwxr-xr-x 2 root root 4096 Feb 8 08:59 srv -drwxr-xr-x 2 root root 4096 Feb 8 09:06 swap -dr-xr-xr-x 13 root root 0 Mar 30 10:56 sys -drwxrwxrwt 4 root root 80 Mar 30 10:56 tmp -drwxr-xr-x 9 root root 4096 Mar 30 10:17 usr -drwxr-xr-x 13 root root 4096 Mar 30 10:17 var ----- - -We want to make it so that our root filesystem is the contents of our `ubuntu_fs` directory. To do this, we will use the `chroot` command. - -[source,bash] ----- -doas chroot /home/tdm/ubuntu_fs /bin/bash ----- - -This will result in running the `/bin/bash` shell where the root filesystem is the contents of the `/home/tdm/ubuntu_fs` directory. You'll have a `bash` shell _inside_ this directory. As a result, for example, you could run commands only available in Ubuntu: - -[source,bash] ----- -lsb_release -a ----- - -As you will be able to see, in _this_ shell, the root filesystem is the contents of the `/home/tdm/ubuntu_fs` directory: - -.ls -la / ----- -total 72 -drwxr-sr-x 18 1001 1001 4096 Mar 30 16:29 . 
-drwxr-sr-x 18 1001 1001 4096 Mar 30 16:29 .. -lrwxrwxrwx 1 1001 1001 7 Mar 30 16:29 bin -> usr/bin -drwxr-xr-x 2 1001 1001 4096 Mar 30 16:29 boot -drwxr-xr-x 5 1001 1001 4096 Mar 30 16:29 dev -drwxr-xr-x 95 1001 1001 4096 Mar 30 16:29 etc -drwxr-xr-x 2 1001 1001 4096 Mar 30 16:29 home -lrwxrwxrwx 1 1001 1001 7 Mar 30 16:29 lib -> usr/lib -lrwxrwxrwx 1 1001 1001 9 Mar 30 16:29 lib32 -> usr/lib32 -lrwxrwxrwx 1 1001 1001 9 Mar 30 16:29 lib64 -> usr/lib64 -lrwxrwxrwx 1 1001 1001 10 Mar 30 16:29 libx32 -> usr/libx32 -drwxr-xr-x 2 1001 1001 4096 Mar 30 16:29 media -drwxr-xr-x 2 1001 1001 4096 Mar 30 16:29 mnt -drwxr-xr-x 2 1001 1001 4096 Mar 30 16:29 opt -drwxr-xr-x 2 1001 1001 4096 Mar 30 16:29 proc -drwx------ 2 1001 1001 4096 Mar 30 16:29 root -drwxr-xr-x 11 1001 1001 4096 Mar 30 16:29 run -lrwxrwxrwx 1 1001 1001 8 Mar 30 16:29 sbin -> usr/sbin -drwxr-xr-x 6 1001 1001 4096 Mar 30 16:29 snap -drwxr-xr-x 2 1001 1001 4096 Mar 30 16:29 srv -drwxr-xr-x 2 1001 1001 4096 Mar 30 16:29 sys -drwxr-xr-t 2 1001 1001 4096 Mar 30 16:38 tmp -drwxr-xr-x 14 1001 1001 4096 Mar 30 16:29 usr -drwxr-xr-x 13 1001 1001 4096 Mar 30 16:29 var ----- - -So, when in this shell, running `ls -la` is actually running `/home/tdm/ubuntu_fs/usr/bin/ls -la`. Very cool! This is pretty powerful already and may even _feel_ kind of like a container! Let's test out how isolated we are. Open _another_ terminal and connect to your virtual machine from that terminal as well. This will involve first using `ssh` to connect to the backend where your SLURM job is running, and then using `ssh` to connect to your virtual machine from there. - -[source,bash] ----- -ssh a240.anvil.rcac.purdue.edu # connect to the given backend -- in my case, it was a240 -- yours may be different! -ssh -p 2200 tdm@localhost -o StrictHostKeyChecking=no # connect to the virtual machine ----- - -Once you are connected to your virtual machine, run the following command: - -[source,bash] ----- -top ----- - -Now, in your `chroot` "jail", run the following command: - -[source,bash] ----- -mount -t proc proc /proc -ps aux | grep -i top ----- - -If done correctly, you likely saw output similar to the following. - ----- -1001 2617 0.0 0.0 1624 960 ? S+ 16:49 0:00 top -root 2622 0.0 0.0 3312 720 ? S+ 16:50 0:00 grep --color=auto -i top ----- - -We are _inside_ our container, yet we can see the `top` command running on our VM. We are clearly _not_ isolated enough! In fact, from within our "container" we could probably even kill the `top` process that is outside of our "container": - -[source,bash] ----- -pkill top - -# after running this from inside our "container" switch tabs and you'll find that the top process stopped running! ----- - -To fix this, we need to create a _namespace_. - -[quote, Eric Chiang, https://ericchiang.github.io/post/containers-from-scratch/] -____ -Namespaces allow us to create restricted views of systems like the process tree, network interfaces, and mounts. - -Creating namespace is super easy, just a single syscall with one argument, unshare. The unshare command line tool gives us a nice wrapper around this syscall and lets us setup namespaces manually. In this case, we will create a PID namespace for the shell, then execute the chroot like the last example. -____ - -Let's test this out. First, exit our "container" by running `exit`. If you properly exited, the following command will no longer work. - -[source,bash] ----- -lsb_release -a ----- - -Now, let's use `unshare` to create a new _process_ or _PID_ namespace. 
- -[source,bash] ----- -doas unshare -p -f --mount-proc=/home/tdm/ubuntu_fs/proc chroot /home/tdm/ubuntu_fs /bin/bash ----- - -Upon success, you will now find that our shell `/bin/bash` seems to think it is process 1! - -[source,bash] ----- -ps aux ----- - -.output ----- -USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND -root 1 0.0 0.0 4248 3428 ? S 16:59 0:00 /bin/bash -root 12 0.0 0.0 5900 2764 ? R+ 17:00 0:00 ps aux ----- - -We are one step closer! For this question, include a series of screenshots showing your terminal input and output. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Finally, another key component of a container is limiting resources. Eric mentions that it doesn't make a lot of sense to have isolated processes if they can still eat up all of the system CPU and memory and potentially even cause other processes from the host system to crash. This is where `cgroups` (control groups) come in. - -Using `cgroups` we can limit the resources a process can use. For example, we could limit the CPUs or the memory of a process. That is exactly what we will do! Let's start by restricting the cores our container can use. - -On the virtual machine, outside of the container, run the following. - -[source,bash] ----- -doas su # become the superuser/root - -mkdir /sys/fs/cgroup/cpuset/tdm # create a directory for our cpuset cgroup - -ps aux ----- - -The output of `ps aux` should look something like the following. - -.ps aux output ----- - 2028 root 0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info - 2300 root 0:00 /sbin/syslogd -t -n - 2329 root 0:00 /sbin/acpid -f - 2361 chrony 0:00 /usr/sbin/chronyd -f /etc/chrony/chrony.conf - 2388 root 0:00 /usr/sbin/crond -c /etc/crontabs -f - 2426 root 0:00 sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups - 2433 root 0:00 /sbin/getty 38400 tty1 - 2434 root 0:00 /sbin/getty 38400 tty2 - 2438 root 0:00 /sbin/getty 38400 tty3 - 2442 root 0:00 /sbin/getty 38400 tty4 - 2446 root 0:00 /sbin/getty 38400 tty5 - 2450 root 0:00 /sbin/getty 38400 tty6 - 2454 root 0:00 sshd: tdm [priv] - 2456 tdm 0:00 sshd: tdm@pts/0 - 2457 tdm 0:00 -zsh - 2514 root 0:00 unshare -p -f --mount-proc=/home/tdm/ubuntu_fs/proc chroot /home/tdm/ubuntu_fs /bin/bash - 2515 root 0:00 /bin/bash - 2523 root 0:00 sshd: tdm [priv] - 2525 tdm 0:00 sshd: tdm@pts/1 - 2526 tdm 0:00 -zsh - 2528 root 0:00 zsh - 2530 root 0:00 ps aux ----- - -Notice the line _directly below_ the line with `unshare -p -f ...` -- this is the PID of the process we want to restrict! In this case, it is `2515`. - -[source,bash] ----- -echo 0 > /sys/fs/cgroup/cpuset/tdm/cpuset.mems -echo 0 > /sys/fs/cgroup/cpuset/tdm/cpuset.cpus -echo 2515 > /sys/fs/cgroup/cpuset/tdm/tasks - -# this limits the task with PID 2515 to only use CPU 0 - -mkdir /sys/fs/cgroup/memory/tdm # create a directory for our memory cgroup - -# in addition, lets disable swap -echo 0 > /sys/fs/cgroup/memory/tdm/memory.swappiness - -# lets also limit the memory to 100 MB -echo 100000000 > /sys/fs/cgroup/memory/tdm/memory.limit_in_bytes - -echo 2515 > /sys/fs/cgroup/memory/tdm/tasks ----- - -Let's test out the memory cgroup by creating the following `hungry.py` Python script and running it from within our container. 
- -.hungry.py -[source,python] ----- -x = bytearray(1024*1024*50) -print("Used 50") -y = bytearray(1024*1024*50) -print("Used 100") -z = bytearray(1024*1024*50) -print("Used 150") ----- - -Now, running `python3 hungry.py` from within our container should yield: - -.output ----- -Used 50 -Killed ----- - -Very cool! Now the process was killed because it exceeded the memory limit we set! Hopefully this project demonstrated that containers are easier than they may seem! Of course, these examples are not complete, and containers and various utilities provided by a tool like Docker are both more feature-rich and sound, however, we hope that this demystified things a little bit. - -[NOTE] -==== -There will still be a variety of things that aren't functioning the same way a true container would. For example, running `hostname newhost` in the container would also change the hostname of the VM. You could fix this by adding `--uts` to your `unshare` command. Again, this is just to show you that containers are really just a filesystem + some system calls to isolate the process. -==== - -For this question, like the previous questions, just include some screenshots of your terminals input and output that demonstrate you were able to see the expected results. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project12.adoc b/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project12.adoc deleted file mode 100644 index 90414aae9..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project12.adoc +++ /dev/null @@ -1,598 +0,0 @@ -= TDM 40200: Project 12 -- 2023 - -**Motivation:** Containers are everywhere and a very popular method of packaging an application with all of the requisite dependencies. In the previous series of projects you've built a web application. While right now it may be easy to share and run your application with another individual, as time goes on and packages are updated, this is less and less likely to be the case. Containerizing your application ensures that the application will have the proper versions of the proper packages available in the proper location to run. - -**Context:** This is a second of a series of projects focused on containers. The end goal of this series is to solidify the concept of a container, and enable you to "containerize" the application you've spent the semester building. You will even get the opportunity to deploy your containerized application! - -**Scope:** Python, containers, UNIX - -.Learning Objectives -**** -- Improve your mental model of what a container is and why it is useful. -- Use `docker` to build a container image. -- Understand the difference between the `ENTRYPOINT` and `CMD` commands in a `Dockerfile`. -- Use `docker` to run a container. -- Use `docker` to run a shell inside of a container. 
-**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -In this project, we will begin learning about the various `docker` commands. In addition, we will learn about some of the `Dockerfile` commands, and even build and use some simple containers! - -Like the last project, we will be working in a virtual machine (VM). By using a VM on Anvil, we are able to ensure that everyone has the proper permissions in order to run each and every one of the necessary commands to build and run Docker images. There are 3 main steps needed in order to both get a VM up and running on Anvil resources, and connect and get a shell on the VM from Anvil. - -. Get a terminal on Anvil -- you may complete this part however you like. I like to use `ssh` to connect to Anvil from my local machine, however, you may also use https://ondemand.anvil.rcac.purdue.edu, launch a Jupyter Lab session, and launch a terminal from within Jupyter Lab. Either works equally as well as the other. -. Clear out any potential SLURM environment variables: -+ -[source,bash] ----- -for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done; ----- -+ -. Launch SLURM job with 4 cores and about 8 GB of memory and get a shell into the given backend node: -+ -[source,bash] ----- -salloc -A cis220051 -p shared -n 4 -c 1 -t 04:00:00 ----- -+ -[NOTE] -==== -This job will only buy you 4 hours of time on the backend node. If you need more time, you will need to re-launch the job and change the arguments to `salloc` to request more time. -==== -+ -. Once you have a shell on the backend node, you will need to load the `qemu` module: -+ -[source,bash] ----- -module load qemu ----- -+ -. Next, copy over a fresh VM image to use for this project: -+ -[source,bash] ----- -cp /anvil/projects/tdm/apps/qemu/images/alpine.qcow2 $SCRATCH ----- -+ -[NOTE] -==== -If at any time you want to start fresh, you can simply copy over a new VM image from `/anvil/projects/tdm/apps/qemu/images/alpine.qcow2` to your `$SCRATCH` directory. Any changes you made to the previous image will be lost. This is good to know in case you want to try something crazy but are worried about breaking something! No need to worry, you can simply re-copy the VM image and start fresh anytime! -==== -+ -. The previous command will result in a new file called `alpinel.qcow2` in your `$SCRATCH` directory. This is the VM image you will be using for this project. Now, you will need to launch the VM: -+ -[source,bash] ----- -qemu-system-x86_64 -vnc none,ipv4 -hda $SCRATCH/alpine.qcow2 -m 8G -smp 4 -enable-kvm -net nic -net user,hostfwd=tcp::2200-:22 & ----- -+ -[NOTE] -==== -The last part of the previous command forwards traffic from port 2200 on Anvil to port 22 on the VM. If you receive an error about port 2200 being used, you can change this number to be any other unused port number. To find an unused port you can use a utility we have available to you. - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -find_port ----- - -The `find_port` command will output an unused port for you to use. If, for example, it output `12345`, then you would change the `qemu` command to the following. - -[source,bash] ----- -qemu-system-x86_64 -vnc none,ipv4 -hda $SCRATCH/alpine.qcow2 -m 8G -smp 4 -enable-kvm -net nic -net user,hostfwd=tcp::12345-:22 & ----- -==== -+ -. 
After launching the VM, it will be running in the background as a process (this is what the `&` at the end of the command does). After about 15-30 seconds, the VM will be fully booted and you can connect to the VM from Anvil using the `ssh` command. -+ -[source,bash] ----- -ssh -p 2200 tdm@localhost -o StrictHostKeyChecking=no ----- -+ -You may be prompted for a password for the user `tdm`. The password is simply `purdue`. -+ -[IMPORTANT] -==== -If in a previous step you changed the port from say `2200` to something like `12345`, you would change the `ssh` command accordingly. -==== -. Finally, you should be connected to the VM and have a new shell running _inside_ the VM, great! If you were successful, contents of the terminal should look very similar to the following. - -image::figure51.webp[Successfully connected to the VM, width=792, height=500, loading=lazy, title="Successfully connected to the VM"] - -For this question, just include a screenshot of your terminal after you have successfully connect to the VM. - -[IMPORTANT] -==== -If at any time you would like to "save" your progress and restart the project at a later date or time, you can do this by exiting the VM by running the `exit` command. Next, type `jobs` to find the `qemu` job number (probably 1). Finally, bring the `qemu` command to the foreground by typing either `fg 1` or `fg %1` followed by Ctrl+c. This will kill the VM and you can restart the project at a later date or time by simply using the same `alpine.qcow2` image you used previously. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -Whoa, you may have noticed things look a little bit different from the previous project -- that's okay! We made a few modifications that will be useful to you during this project. Let's test out the most useful new feature. - -In your terminal in the VM, list all of the files as follows: - -[source,bash] ----- -ls -la ----- - -Okay, nothing special yet. That is to be expected. Now, in the same terminal session, type the letter "l" and immediately pause for a second. You will quickly notice that the terminal shows "s -la" in light grey text after your initially typed "l". We've installed a program that remembers your shell history and does its best to predict what you will type based on your previous commands. If you press the right arrow on your keyboard, the rest of the "ls -la" command will be typed out fully. This is an extrememly useful feature, especially as we are juggling various `docker` commands that can be long and confusing. For example, you can type "docker" and start typing the up arrow on your keyboard and this tool will cycle through all of your previous commands that started with "docker". - -[TIP] -==== -Another way to remember/recall previous commands you've run is to open the shell history search interface by holding Ctrl+r and then beginning your search as you type. -==== - -Okay, try running the command `docker` and `docker ps` and `docker images`. Follow these command up with "docker" followed by you pressing the up arrow on your keyboard to cycle through your previous commands. Once you are comfortable with this functionality, go ahead and take a screenshot of some of your outputs from these `docker` commands and include it in your submission. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -=== Question 3 - -Next, let's build a barebones Docker image using the `docker` command and a `Dockerfile`. - -A `Dockerfile` is a text file that contains a set of instructions for building a Docker image. - -The `docker` command is a command line interface (CLI) that allows you to interact with Docker. - -A Docker image is a essentially a zipped up tarball of a file system that contains all of the files and dependencies needed to run a program. You can think of it as the ubuntu filesystem that you extracted and used with `chroot` in the previous project. - -For a very barebones image, your `Dockerfile` will only need to contain two lines. The first line will be a `FROM` command. This command will tell Docker what base image to build on top of. It is very common to choose a barebones operating system image like `alpine` or `ubuntu` as the base image. - -The second line will be a `CMD` command. This command will tell Docker what command to run when a container is started from the image. - -.Dockerfile ----- -FROM OS -CMD ["command", "arg1", "arg2"] ----- - -Use your favorite command line text editor (the image has `nano`, `vim`, and `emacs` installed already) to create a new file called `Dockerfile` in your home directory. Replace "OS" with the base image you want to use. For this project, we will use an ubuntu image from https://hub.docker.com/_/ubuntu[here] -- specifically the newest stable version of ubuntu, which is currently `22.04` -- `ubuntu:22.04`. Here, `ubuntu` is the repository namespace and `22.04` is the tag specifying a version of the image. While _technically_ ubuntu could put all sorts of different containers in the `ubuntu` namespace under different tags, it is customary to use the tag to specify the version of the image. - -Next, replace "command" with the command you want to run when a container is started from the image. For now, let's use the most basic shell that is available on many linux operating systems, `/bin/sh`. If we had multiple arguments to pass to the command, we would add them to the list of arguments after the first argument. For example, if we wanted to run the command `echo "hello world"`, we would use the following `CMD ["echo", "'hello world'"]` command. - -Once complete, save the file and close the text editor. Now, its time to build our image! Run the following command to build the image: - -[source,bash] ----- -cd -docker build -t myfirstimage . ----- - -[NOTE] -==== -Here, we are using the `-t` flag to specify a tag for our image. This tag will be used to refer to our image in the future. In this case, we are using the tag `myfirstimage`. If you want to use a different tag, you can replace `myfirstimage` with whatever you want. The "." denotes the current working directory, which is where our `Dockerfile` is located. If there was no file named `Dockerfile` in our current working directory, we would have to specify the name of the file we want to use by using the `-f ` flag. This is useful if you have multiple dockerfiles in a single directory. -==== - -Once the image is built, you can check to see that it is there by running the following command: - -[source,bash] ----- -docker images ----- - -You will notice that there are 2 images available: `ubuntu:22.04` and `myfirstimage`. The `ubuntu:22.04` image is the base image we used to build our image on top of. The `myfirstimage` image is the image we just built. Very cool! - -Now, let's run our image! 
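-
-(A quick optional aside before we do: if you are curious how the two lines of your `Dockerfile` turned into image layers, `docker history` -- a standard Docker subcommand -- lists the layers of an image. This is only a sanity check and is not required for the question.)
-
-[source,bash]
-----
-docker history myfirstimage
-----
-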
Run the following command to run our image: - -[source,bash] ----- -docker run -it myfirstimage ----- - -[NOTE] -==== -The `-i` stands for interactive -- without it, we would not be able to interact with the container -- commands would just be shown with no output. The `-t` stands for tty -- without it, we would not have a functioning terminal. Essentially, we need both of these flags in order to have a shell running in our container. -==== - -Congratulations! You now have a shell (`/bin/sh`) running in a container! - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Okay, now that you have a shell running in the container, let's take a minute to clarify the _last_ line of our `Dockerfile`. - -There are two important commands that we need to delineate: `CMD` and `ENTRYPOINT`. It is kind of a mess, but it is important to take the time to understand the differences -- otherwise it will be more difficult to debug your containers in the future. - -The following `Dockerfile` has a single `CMD` line. In a `Dockerfile`, there can only be a single `CMD` line -- if there is more, only the _last_ `CMD` line will be respected. - -.Dockerfile ----- -FROM ubuntu:22.04 -CMD ["/bin/sh", "-c", "echo cwd: $PWD"] ----- - -This `CMD` is in _exec_ form. This means that: - -. There is the use of square brackets around the arguments. -. The first argument is an executable file. - -Build the image and run it. What was your output? - -[source,bash] ----- -docker build -t myfirstimage . -docker run -it myfirstimage ----- - -Now, run it and pass in a different command. What was your output? - -[source,bash] ----- -docker run -it myfirstimage /bin/sh ----- - -Modify the `Dockerfile` and rebuild your image. - -.Dockerfile ----- -FROM ubuntu:22.04 -CMD ["echo", "cwd: $PWD"] ----- - -[source,bash] ----- -docker build -t myfirstimage . -docker run -it myfirstimage ----- - -What happens? Instead of "cwd: /" you get "cwd: $PWD". This is because variable substitution doesn't occur since a shell isn't processing the commands. We can, however, use the _shell_ form. This means that: - -. There is _no_ use of square brackets around the arguments. -. The commands and arguments are passed to the `sh` shell. - -.Dockerfile ----- -FROM ubuntu:22.04 -CMD echo "cwd: $PWD" ----- - -[source,bash] ----- -docker build -t myfirstimage . -docker run -it myfirstimage ----- - -You once again get "cwd: /" since the `sh` shell is performing the variable substitution! Behind the scenes it is really running `/bin/sh -c "echo cwd: $PWD"`. - -Finally, there is another series of scenarios that we can explore that have to do with our `ENTRYPOINT`. The first being -- what happens if we are not using the _shell_ form of `CMD` _and_ our first argument is _not_ and executable like `echo` or `/bin/sh`? Let's find out! - -.Dockerfile ----- -FROM ubuntu:22.04 -CMD ["cwd: $PWD"] ----- - -[source,bash] ----- -docker build -t myfirstimage . -docker run -it myfirstimage ----- - -What happens? It fails! Docker doesn't understand what to do with that, since it isn't anything executable. In these scenarios, you need to specify an `ENTRYPOINT`. - -.Dockerfile ----- -FROM ubuntu:22.04 -CMD ["cwd: $PWD"] -ENTRYPOINT ["echo"] ----- - -[source,bash] ----- -docker build -t myfirstimage . -docker run -it myfirstimage ----- - -It works just like when we did the following! - -.Dockerfile ----- -FROM ubuntu:22.04 -CMD ["echo", "cwd: $PWD"] ----- - -Or does it? 
In this case, once again there is no variable substitution because a shell is not processing the commands. However, you will find things _are_ different than before. Before, you could run the following: - -[source,bash] ----- -docker run -it myfirstimage /bin/sh ----- - -The result would be that the contents of the `CMD`, `CMD ["echo", "cwd: $PWD"]`, would be effectively replaced and a shell would be spawned. However, try running it with the following `Dockerfile`. - -.Dockerfile ----- -FROM ubuntu:22.04 -CMD ["cwd: $PWD"] -ENTRYPOINT ["echo"] ----- - -[source,bash] ----- -docker build -t myfirstimage . -docker run -it myfirstimage /bin/sh ----- - -What happens? It does not do what one might expect! It simply prints out "/bin/sh". Why? Well, the arguments after `docker run` do _not_ replace our `ENTRYPOINT`, just our `CMD`. So, in this case, we essentially ran `echo /bin/sh`! In fact, if you gave multiple parameters to `ENTRYPOINT` -- none of them would be replaced. - -.Dockerfile ----- -FROM ubuntu:22.04 -CMD ["cwd: $PWD"] -ENTRYPOINT ["echo", "$PWD"] ----- - -[source,bash] ----- -docker build -t myfirstimage . -docker run -it myfirstimage /bin/sh ----- - -.result ----- -$PWD /bin/sh ----- - -In this example, `ENTRYPOINT` is in _exec_ form -- it has the square brackets. `ENTRYPOINT` also has a _shell_ form. - -.Dockerfile ----- -FROM ubuntu:22.04 -CMD ["cwd: $PWD"] -ENTRYPOINT echo $PWD ----- - -[source,bash] ----- -docker build -t myfirstimage . -docker run -it myfirstimage /bin/sh ----- - -.result ----- -/ ----- - -Here, what is actually being run is `/bin/sh -c "echo $PWD"`. When `ENTRYPOINT` is run using _shell_ form, `CMD` is completely ignored, and, because `CMD` is completely ignored, the `/bin/sh` argument passed as a part of the `docker run` command is also ignored. A side effect is that that signals are not passed properly using this method, this will effect stopping the container and the first process running in the container. - -Hopefully this gives you a taste of the myriad of capabilities that `CMD` and `ENTRYPOINT` provide. It _is_ a mess, however, Docker does provide some "best practices". In a nutshell: - -. Stick to the _exec_ forms for _both_ `CMD` and `ENTRYPOINT`. -. If you want variable substitution to work, directly execute the shell of your choice. Some examples: -+ -.Dockerfile ----- -FROM ubuntu:22.04 -CMD ["/bin/bash", "-c", "echo cwd: $PWD"] ----- -+ -.Dockerfile ----- -FROM ubuntu:22.04 -ENTRYPOINT ["/bin/sh", "-c"] -CMD ["echo cwd: $PWD"] ----- - -For this question, submit the text for a `Dockerfile` that would result in the following output when run. - -[source,bash] ----- -docker run -it myfirstimage ----- - -.result ----- -/bin/bash ----- - -[source,bash] ----- -docker run -it myfirstimage /bin/sh ----- - -.result ----- -# you get an `sh` shell prompt in the container ----- - -[source,bash] ----- -docker run -it myfirstimage 'echo $SHELL -- cool' ----- - -.result ----- -/bin/bash -- cool ----- - -[source,bash] ----- -docker run -it myfirstimage "echo $SHELL -- cool" ----- - -.result ----- -/bin/zsh -- cool ----- - -[IMPORTANT] -==== -For this last example -- remember taht we are in the `zsh` shell ourselves, and that double quotes are _first_ interpreted by our current shell _before_ executed. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Okay, we are making some progress, but the complexity of `ENTRYPOINT` and `CMD` are certainly enough to slow us down a bit. 
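-
-Before we simplify things, here is a quick throwaway experiment you can keep around as a refresher (a sketch only -- the `scratchpad` tag and the `Dockerfile.scratch` filename are just illustrative names, not part of the project). It captures the core rule from the previous question: arguments passed to `docker run` replace `CMD`, but are appended after an _exec_ form `ENTRYPOINT`.
-
-[source,bash]
-----
-# write a tiny throwaway Dockerfile
-cat > Dockerfile.scratch <<'EOF'
-FROM ubuntu:22.04
-ENTRYPOINT ["echo"]
-CMD ["default argument"]
-EOF
-
-docker build -t scratchpad -f Dockerfile.scratch .
-docker run --rm scratchpad                 # prints: default argument
-docker run --rm scratchpad something else  # prints: something else
-----
-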
Rebuild `myfirstimage` with the following `Dockerfile`. - -.Dockerfile ----- -FROM ubuntu:22.04 -CMD ["/bin/bash"] ----- - -[source,bash] ----- -docker build -t myfirstimage . -docker run -it myfirstimage ----- - -Once you are in a shell in your container, go ahead and run the following to create a file called "imhere" in the `/root/` directory. - -[source,bash] ----- -cd /root -touch imhere ----- - -Verify that the file exists: - -[source,bash] ----- -ls -la /root/ ----- - -Great! Now go ahead and `exit` the container. Now, rerun the container, and verify that the file is still there. - -[source,bash] ----- -docker run -it myfirstimage -ls -la /root/ ----- - -It is no longer there! The container executes our shell, we run some commands, and as soon as we `exit` the container our changes are all gone! Containers are ephemeral -- any changes you make will only survive for the duration that the process running the container exists. When we `exit` the container the process is no longer running and our changes disappear. - -There are a couple of ways around this. One is to use a `VOLUME` to bind a location outside of our container somewhere inside our container. We will play with this later on. Another way that is super straightforward is to not let our process exit! We can do this by using the `-d` flag to run the container in the background. Give it a try! - -[source,bash] ----- -docker run -dit myfirstimage ----- - -Whoa, this is way different -- there is a long string of characters that are printed, and then we have our regular shell prompt. What is going on? - -The long string of characters is the container ID. This is a unique identifier for the container. We can use this to interact with the container. - -Okay, great, but do we have to remember that? No! We don't! You can see all _running_ containers using the following command. - -[source,bash] ----- -docker ps ----- - -.output ----- -CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES -b34dd462b664 myfirstimage "/bin/bash" 40 seconds ago Up 39 seconds goofy_goldwasser ----- - -Here, we have an abbreviated container id as well as the name of the image that was used to create the container. Okay, now how do we interact with the container? We can use the `docker exec` command. This command allows us to execute a command inside of a running container. In this case, we want a shell, so let's give it a try. - -[source,bash] ----- -docker exec -it b34dd462b664 /bin/bash - -# or, since b34dd462b664 is hard to type, they give us a user friendly name we can use instead - -docker exec -it goofy_goldwasser /bin/bash ----- - -Great! Let's repeat the previous steps to create the file in `/root/` called "imhere". - -[source,bash] ----- -cd /root -touch imhere -ls -la /root/ ----- - -Now, let's exit the container. Is the container still running? Use `docker ps` to find out. Okay, great! It is still running -- that _should_ mean that if we get another shell inside the container and look for the file, it should still be there. Let's give it a try. - -[source,bash] ----- -docker exec -it goofy_goldwasser /bin/bash -ls -la /root/ ----- - -Indeed, it is! Excellent! While this is great, if, for some reason, the container _restarted_ this file would once again disappear. If we _really_ need some data to persist, we need to use a `VOLUME`, but we will mess around with this in a future project. What else can we do? What other commands are useful? Well, let's make our _running_ container stop. 
- -[source,bash] ----- -docker stop goofy_goldwasser ----- - -[TIP] -==== -Here is a cool feature -- we have the shell configured with a `docker` plugin -- this means we have autocompletion on `docker` related commands. For example, type only "docker stop goo" (or instead of "goo", the very beginning of your container name), then type the "tab" key -- it will autocomplete and type out the rest of the container name! This is _super_ useful! -==== - -After that, check on your container with `docker ps` -- you'll find it has stopped! Very cool. - -Remember how earlier in the project we mentioned that Docker images are just layers of a read-only filesystem compressed into a zipped up tarball? Well, up until this point I haven't seen any actual transferrable files, have you? With `docker` we have the ability to export them using the `docker save` command. Let's do this. - -[source,bash] ----- -docker save myfirstimage > myfirstimage.tar - -# or - -docker save -o myfirstimage.tar myfirstimage ----- - -The result is a tarball that you can transfer to any other system (at least, any with the same architecture) with `Docker` and use `docker load` to load it up. - -[source,bash] ----- -docker load < myfirstimage.tar -docker images - -# or - -docker load -i myfirstimage.tar -docker images ----- - -For this question, just include a screenshot of the terminal contents that demonstrates that you were able to persist the "imhere" file after exiting the container. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project13.adoc b/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project13.adoc deleted file mode 100644 index 1a4b9824d..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project13.adoc +++ /dev/null @@ -1,475 +0,0 @@ -= TDM 40200: Project 13 -- 2023 - -**Motivation:** Containers are everywhere and a very popular method of packaging an application with all of the requisite dependencies. In the previous series of projects you've built a web application. While right now it may be easy to share and run your application with another individual, as time goes on and packages are updated, this is less and less likely to be the case. Containerizing your application ensures that the application will have the proper versions of the proper packages available in the proper location to run. - -**Context:** This is a third of a series of projects focused on containers. The end goal of this series is to solidify the concept of a container, and enable you to "containerize" the application you've spent the semester building. You will even get the opportunity to deploy your containerized application! - -**Scope:** Python, containers, UNIX - -.Learning Objectives -**** -- Improve your mental model of what a container is and why it is useful. -- Use `docker` to build a container image. 
-- Understand the difference between the `ENTRYPOINT` and `CMD` commands in a `Dockerfile`. -- Use `docker` to run a container. -- Use `docker` to run a shell inside of a container. -- Use `docker` to containerize your dashboard application. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -The end goal of this project is to containerize your frontend and backend (into two different containers), and make sure that they can communicate with each other. The following is a rough sketch of the steps involved in this process, so you have a general idea what is next at each step. - -. On Anvil, launch and connect to your VM with Docker pre-installed. -. Copy the frontend and backend code from Anvil to your VM. -. Create a `Dockerfile` (or, what can more generically be referred to as a `Containerfile`) for each of the frontend and backend. -. Use `Docker` to build a container image for each of the frontend and backend. -. Run the containers and make sure they can communicate with each other. - -Ultimately, in the next project, you will be _deploying_ your frontend and backend on a Kubernetes cluster, https://www.rcac.purdue.edu/compute/geddes[Geddes], behind a URL! So, at the very end of this project, we will ask you to verify your access to Geddes (which you've _hopefully_ already been granted). - -For this question, simply prep your working environment. Launch a SLURM job, prop up your VM, and ensure you can connect to it. The only thing you need to submit is a screenshot showing that you can connect to your VM. - -. Get a terminal on Anvil -- you may complete this part however you like. I like to use `ssh` to connect to Anvil from my local machine, however, you may also use https://ondemand.anvil.rcac.purdue.edu, launch a Jupyter Lab session, and launch a terminal from within Jupyter Lab. Either works equally as well as the other. -. Clear out any potential SLURM environment variables: -+ -[source,bash] ----- -for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done; ----- -+ -. Launch SLURM job with 8 cores and about 16 GB of memory and get a shell into the given backend node: -+ -[source,bash] ----- -salloc -A cis220051 -p shared -n 8 -c 1 -t 04:00:00 ----- -+ -[NOTE] -==== -This job will only buy you 4 hours of time on the backend node. If you need more time, you will need to re-launch the job and change the arguments to `salloc` to request more time. -==== -+ -. Once you have a shell on the backend node, you will need to load the `qemu` module: -+ -[source,bash] ----- -module load qemu ----- -+ -. Next, copy over a fresh VM image to use for this project: -+ -[source,bash] ----- -cp /anvil/projects/tdm/apps/qemu/images/alpine.qcow2 $SCRATCH ----- -+ -[NOTE] -==== -If at any time you want to start fresh, you can simply copy over a new VM image from `/anvil/projects/tdm/apps/qemu/images/alpine.qcow2` to your `$SCRATCH` directory. Any changes you made to the previous image will be lost. This is good to know in case you want to try something crazy but are worried about breaking something! No need to worry, you can simply re-copy the VM image and start fresh anytime! -==== -+ -. The previous command will result in a new file called `alpinel.qcow2` in your `$SCRATCH` directory. This is the VM image you will be using for this project. 
Now, you will need to launch the VM: -+ -[source,bash] ----- -qemu-system-x86_64 -vnc none,ipv4 -hda $SCRATCH/alpine.qcow2 -m 8G -smp 4 -enable-kvm -net nic -net user,hostfwd=tcp::2200-:22 & ----- -+ -[NOTE] -==== -The last part of the previous command forwards traffic from port 2200 on Anvil to port 22 on the VM. If you receive an error about port 2200 being used, you can change this number to be any other unused port number. To find an unused port you can use a utility we have available to you. - -[source,bash] ----- -module use /anvil/projects/tdm/opt/core -module load tdm -find_port ----- - -The `find_port` command will output an unused port for you to use. If, for example, it output `12345`, then you would change the `qemu` command to the following. - -[source,bash] ----- -qemu-system-x86_64 -vnc none,ipv4 -hda $SCRATCH/alpine.qcow2 -m 8G -smp 4 -enable-kvm -net nic -net user,hostfwd=tcp::12345-:22 & ----- -==== -+ -. After launching the VM, it will be running in the background as a process (this is what the `&` at the end of the command does). After about 15-30 seconds, the VM will be fully booted and you can connect to the VM from Anvil using the `ssh` command. -+ -[source,bash] ----- -ssh -p 2200 tdm@localhost -o StrictHostKeyChecking=no ----- -+ -You may be prompted for a password for the user `tdm`. The password is simply `purdue`. -+ -[IMPORTANT] -==== -If in a previous step you changed the port from say `2200` to something like `12345`, you would change the `ssh` command accordingly. -==== -. Finally, you should be connected to the VM and have a new shell running _inside_ the VM, great! If you were successful, contents of the terminal should look very similar to the following. - -image::figure51.webp[Successfully connected to the VM, width=792, height=500, loading=lazy, title="Successfully connected to the VM"] - -[IMPORTANT] -==== -If at any time you would like to "save" your progress and restart the project at a later date or time, you can do this by exiting the VM by running the `exit` command. Next, type `jobs` to find the `qemu` job number (probably 1). Finally, bring the `qemu` command to the foreground by typing either `fg 1` or `fg %1` followed by Ctrl+c. This will kill the VM and you can restart the project at a later date or time by simply using the same `alpine.qcow2` image you used previously. -==== - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -The next step is to copy the application `/anvil/projects/tdm/etc/project13` _and_ the database `/anvil/projects/tdm/data/movies_and_tv/imdb.db` to the VM (the database belongs in `/home/tdm` for this project). You can do this by using the `scp` command. `scp` uses `ssh` to securely transfer files between hosts. Remember, your VM is essentially another machine with open port 2200 for `ssh` (and `scp`). Figure out how to accomplish this task and then copy the application to the VM. - -For this question, submit a screenshot of the following on the VM. - -[source,bash] ----- -ls -la /home/tdm/project13/frontend -ls -la /home/tdm/project13/frontend/templates -ls -la /home/tdm/project13/backend/api ----- - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Create two `Dockerfile` files: - -- `/home/tdm/project13/frontend/Dockerfile` -- `/home/tdm/project13/backend/Dockerfile` - -As long as your images build and work correctly, you can use any base image you want. 
However, if you want the potential to get better/faster help (via Piazza), you should use the following base image: `python:3.11.3-slim-bullseye` (https://hub.docker.com/_/python/tags?page=1&name=3.11). - -Here are some general guidelines for your `Dockerfile` files. - -**Frontend** - -. Use the `python:3.11.3-slim-bullseye` base image. -. Optionally use the `WORKDIR` command to set an internal (to the container) working directory `/app`. -. Copy the `project13/frontend` directory to the container, maybe in the `/app` workdir. -+ -[TIP] -==== -You can use `COPY . /app/` to copy the contents of the current directory (the directory where your `Dockerfile` lives) to the `/app` directory in the container. -==== -+ -. Install the required Python packages using `pip`. -+ -[TIP] -==== -The following are the required Python packages: `httpx` and `"fastapi[all]"` (the double quotes are needed). -==== -+ -. Use `EXPOSE` to mark port 8888 as being used by the container. -. Use `CMD` or `ENTRYPOINT` to start the application. -+ -[TIP] -==== -Use the `--host` argument to `uvicorn` and specify `0.0.0.0` to broadcast on all network interfaces. -==== -+ -[TIP] -==== -Since you are running your application from a different perspective than before, you will need to modify `backend.endpoints:app` to `endpoints:app`. -==== - -[TIP] -==== -To build the image, you can use the following command. - -[source,bash] ----- -cd /home/tdm/project13/frontend -docker build -t client . ----- -==== - -**Backend** - -. Use the `python:3.11.3-slim-bullseye` base image. -. Optionally use the `WORKDIR` command to set an internal (to the container) working directory `/app`. -. Copy the `project13/backend` directory to the container, maybe in the `/app` workdir. -+ -[TIP] -==== -You can use `COPY . /app/` to copy the contents of the current directory (the directory where your `Dockerfile` lives) to the `/app` directory in the container. -==== -+ -. Install the required Python packages using `pip`. -+ -[TIP] -==== -The following are the required Python packages: `httpx`, `"fastapi[all]"`, `aiosql==7.2`, and `pydantic` (the double quotes are needed). -==== -+ -. Use `EXPOSE` to mark port 7777 as being used by the container. -. Use `VOLUME` to specify a mount point _inside_ the container. This will be where we will mount `imdb.db` so that our application can access the databse _outside_ of the container. You should use the location `/data`. -. Use `CMD` or `ENTRYPOINT` to start the application. -+ -[TIP] -==== -Use the `--host` argument to `uvicorn` and specify `0.0.0.0` to broadcast on all network interfaces. -==== -+ -[TIP] -==== -Since you are running your application from a different perspective than before, you will need to modify `frontend.api.api:app` to `api.api:app`. -==== - -[TIP] -==== -To build the image, you can use the following command. - -[source,bash] ----- -cd /home/tdm/project13/backend -docker build -t server . ----- -==== - -For this question, include the contents of both of your `Dockerfile` files in your submission. If you make mistakes and need to modify your `Dockerfile` files in future questions, please update your submission for this question to be the functioning `Dockerfile` files. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Okay, awesome! You now have a couple of container images built and available on your VM, named `client` and `server`. You should be able to see these images by running the following command. 
-
-[source,bash]
-----
-docker images
-----
-
-Okay, the next step is to _run_ both of the containers, making sure that they can communicate. Our ultimate goal here is to run the following command and get the following results.
-
-[source,bash]
-----
-curl localhost:8888/people/nm0000148
-----
-
-.results
-----
-<html>
-    <head>
-        <title>Harrison Ford</title>
-    </head>
-    <body>
-        <ul>
-            <li>nm0000148</li>
-            <li>Harrison Ford</li>
-            <li>1942</li>
-            <li>None</li>
-        </ul>
-    </body>
-</html>
- - ----- - -We want those results because it demonstrates, in a single command, a variety of important things: - -. We can access the frontend from the host machine (our VM). -. The frontend can access the backend. -. The backend can access the database. - -This is enough evidence for us to say that our containers are communicating properly and are good enough to deploy (in the next project). - -First thing is first. By default, Docker will add any running container to the `bridge` network. You can see this network listed by running the following. - -[source,bash] ----- -docker network ls ----- - -.output ----- -NETWORK ID NAME DRIVER SCOPE -6c21df067202 bridge bridge local -8acdd7457852 host host local -78e8c707cf0c none null local ----- - -In theory, if you ran our frontend on the network on 0.0.0.0:8888 and the server on the same network at 0.0.0.0:7777, they should be able to communicate. However, with the way we have our frontend configured in `endpoints.py`, it will not work. We can't just specify `localhost` and move on, instead, we would need to specify the actual IP address that the server is assigned on the `bridge` network. This is a bit of a pain, so we are going to create a new user network and run our containers on that network. This way, we can refer to other containers on the same network by their _name_ rather than their IP address. - -Let's create this network. We can call it anything, however, we will call it `tdm-net`. - -[source,bash] ----- -docker network create tdm-net ----- - -Upon success, you should see the network in your list of networks. - -[source,bash] ----- -docker network ls ----- - -.output ----- -NETWORK ID NAME DRIVER SCOPE -6c21df067202 bridge bridge local -8acdd7457852 host host local -78e8c707cf0c none null local -40574054296e tdm-net bridge local ----- - -Now, in order to run our client (frontend) and server (backend) on the `tdm-net` network, we just need to add `--net tdm-net` to our `docker run` commands. Great! - -**Frontend** - -[TIP] -==== -The `-p` flag is used to specify port mappings. The format is `host_port:container_port`. In this case, we are mapping port 8888 on the host to port 8888 on the container. -==== - -[TIP] -==== -It would be best to run this container using `-dit`, liked discussed in the previous project. -==== - -[TIP] -==== -Don't forget to run this container on the `tdm-net` network! -==== - -**Backend** - -[TIP] -==== -By default, we have `endpoints.py` setup to target our host with name `server` and port `7777`. For this to continue to work, you will want to specify the _name_ (which should be "server") of the server container using the `--name` argument with `docker run`. -==== - -[TIP] -==== -The `-p` flag is used to specify port mappings. The format is `host_port:container_port`. In this case, we are mapping port 7777 on the host to port 7777 on the container. -==== - -[TIP] -==== -It would be best to run this container using `-dit`, liked discussed in the previous project. -==== - -[TIP] -==== -Use the `--mount` argument to mount the `/home/tdm/imdb.db` database _outside_ of the container to `/data/imdb.db` _inside_ of the container. Remember, in the `Dockerfile` for the server we specified this location, `/data`, as a mount point for the database. The `type` of the mount is `bind`. See https://docs.docker.com/engine/reference/commandline/run/#mount[here] for more help. -==== - -[TIP] -==== -Don't forget to run this container on the `tdm-net` network! 
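-
-If you want to double check which containers actually ended up attached to the network, `docker network inspect` (a standard Docker subcommand) will show, among other details, the attached containers -- purely a sanity check, not required for the question:
-
-[source,bash]
-----
-docker network inspect tdm-net
-----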
-==== - -**General tips** - -[TIP] -==== -You can see if your containers are running properly by running `docker ps`. You should see both containers running. -==== - -[TIP] -==== -If you need to tear down a running container _named_ `server`, in order to run a newer version of the container, you can run the following. - -[source,bash] ----- -docker kill server # when you use the --name server argument, this name replaces the automatically created names -docker rm server # otherwise, when trying to run a new container with the name server, you will get an error ----- -==== - -[TIP] -==== -If `curl http://localhost:8888/people/nm0000148` does not return what you expect -- you can figure out what is going on by peeking at the _frontend_ logs. You can do this by running the following. - -[source,bash] ----- -docker logs client # this assumes that you used the --name client argument ----- -==== - -[TIP] -==== -If you want to "pop into" a running container, for example, the client, you can do so by running the following. - -[source,bash] ----- -docker exec -it client /bin/bash ----- -==== - -[NOTE] -==== -You may be wondering _why_ we are using `VOLUME` and the `--mount` arguments. The reason why is that, if we were to include `imdb.db` _inside_ the container, via something like `COPY imdb.db /data/imdb.db`, then the database would _not_ be persisted in the case where the container is stopped or restarted. This is a _bad_ situation. To avoid this, we simply mount the `imdb.db` database file _outside_ of the container, on our persistent file system, to be available _inside_ our container. Although inside the container the database appears to be located at `/data/imdb.db`, it _actually_ lives `/home/tdm/imdb.db` on our host, the VM. - -It is very common to have a need to persist some type of data. When this is needed, look towards using `VOLUME` and `--mount`. -==== - -For this question, simply include a screenshot showing the successful `curl` command and output. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Finally, please verify that you have access to two resources for the next project (even if you don't plan on doing it). On the Purdue VPN or on a Purdue network, please visit the following links: - -- https://geddes.rcac.purdue.edu -- https://geddes-registry.rcac.purdue.edu - -**https://geddes.rcac.purdue.edu** - -. Login using your 2-factor authentication (Purdue Login on Duo Mobile). -. Click on the "geddes" name under the "Clusters" section. -. Click on the `Projects/Namespaces` under the "Cluster" tab on the left-hand side. -. Make sure you can see something like "The Data Mine - Students (tdm-students)". If you can, take a screenshot and you are done with this part. If you cannot, please email post in Piazza with your Purdue username and specify that you could _not_ see the Geddes project. - -**https://geddes-registry.rcac.purdue.edu** - -. Login using your Purdue alias and regular password. -. If you get logged in successfully, take a screenshot and you are done with this part. If you cannot, please post in Piazza with your Purdue username and specify that you could _not_ login to the Geddes registry. - -Include both screenshots for this question. If you failed on one or more of the steps, please just specify that you posted in Piazza and you will receive full credit. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. 
-==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project14.adoc b/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project14.adoc deleted file mode 100644 index 462d3f20f..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-project14.adoc +++ /dev/null @@ -1,234 +0,0 @@ -= TDM 40200: Project 14 -- 2023 - -**Motivation:** Containers are everywhere and a very popular method of packaging an application with all of the requisite dependencies. In the previous series of projects you've built a web application. While right now it may be easy to share and run your application with another individual, as time goes on and packages are updated, this is less and less likely to be the case. Containerizing your application ensures that the application will have the proper versions of the proper packages available in the proper location to run. - -**Context:** This is a final of a series of projects focused on containers. The end goal of this series is to solidify the concept of a container, and enable you to "containerize" the application you've spent the semester building. You will even get the opportunity to deploy your containerized application! - -**Scope:** Python, containers, UNIX - -.Learning Objectives -**** -- Improve your mental model of what a container is and why it is useful. -- Use `docker` to build a container image. -- Understand the difference between the `ENTRYPOINT` and `CMD` commands in a `Dockerfile`. -- Use `docker` to run a container. -- Use `docker` to run a shell inside of a container. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 - -[WARNING] -==== -Please be sure to start this project early in the week so resources are not constrained on Friday. -==== - -[WARNING] -==== -If at any time you get stuck on this project, please create a Piazza post. We are happy to help! -==== - -The ultimate goal of this project is to take your containerized web application from the previous project, and deploy it on Geddes, Purdue's Kubernetes cluster. In order to save you some time, we already performed operations akin to the following. These images are already available for you to use with this project. 
- -[source,bash] ----- -# change the name/tag of the previously tagged "client" image to the fully qualified tag -docker tag client geddes-registry.rcac.purdue.edu/tdm/student-client:0.0.1 -docker tag server geddes-registry.rcac.purdue.edu/tdm/student-server:0.0.1 - -# you can see that they are different now -docker images - -# login to the registry -docker login geddes-registry.rcac.purdue.edu - -# push the images to the registry -docker push geddes-registry.rcac.purdue.edu/tdm/student-client:0.0.1 -docker push geddes-registry.rcac.purdue.edu/tdm/student-server:0.0.1 ----- - -Login to geddes-registry.rcac.purdue.edu and verify that the images are there (they are there, we put them there). When you find the images, take a screenshot and include it in your submission. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 2 - -The rest of this project will be performed using a web browser and https://geddes.rcac.purdue.edu, and that is it! It should be pretty quick! You just have to follow the instructions! - -First thing is first. We need to create a place where we can store our `imdb.db` database that is _persistent_. You can imagine this as a thumb drive with our `imdb.db` file on it that we plug into our container and it appears at `/data/imdb.db`. If our container restarts, or is redeployed, we can just plug our thumb drive into that database with no data lost. - -Okay, great. To do this in Kubernetes, we need to create a persistent volume claim. - -. Click on Storage in the left hand menu. -. Click on PersistentVolumeClaims in the left hand menu. -. Click the large blue "Create" button in the top right. -. Fill out "Name" with `yourname` where you replace `yourname` with your actual name. -. Select "geddes-standard-multinode" for Storage Class. -. Write "15" in the Request Storage field. -. Click the blue "Create" button. - -Take a screenshot that shows your claim named `yourname` under the "tdm-students" namespace. Include this screenshot in your submission. - -Okay okay, with all of that being said, we will _not_ be using this for the project. We don't want you to worry about transferring the 10 GB database to this filestore. Instead, in the next question, when you are creating your deployment, you will select one we have that is already pre-loaded with the database. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 3 - -Next, it is time to deploy your server. - -. Click on Workload in the left hand menu. -. Click the large blue "Create" button in the top right. -. Click on "Deployment" from the 5 shown options. -. Fill out "Name" with `tdm-yourname-server` where you replace `yourname` with your actual name. -. Type "geddes-registry.rcac.purdue.edu/tdm/student-server:0.0.1" in the "Container Image" field. -. Click on the "Labels & Annotations" tab. -. Click the "Add Label" button under the "Pod Labels" section. -. Use the key `tdm-yourname-server-label` and value `tdm-yourname-server-label` where you replace `yourname` with your actual name. -. Click on the Resources tab. -. Put 200 in the CPU Reservation and CPU Limit fields. -. Put 1000 in the Memory Reservation and Memory Limit fields. -. Click on the Storage tab. -. Click on the Add Volume drop down. -. Select "Persistent Volume Claim" from the list. _Not_ "Create Persistent Volume Claim" -- we already did that. -. 
Name the volume `yourname-vol`, and select `kevin` from the Persistent Volume Claim dropdown. -+ -[NOTE] -==== -`kevin` is the volume that already has our database preloaded. -==== -+ -. Type `/data` and _nothing_ (leave it blank) for Mount Point and Sub Path in Volume, respectively. -. Click the big blue "Create" button in the bottom right. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 4 - -Next, it is time to deploy your client. - -. Click on Workload in the left hand menu. -. Click the large blue "Create" button in the top right. -. Click on "Deployment" from the 5 shown options. -. Fill out "Name" with `tdm-yourname-client` where you replace `yourname` with your actual name. -. Type "geddes-registry.rcac.purdue.edu/tdm/student-client:0.0.1" in the "Container Image" field. -. Click on "Add Variable" in the "Environment Variables" section. -. Add the Variable Name "SERVER_HOST" and Value "yourname-server.tdm-students.geddes.rcac.purdue.edu". -+ -[NOTE] -==== -Replace "yourname" with your actual name. Do _not_ put "tdm-yourname-server" in the value field -- we do not want to target the server container itself, but rather the server's _service_ (which have not yet created). -==== -+ -[NOTE] -==== -The client container looks for a "SERVER_HOST" environment variable so we know _where_ to make requests to. -==== -+ -. Click on the "Labels & Annotations" tab. -. Click the "Add Label" button under the "Pod Labels" section. -. Use the key `tdm-yourname-client-label` and value `tdm-yourname-client-label` where you replace `yourname` with your actual name. -. Click on the Resources tab. -. Put 100 in the CPU Reservation and CPU Limit fields. -. Put 1000 in the Memory Reservation and Memory Limit fields. -. Click the big blue "Create" button in the bottom right. - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -=== Question 5 - -Next, it is time to setup the _services_ for your server and client. Specifically you will create a load balancer service for each of your server and client. These load balancers will map a given URL to a specific container. We will be setting up a service to map `yourname-server.tdm-students.geddes.rcac.purdue.edu` to your server container, and `yourname.tdm-students.geddes.rcac.purdue.edu` to your client container. - -Ultimately, you will be able to open a browser and type in `http://yourname.tdm-students.geddes.rcac.purdue.edu/people/nm0000148` and see the client container's web interface! - -. Click on "Services" under "Service Discovery" in the left hand menu. -. Click on the large blue "Create" button in the top right. -. Click on "Load Balancer" from the 5 shown options. -. For Name, type in `yourname-server` where you replace `yourname` with your actual name. -. Put anything you'd like in the "Port Name" field. -. Put 7777 for both the Listening Port and Target Port. -. Click on the "Selectors" tab. -. Put `tdm-yourname-server-label` in the "Key" field and `tdm-yourname-server-label` in the "Value" field. Once typed in, you should see a green box popup showing that it was able to identify the container you are targeting -- our server container. -. Click on the "Labels & Annotations" tab. -. Click on "Add Annotation". -. Put `metallb.universe.tf/address-pool` in the "Key" field and `geddes-public-pool` in the "Value" field. 
-+ -[NOTE] -==== -This annotation uses the metallb library to tell the load balancer to use the `geddes-public-pool` address pool. This is the pool that research computing has configured to expose to the world, rather than just the Purdue network. So this means you can navigate to `yourname.tdm-students.geddes.rcac.purdue.edu` from your home computer, _without_ being on Purdue's VPN. If we instead used `geddes-private-pool`, you would only be able to access the service from within Purdue's network. -==== -+ -. Click on the large blue "Save" button in the bottom right. - -Your server should now be running at `yourname-server.tdm-students.geddes.rcac.purdue.edu:7777`. Congratulations! - -Next, we want to create a service for the _client_. - -. Click on "Services" under "Service Discovery" in the left hand menu. -. Click on the large blue "Create" button in the top right. -. Click on "Load Balancer" from the 5 shown options. -. For Name, type in `yourname` where you replace `yourname` with your actual name. -. Put anything you'd like in the "Port Name" field. -. Put 80 for the Listening Port and 8888 for the Target Port. -+ -[NOTE] -==== -This makes it so you do not need to navigate to `yourname.tdm-students.geddes.rcac.purdue.edu:8888` to access the client, but rather just `yourname.tdm-students.geddes.rcac.purdue.edu`. By default, when the port isn't specified, traffic is sent to port 80. Then, the traffic sent to port 80 is forwarded to port 8888 on our container. -==== -+ -. Click on the "Selectors" tab. -. Put `tdm-yourname-client-label` in the "Key" field and `tdm-yourname-client-label` in the "Value" field. Once typed in, you should see a green box popup showing that it was able to identify the container you are targeting -- our client container. -. Click on the "Labels & Annotations" tab. -. Click on "Add Annotation". -. Put `metallb.universe.tf/address-pool` in the "Key" field and `geddes-public-pool` in the "Value" field. -. Click on the large blue "Save" button in the bottom right. - -Finally, it is time to see if everything has been deployed! Open a browser and navigate to `http://yourname.tdm-students.geddes.rcac.purdue.edu/people/nm0000148`. If you see some sort of error message having to do with security, please make sure you are using `http` instead of `https`. If your browser keeps changing http to https, you can make it work by installing Firefox. Once installed, click on the menu, then Settings, then Privacy and Security, then "Don't enable HTTPS-Only Mode". Next, in the URL, type `about:config`, search for "browser.fixup.fallback-to-https", and set it to "false". Restart Firefox completely, and try navigating to `http://yourname.tdm-students.geddes.rcac.purdue.edu/people/nm0000148` again. It may take 30 seconds or so for the page to load -- we are using _very_ few resources! - -Once you see the proper web page, take a screenshot for your final submission. Make sure to include the URL in the screenshot! - -[WARNING] -==== -Please make sure to follow the steps below to free up resources for your fellow students! Thank you! -==== - -Finally, once you've acquired your screenshot, please delete your deployments. You can do this as follows. - -. Click on "Deployments" under "Workloads" in the left hand menu. -. Click on the three dots to the right of your deployment. **Make sure it is YOUR deployment.** -. Click on "Delete". -. Click on "Services" under "Service Discovery" in the left hand menu. -. Click on the three dots to the right of your service. 
**Make sure it is YOUR service.** -. Click on "Delete". - -.Items to submit -==== -- Code used to solve this problem. -- Output from running the code. -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-projects.adoc b/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-projects.adoc deleted file mode 100644 index 24c769705..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/40200/40200-2023-projects.adoc +++ /dev/null @@ -1,47 +0,0 @@ -= TDM 40200 - -== Project links - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$40200-2023-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:55pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -[NOTE] -==== -Piazza links remain the same from Fall 2022 to Spring 2023. -==== - -=== Sign up - -https://piazza.com/purdue/fall2022/tdm40100[https://piazza.com/purdue/fall2022/tdm40100] - -=== Link - -https://piazza.com/purdue/fall2022/tdm40100/home[https://piazza.com/purdue/fall2022/tdm40100/home] - - -== Syllabus - -Navigate to the xref:spring2023/logistics/syllabus.adoc[syllabus]. diff --git a/projects-appendix/modules/ROOT/pages/spring2023/logistics/TA/office_hours.adoc b/projects-appendix/modules/ROOT/pages/spring2023/logistics/TA/office_hours.adoc deleted file mode 100644 index 179e99e5c..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/logistics/TA/office_hours.adoc +++ /dev/null @@ -1,39 +0,0 @@ -= Office Hours Spring 2023 - -[IMPORTANT] -==== -Check here to find the most up to date office hour schedule. 
-==== - -[NOTE] -==== -**Office hours _during_ seminar:** Hillenbrand C141 -- the atrium inside the dining court + -**Office hours _outside_ of seminar, before 5:00 PM EST:** Hillenbrand Lobby C100 -- the lobby between the 2 sets of front entrances + -**Office hours _after_ 5:00 PM EST:** Online in Webex + -**Office hours on the _weekend_:** Online in Webex -==== - -Navigate between tabs to view office hour schedules for each course and find Webex links to online office hours. -//[NOTE] -//==== -//The schedule will be available here once it is finalized closer to the course start date. -//==== - -++++ - -++++ - - -== About the Office Hours in The Data Mine - -During Spring 2023, office hours will be in person in Hillenbrand Hall during popular on-campus hours, and online via Webex during later hours (starting at 5:00PM). Each TA holding an online office hour will have their own WebEx meeting setup, so students will need to click on the appropriate WebEx link to join office hours. In the meeting room, the student and the TA can share screens with each other and have vocal conversations, as well as typed chat conversations. You will need to use the computer audio feature, rather than calling in to the meeting. There is a WebEx app available for your phone, too, but it does not have as many features as the computer version. - -The priority is to have a well-staffed set of office hours that meets student traffic needs. **We aim to have office hours when students need them most.** - -Each online TA meeting will have a maximum of 7 other people able to join at one time. Students should enter the meeting room to ask their question, and when their question is answered, the student should leave the meeting room so that others can have a turn. Students are welcome to re-enter the meeting room when they have another question. If a TA meeting room is full, please wait a few minutes to try again, or try a different TA who has office hours at the same time. - -Students can also use Piazza to ask questions. The TAs will be monitoring Piazza during their office hours. TAs should try and help all students, regardless of course. If a TA is unable to help a student resolve an issue, the TA might help the student to identify an office hour with a TA that can help, or encourage the student to post in Piazza. - -The weekly projects are due on Friday evenings at 11:59 PM through Gradescope in Brightspace. All the seminar times are on Mondays. New projects are released on Thursdays, so students have 8 days to work on each project. - -All times listed are Purdue time (Eastern). diff --git a/projects-appendix/modules/ROOT/pages/spring2023/logistics/TA/ta_schedule.adoc b/projects-appendix/modules/ROOT/pages/spring2023/logistics/TA/ta_schedule.adoc deleted file mode 100644 index a4deff650..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/logistics/TA/ta_schedule.adoc +++ /dev/null @@ -1,6 +0,0 @@ -= Seminar TA Spring 2023 Schedule - -++++ - -++++ \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2023/logistics/schedule.adoc b/projects-appendix/modules/ROOT/pages/spring2023/logistics/schedule.adoc deleted file mode 100644 index 9d3795594..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/logistics/schedule.adoc +++ /dev/null @@ -1,129 +0,0 @@ -= Spring 2023 Course Schedule - Seminar - -Assignment due dates are listed in *BOLD*. Other dates are important notes. - -*Remember, only your top 10 out of 14 project scores are factored into your final grade. 
- -[cols="^.^1,^.^3,<.^15"] -|=== - -|*Week* |*Date* ^.|*Activity* - -|1 -|1/9 - 1/13 -|Monday, 1/9: First day of spring 2023 classes - - - -|2 -|1/16 - 1/20 -|Monday, 1/16: MARTIN LUTHER KING JR. DAY - No classes - -*Project #1 due on Gradescope by 11:59 PM ET on Friday, 1/20* - -*Syllabus Quiz due on *Gradescope by 11:59 PM ET on Friday, 1/20* - -*Academic Integrity Quiz due on *Gradescope by 11:59 PM ET on Friday, 1/20* - - -|3 -|1/23 - 1/27 -| *Project #2 due on Gradescope by 11:59 PM ET on Friday, 1/27* - - - -|4 -|1/30 - 2/3 -| *Project #3 due on Gradescope by 11:59 PM ET on Friday, 2/3* - -*Outside Event #1 due on Gradescope by 11:59 PM ET on Friday, 2/3* - - -|5 -|2/6 - 2/10 -|*Project #4 due on Gradescope by 11:59 PM ET on Friday, 2/10* - - - -|6 -|2/13 - 2/17 -| *Project #5 due on Gradescope by 11:59 PM ET on Friday, 2/17* - - - -|7 -|2/20 - 2/24 -|*Project #6 due on Gradescope by 11:59 PM ET on Friday, 2/24* - - - -|8 -|2/27 - 3/3 -|*Project #7 due on Gradescope by 11:59 PM ET on Friday, 3/3* - -*Outside Event #2 due on Gradescope by 11:59 PM ET on Friday, 3/3* - -|9 -|3/6 - 3/10 -|*Project #8 due on Gradescope by 11:59 PM ET on Friday, 3/10* - - - -|10 -|3/13 - 3/17 -|SPRING VACATION - No Classes - - - -|11 -|3/20 - 3/24 -|*Project #9 due on Gradescope by 11:59 PM ET on Friday, 3/24* - - - -|12 -|3/27 - 3/31 -|*Project #10 due on Gradescope by 11:59 PM ET on Friday, 3/31* - - - -|13 -|4/4 - 4/7 -|*Project #11 due on Gradescope by 11:59 PM ET on Friday, 4/7* - - - -|14 -|4/10 - 4/14 -|*Project #12 due on Gradescope by 11:59 PM ET on Friday, 4/14* - -*Outside Event #3 due on Gradescope by 11:59 PM ET on Friday, 4/14* - - -|15 -|4/17 - 4/21 -|*Project #13 due on Gradescope by 11:59 PM ET on Friday, 4/21* - - - -|16 -|4/24 - 4/28 -|*Project #14 due on Gradescope by 11:59 PM ET on Friday, 4/28* - -Saturday, 4/29: Last day of spring 2023 classes. - - - -| -|5/1 - 5/6 -|Final Exam Week - There are no final exams in The Data Mine. - - - -| -|5/9 -|Tuesday, 5/9: Spring 2023 grades are submitted to Registrar's Office by 5 PM Eastern - - - -|=== diff --git a/projects-appendix/modules/ROOT/pages/spring2023/logistics/syllabus.adoc b/projects-appendix/modules/ROOT/pages/spring2023/logistics/syllabus.adoc deleted file mode 100644 index d329545ec..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2023/logistics/syllabus.adoc +++ /dev/null @@ -1,323 +0,0 @@ -= Spring 2023 Syllabus - The Data Mine Seminar - -== Course Information - - -[%header,format=csv,stripes=even] -|=== -Course Number and Title, CRN -TDM 10200 - The Data Mine II, possible CRNs to be announced -TDM 20200 - The Data Mine IV, possible CRNs to be announced -TDM 30200 - The Data Mine VI, possible CRNs to be announced -TDM 40200 - The Data Mine VIII, possible CRNs to be announced -TDM 50100 - The Data Mine Seminar, CRN to be announced -|=== - -*Course credit hours:* 1 credit hour, so you should expect to spend about 3 hours per week doing work -for the class - -*Prerequisites:* -It is expected that students in TDM 10200 have completed TDM 10100 with a passing grade. (Similar expectations for progressing from TDM 20100 to 20200..., etc.) All students, regardless of background are welcome. TDM 50100 is geared toward graduate students and can be taken repeatedly. We can make adjustments on an individual basis if needed. 
- -=== Course Web Pages - -- link:https://the-examples-book.com/[*The Examples Book*] - All information will be posted within The Examples Book -- link:https://www.gradescope.com/[*Gradescope*] - All projects and outside events will be submitted on Gradescope -- link:https://purdue.brightspace.com/[*Brightspace*] - Grades will be posted in Brightspace. Students will also take the quizzes at the beginning of the semester on Brightspace -- link:https://datamine.purdue.edu[*The Data Mine's website*] - helpful resource -- link:https://ondemand.anvil.rcac.purdue.edu/[*Jupyter Lab via the On Demand Gateway on Anvil*] - -=== Meeting Times -There are officially 4 Monday class times: 8:30 am, 9:30 am, 10:30 am (all in the Hillenbrand Dining Court atrium--no meal swipe required), and 4:30 pm (link:https://purdue-edu.zoom.us/my/mdward[synchronous online], recorded and posted later). All the information you need to work on the projects each week will be provided online on the Thursday of the previous week, and we encourage you to get a head start on the projects before class time. Dr. Ward does not lecture during the class meetings, but this is a good time to ask questions and get help from Dr. Ward, the T.A.s, and your classmates. Attendance is not required. The T.A.s will have many daytime and evening office hours throughout the week. - -=== Course Description - -The Data Mine is a supportive environment for students in any major from any background who want to learn some data science skills. Students will have hands-on experience with computational tools for representing, extracting, manipulating, interpreting, transforming, and visualizing data, especially big data sets, and in effectively communicating insights about data. Topics include: the R environment, Python, visualizing data, UNIX, bash, regular expressions, SQL, XML and scraping data from the internet, as well as selected advanced topics, as time permits. - -=== Learning Outcomes - -By the end of the course, you will be able to: - -1. Discover data science and professional development opportunities in order to prepare for a career. -2. Explain the difference between research computing and basic personal computing data science capabilities in order to know which system is appropriate for a data science project. -3. Design efficient search strategies in order to acquire new data science skills. -4. Devise the most appropriate data science strategy in order to answer a research question. -5. Apply data science techniques in order to answer a research question about a big data set. - - - -=== Required Materials - -* A laptop so that you can easily work with others. Having audio/video capabilities is useful. -* Brightspace course page. -* Access to Jupyter Lab at the On Demand Gateway on Anvil: -https://ondemand.anvil.rcac.purdue.edu/ -* "The Examples Book": https://the-examples-book.com -* Good internet connection. - - - -=== Attendance Policy - -While everything we are doing in The Data Mine this semester can be done online, rather than in person, and no part of your seminar grade comes from attendance, we want to remind you of general campus attendance policies during COVID-19. Students should stay home and contact the Protect Purdue Health Center (496-INFO) if they feel ill, have any symptoms associated with COVID-19, or suspect they have been exposed to the virus.
In the current context of COVID-19, in-person attendance will not be a factor in the final grades, but the student still needs to inform the instructor of any conflict that can be anticipated and will affect the submission of an assignment. Only the instructor can excuse a student from a course requirement or responsibility. When conflicts can be anticipated, such as for many University-sponsored activities and religious observations, the student should inform the instructor of the situation as far in advance as possible. For unanticipated or emergency conflict, when advance notification to an instructor is not possible, the student should contact the instructor as soon as possible by email or by phone. When the student is unable to make direct contact with the instructor and is unable to leave word with the instructor's department because of circumstances beyond the student's control, and in cases of bereavement, quarantine, or isolation, the student or the student's representative should contact the Office of the Dean of Students via email or phone at 765-494-1747. Below are links on Attendance and Grief Absence policies under the University Policies menu. - - -== Information about the Instructors - -=== The Data Mine Staff - -[%header,format=csv] -|=== -Name, Title, Email -Shared email we all read, , datamine@purdue.edu -Kevin Amstutz, Senior Data Scientist and Instruction Specialist, kamstut@purdue.edu -Maggie Betz, Managing Director of Corporate Partnerships, betz@purdue.edu -David Glass, Managing Director of Data Science, dglass@purdue.edu -Kali Lacy, Associate Research Engineer, kqlacy@purdue.edu -Naomi Mersinger, ASL Interpreter / Strategic Initiatives Coordinator ,nrm@purdue.edu -Kim Rechkemmer, Senior Program Administration Specialist, kimr@purdue.edu -Nick Rosenorn, Corporate Partners Technical Specialist, nrosenor@purdue.edu -Nicholas Lenfestey, Corporate Partners Technical Specialist, nlenfest@purdue.edu -Emily Hoeing, Corporate Partners Advisor, hoeinge@purdue.edu -Katie Sanders, Operations Manager, kmpechin@purdue.edu -Rebecca Sharples, Managing Director of Academic Programs & Outreach, rebecca@purdue.edu -Dr. Mark Daniel Ward, Director, mdw@purdue.edu - -|=== - - -*For the purposes of getting help with this 1-credit seminar class, your most important people are:* - -* *T.A.s*: Visit their xref:spring2023/logistics/TA/officehours.adoc[office hours] and use the link:https://piazza.com/[Piazza site] -* *Mr. Kevin Amstutz*, Senior Data Scientist and Instruction Specialist - Piazza is preferred method of questions -* *Dr. Mark Daniel Ward*, Director: Dr. Ward responds to questions on Piazza faster than by email - - -=== Communication Guidance - -* *For questions about how to do the homework, use Piazza or visit office hours*. You will receive the fastest email by using Piazza versus emailing us. -* For general Data Mine questions, email datamine-help@purdue.edu -* For regrade requests, use Gradescope's regrade feature within Brightspace. Regrades should be -requested within 1 week of the grade being posted. - - -=== Office Hours - -The xref:spring2023/logistics/TA/officehours.adoc[office hours schedule is posted here.] - -Office hours are held in person in Hillenbrand lobby and on Zoom. Check the schedule to see the available schedule. - -Piazza is an online discussion board where students can post questions at any time, and Data Mine staff or T.A.s will respond. Piazza is available through Brightspace. There are private and public postings. 
Last year we had over 11,000 interactions on Piazza, and the typical response time was around 5-10 minutes. - - -== Assignments and Grades - - -=== Course Schedule & Due Dates - -xref:spring2023/logistics/schedule.adoc[Click here to view the Spring 2023 Course Schedule] - -See the schedule and later parts of the syllabus for more details, but here is an overview of how the course works: - -In the first week of the semester, you will have some "housekeeping" tasks to do, which include taking the Syllabus quiz and Academic Integrity quiz. - -Generally, every week from the very beginning of the semester, you will have your new projects released on a Thursday, and they are due 8 days later on the Friday at 11:55 pm Purdue West Lafayette (Eastern) time. You will need to do 3 Outside Event reflections. - -We will have 14 weekly projects available, but we only count your best 10. This means you could miss up to 4 projects due to illness or other reasons, and it won't hurt your grade. We suggest trying to do as many projects as possible so that you can keep up with the material. The projects are much less stressful if they aren't done at the last minute, and it is possible that our systems will be stressed if you wait until Friday night, causing unexpected behavior and long wait times. Try to start your projects on or before Monday each week to leave yourself time to ask questions. - -[cols="4,1"] -|=== - -|Projects (best 10 out of Projects #1-14) |86% -|Outside event reflections (3 total) |12% -|Academic Integrity Quiz |1% -|Syllabus Quiz |1% -|*Total* |*100%* - -|=== - - - - -=== Grading Scale -In this class, grades reflect your achievement throughout the semester in the various course components listed above. Your grades will be maintained in Brightspace. This course will follow the 90-80-70-60 grading scale for A, B, C, D cut-offs. If you earn a 90.000 in the class, for example, that is a solid A. +/- grades will be given at the instructor's discretion below these cut-offs. If you earn an 89.11 in the class, for example, this may be an A- or a B+. - -* A: 100.000% - 90.000% -* B: 89.999% - 80.000% -* C: 79.999% - 70.000% -* D: 69.999% - 60.000% -* F: 59.999% - 0.000% - - - -=== Late Policy - -We generally do NOT accept late work. For the projects, we count only your best 10 out of 14, so that gives you a lot of flexibility. We need to be able to post answer keys for the rest of the class in a timely manner, and we can't do this if we are waiting for other students to turn their work in. - - -=== Projects - -* The projects will help you achieve Learning Outcomes #2-5. -* Each weekly programming project is worth 10 points. -* There will be 14 projects available over the semester, and your best 10 will count. -* The 4 project grades that are dropped could be from illnesses, absences, travel, family -emergencies, or simply low scores. No excuses necessary. -* No late work will be accepted, even if you are having technical difficulties, so do not work at the -last minute. -* There are many opportunities to get help throughout the week, either through Piazza or office -hours. We're waiting for you! Ask questions! -* Follow the instructions for how to submit your projects properly through Gradescope in -Brightspace. -* It is OK to get help from others or online, although it is important to document this help in the -comment sections of your project submission. You need to say who helped you and how they -helped you.
-* Each week, the project will be posted on the Thursday before the seminar, the project will be -the topic of the seminar and any office hours that week, and then the project will be due by -11:55 pm Eastern time on the following Friday. See the schedule for specific dates. -* If you need to request a regrade on any part of your project, use the regrade request feature -inside Gradescope. The regrade request needs to be submitted within one week of the grade being posted (we send an announcement about this). - - -=== Outside Event Reflections - -* The Outside Event reflections will help you achieve Learning Outcome #1. They are an opportunity for you to learn more about data science applications, career development, and diversity. -* Throughout the semester, The Data Mine will have many special events and speakers, typically happening in person so you can interact with the presenter, but some may be online and possibly recorded. -* These eligible opportunities will be posted on The Data Mine's website (https://datamine.purdue.edu/events/) and updated frequently. Feel free to suggest good events that you hear about, too. -* You are required to attend 3 of these over the semester, with 1 due each month. See the schedule for specific due dates. -* You are welcome to do all 3 reflections early. For example, you could submit all 3 reflections in September. -* You must submit your outside event reflection within 1 week of attending the event or watching the recording. -* Follow the instructions on Brightspace for writing and submitting these reflections. -* At least one of these events should be on the topic of Professional Development. These -events will be designated by "PD" next to the event on the schedule. -// * [Revised] For each of the 3 required events, write a minimum 1-page (double-spaced, 12-pt font) reflection that includes the name of the event and speaker, the time and date of the event, what was discussed at the event, what you learned from it, what new ideas you would like to explore as a result of what you learned at the event, and what question(s) you would like to ask the presenter if you met them at an after-presentation reception. This should not be just a list of notes you took from the event--it is a reflection. The header of your reflection should not take up more than 2 lines! -* This semester you will answer questions directly in Gradescope including the name of the event and speaker, the time and date of the event, what was discussed at the event, what you learned from it, what new ideas you would like to explore as a result of what you learned at the event, and what question(s) you would like to ask the presenter if you met them at an after-presentation reception. This should not be just a list of notes you took from the event--it is a reflection. -* We read every single reflection! We care about what you write! We have used these connections to provide new opportunities for you, to thank our speakers, and to learn more about what interests you. - - -== How to succeed in this course - -If you would like to be a successful Data Mine student: - -* Be excited to challenge yourself and learn impressive new skills. Don't get discouraged if something is difficult--you're here because you want to learn, not because you already know everything! -* Start on the weekly projects on or before Mondays so that you have plenty of time to get help from your classmates, TAs, and Data Mine staff. Don't wait until the due date to start! 
-* Remember that Data Mine staff and TAs are excited to work with you! Take advantage of us as resources. -* Network! Get to know your classmates, even if you don't see them in an actual classroom. You are all part of The Data Mine because you share interests and goals. You have over 800 potential new friends! -* Use "The Examples Book" with lots of explanations and examples to get you started. Google, Stack Overflow, etc. are all great, but "The Examples Book" has been carefully put together to be the most useful to you. https://the-examples-book.com -* Expect to spend approximately 3 hours per week on the projects. Some might take less time, and occasionally some might take more. -* Don't forget about the syllabus quiz, academic integrity quiz, and outside event reflections. They all contribute to your grade and are part of the course for a reason. -* If you get behind or feel overwhelmed about this course or anything else, please talk to us! -* Stay on top of deadlines. Announcements will also be sent out every Monday morning, but you -should keep a copy of the course schedule where you see it easily. -* Read your emails! - - - -== Purdue Policies & Resources - -=== Academic Guidance in the Event a Student is Quarantined/Isolated - -If you must miss class at any point in time during the semester, please reach out to me via email so that we can communicate about how you can maintain your academic progress. If you find yourself too sick to progress in the course, notify your adviser and notify me via email or Brightspace. We will make arrangements based on your particular situation. Please note the link:https://protect.purdue.edu/updates/video-update-protect-purdue-fall-expectations/[Protect Purdue fall 2022 expectations] announced on the Protect Purdue website. - -=== Class Behavior - -You are expected to behave in a way that promotes a welcoming, inclusive, productive learning environment. You need to be prepared for your individual and group work each week, and you need to include everybody in your group in any discussions. Respond promptly to all communications and show up for any appointments that are scheduled. If your group is having trouble working well together, try hard to talk through the difficulties--this is an important skill to have for future professional experiences. If you are still having difficulties, ask The Data Mine staff to meet with your group. - -=== Academic Integrity - -Academic integrity is one of the highest values that Purdue University holds. Individuals are encouraged to alert university officials to potential breaches of this value by either link:mailto:integrity@purdue.edu[emailing] or by calling 765-494-8778. While information may be submitted anonymously, the more information that is submitted provides the greatest opportunity for the university to investigate the concern. - -In TDM 10200/20200/30200/40200/50100, we encourage students to work together. However, there is a difference between good collaboration and academic misconduct. We expect you to read over this list, and you will be held responsible for violating these rules. We are serious about protecting the hard-working students in this course. We want a grade for The Data Mine seminar to have value for everyone and to represent what you truly know. We may punish both the student who cheats and the student who allows or enables another student to cheat. 
Punishment could include receiving a 0 on a project, receiving an F for the course, and incidents of academic misconduct reported to the Office of The Dean of Students. - -*Good Collaboration:* - -* First try the project yourself, on your own. -* After trying the project yourself, then get together with a small group of other students who -have also tried the project themselves to discuss ideas for how to do the more difficult problems. Document in the comments section any suggestions you took from your classmates or your TA. -* Finish the project on your own so that what you turn in truly represents your own understanding of the material. -* Look up potential solutions for how to do part of the project online, but document in the comments section where you found the information. -* If the assignment involves writing a long, worded explanation, you may proofread somebody's completed written work and allow them to proofread your work. Do this only after you have both completed your own assignments, though. - -*Academic Misconduct:* - -* Divide up the problems among a group. (You do #1, I'll do #2, and he'll do #3: then we'll share our work to get the assignment done more quickly.) -* Attend a group work session without having first worked all of the problems yourself. -* Allowing your partners to do all of the work while you copy answers down, or allowing an -unprepared partner to copy your answers. -* Letting another student copy your work or doing the work for them. -* Sharing files or typing on somebody else's computer or in their computing account. -* Getting help from a classmate or a TA without documenting that help in the comments section. -* Looking up a potential solution online without documenting that help in the comments section. -* Reading someone else's answers before you have completed your work. -* Have a tutor or TA work though all (or some) of your problems for you. -* Uploading, downloading, or using old course materials from Course Hero, Chegg, or similar sites. -* Using the same outside event reflection (or parts of it) more than once. Using an outside event reflection from a previous semester. -* Using somebody else's outside event reflection rather than attending the event yourself. - -The link:https://www.purdue.edu/odos/osrr/honor-pledge/about.html[Purdue Honor Pledge] "As a boilermaker pursuing academic excellence, I pledge to be honest and true in all that I do. Accountable together - we are Purdue" - -Please refer to the link:https://www.purdue.edu/odos/osrr/academic-integrity/index.html[student guide for academic integrity] for more details. - - -*Purdue's Copyrighted Materials Policy:* - -Among the materials that may be protected by copyright law are the lectures, notes, and other material presented in class or as part of the course. Always assume the materials presented by an instructor are protected by copyright unless the instructor has stated otherwise. Students enrolled in, and authorized visitors to, Purdue University courses are permitted to take notes, which they may use for individual/group study or for other non-commercial purposes reasonably arising from enrollment in the course or the University generally. -Notes taken in class are, however, generally considered to be "derivative works" of the instructor's presentations and materials, and they are thus subject to the instructor's copyright in such presentations and materials. 
No individual is permitted to sell or otherwise barter notes, either to other students or to any commercial concern, for a course without the express written permission of the course instructor. To obtain permission to sell or barter notes, the individual wishing to sell or barter the notes must be registered in the course or must be an approved visitor to the class. Course instructors may choose to grant or not grant such permission at their own discretion, and may require a review of the notes prior to their being sold or bartered. If they do grant such permission, they may revoke it at any time, if they so choose. - -=== Nondiscrimination Statement -Purdue University is committed to maintaining a community which recognizes and values the inherent worth and dignity of every person; fosters tolerance, sensitivity, understanding, and mutual respect among its members; and encourages each individual to strive to reach his or her own potential. In pursuit of its goal of academic excellence, the University seeks to develop and nurture diversity. The University believes that diversity among its many members strengthens the institution, stimulates creativity, promotes the exchange of ideas, and enriches campus life. link:https://www.purdue.edu/purdue/ea_eou_statement.php[Link to Purdue's nondiscrimination policy statement.] - -=== Students with Disabilities -Purdue University strives to make learning experiences as accessible as possible. If you anticipate or experience physical or academic barriers based on disability, you are welcome to let me know so that we can discuss options. You are also encouraged to contact the Disability Resource Center at: link:mailto:drc@purdue.edu[drc@purdue.edu] or by phone: 765-494-1247. - -If you have been certified by the Office of the Dean of Students as someone needing a course adaptation or accommodation because of a disability OR if you need special arrangements in case the building must be evacuated, please contact The Data Mine staff during the first week of classes. We are happy to help you. - -=== Mental Health Resources - -* *If you find yourself beginning to feel some stress, anxiety and/or feeling slightly overwhelmed,* try link:https://purdue.welltrack.com/[WellTrack]. Sign in and find information and tools at your fingertips, available to you at any time. -* *If you need support and information about options and resources*, please contact or see the link:https://www.purdue.edu/odos/[Office of the Dean of Students]. Call 765-494-1747. Hours of operation are M-F, 8 am- 5 pm. -* *If you find yourself struggling to find a healthy balance between academics, social life, stress*, etc. sign up for free one-on-one virtual or in-person sessions with a link:https://www.purdue.edu/recwell/fitness-wellness/wellness/one-on-one-coaching/wellness-coaching.php[Purdue Wellness Coach at RecWell]. Student coaches can help you navigate through barriers and challenges toward your goals throughout the semester. Sign up is completely free and can be done on BoilerConnect. If you have any questions, please contact Purdue Wellness at evans240@purdue.edu. -* *If you're struggling and need mental health services:* Purdue University is committed to advancing the mental health and well-being of its students. If you or someone you know is feeling overwhelmed, depressed, and/or in need of mental health support, services are available. 
For help, such individuals should contact link:https://www.purdue.edu/caps/[Counseling and Psychological Services (CAPS)] at 765-494-6995 during and after hours, on weekends and holidays, or by going to the CAPS office of the second floor of the Purdue University Student Health Center (PUSH) during business hours. - -=== Violent Behavior Policy - -Purdue University is committed to providing a safe and secure campus environment for members of the university community. Purdue strives to create an educational environment for students and a work environment for employees that promote educational and career goals. Violent Behavior impedes such goals. Therefore, Violent Behavior is prohibited in or on any University Facility or while participating in any university activity. See the link:https://www.purdue.edu/policies/facilities-safety/iva3.html[University's full violent behavior policy] for more detail. - -=== Diversity and Inclusion Statement - -In our discussions, structured and unstructured, we will explore a variety of challenging issues, which can help us enhance our understanding of different experiences and perspectives. This can be challenging, but in overcoming these challenges we find the greatest rewards. While we will design guidelines as a group, everyone should remember the following points: - -* We are all in the process of learning about others and their experiences. Please speak with me, anonymously if needed, if something has made you uncomfortable. -* Intention and impact are not always aligned, and we should respect the impact something may have on someone even if it was not the speaker's intention. -* We all come to the class with a variety of experiences and a range of expertise, we should respect these in others while critically examining them in ourselves. - -=== Basic Needs Security Resources - -Any student who faces challenges securing their food or housing and believes this may affect their performance in the course is urged to contact the Dean of Students for support. There is no appointment needed and Student Support Services is available to serve students from 8:00 - 5:00, Monday through Friday. The link:https://www.purdue.edu/vpsl/leadership/About/ACE_Campus_Pantry.html[ACE Campus Food Pantry] is open to the entire Purdue community). - -Considering the significant disruptions caused by the current global crisis as it related to COVID-19, students may submit requests for emergency assistance from the link:https://www.purdue.edu/odos/resources/critical-need-fund.html[Critical Needs Fund]. - -=== Course Evaluation - -During the last two weeks of the semester, you will be provided with an opportunity to give anonymous feedback on this course and your instructor. Purdue uses an online course evaluation system. You will receive an official email from evaluation administrators with a link to the online evaluation site. You will have up to 10 days to complete this evaluation. Your participation is an integral part of this course, and your feedback is vital to improving education at Purdue University. I strongly urge you to participate in the evaluation system. - -You may email feedback to us anytime at link:mailto:datamine@purdue.edu[datamine@purdue.edu]. We take feedback from our students seriously, as we want to create the best learning experience for you! 
- -=== General Classroom Guidance Regarding Protect Purdue - -Any student who has substantial reason to believe that another person is threatening the safety of others by not complying with Protect Purdue protocols is encouraged to report the behavior to and discuss the next steps with their instructor. Students also have the option of reporting the behavior to the link:https://purdue.edu/odos/osrr/[Office of the Student Rights and Responsibilities]. See also link:https://catalog.purdue.edu/content.php?catoid=7&navoid=2852#purdue-university-bill-of-student-rights[Purdue University Bill of Student Rights] and the Violent Behavior Policy under University Resources in Brightspace. - -=== Campus Emergencies - -In the event of a major campus emergency, course requirements, deadlines and grading percentages are subject to changes that may be necessitated by a revised semester calendar or other circumstances. Here are ways to get information about changes in this course: - -* Brightspace or by e-mail from Data Mine staff. -* General information about a campus emergency can be found on the Purdue website: link:www.purdue.edu[]. - - -=== Illness and other student emergencies - -Students with *extended* illnesses should contact their instructor as soon as possible so that arrangements can be made for keeping up with the course. Extended absences/illnesses/emergencies should also go through the Office of the Dean of Students. - -=== Disclaimer -This syllabus is subject to change. Changes will be made by an announcement in Brightspace and the corresponding course content will be updated. - diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project01.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project01.adoc deleted file mode 100644 index fa77948d2..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project01.adoc +++ /dev/null @@ -1,212 +0,0 @@ -= TDM 10200: Project 1 -- 2024 - -**Motivation:** In this project we are going to jump head first into The Data Mine. We will load datasets into our environment, and introduce some core programming concepts like variables, vectors, types, etc. As we will be "living" primarily in an IDE called Jupyter Lab, we will take some time to learn how to connect to it, configure it, and run code. - -**Context:** This is our first project this spring semester. We will get situated, configure the environment we will be using throughout our time with The Data Mine, and jump straight into working with data! - -**Scope:** Python, Jupyter Lab, Anvil - -.Learning Objectives -**** -- Learn how to run Python code in Jupyter Lab on Anvil. -- Read and write basic (csv) data using Python. -**** - -== Dataset(s) - -The questions in this project will use the following dataset(s): - -- `/anvil/projects/tdm/data/forest/ENTIRE_COUNTY.csv` - -== Readings and Resources - -[NOTE] -==== -Login to Anvil. -Navigate and login to https://ondemand.anvil.rcac.purdue.edu using your ACCESS credentials (and Duo Mobile). You may refer to https://the-examples-book.com/starter-guides/anvil/[Anvil Introduction] and https://the-examples-book.com/starter-guides/anvil/access-setup[Access Setup] to find out how to Login to Anvil. -==== - -== Questions - -=== Question 1 (2 pts) - -.. Copy the template into your home directory and save it, to start your Project 1. 
- -[NOTE] -==== -This year, the first step to starting any project should be to get a copy of our project template using File / Open from URL and downloading the template from: https://the-examples-book.com/projects/current-projects/_attachments/project_template.ipynb[https://the-examples-book.com/projects/current-projects/_attachments/project_template.ipynb] - -Open the project template and then *save it into your home directory*, in a new notebook named `firstname-lastname-project01.ipynb`. - -Fill out the project template, replacing the default text with your own information. If a category is not applicable to you (for example, if you did _not_ work on this project with someone else), put N/A. -==== - -[NOTE] -==== -To learn more about how to run various types of code using this kernel, see https://the-examples-book.com/projects/current-projects/templates[our template page]. -==== - - - -=== Question 2 (2 pts) - -[NOTE] -==== -In the upper right-hand corner of your Jupyter Lab notebook, you will see the current kernel for the notebook, `seminar` (do not use `seminar-r` for this project). If you click on the name of the kernel, then you will have the option to change kernels. - -We added a video about https://the-examples-book.com/projects/current-projects/templates#running-python-code-using-the-seminar-kernel[Running Python code using the seminar kernel] -==== - -[loweralpha] - -.. In the first cell, write: print("Hello World!") and then type Control-Return to run the Python code in the cell. The output: Hello World! should appear below the cell after you run it. - -[TIP] -==== -[source,python] ----- -print("Hello World!") ----- -==== - - -=== Question 3 (2 pts) - -[loweralpha] - -.. Ask the user to input an integer, and assign it to a variable named `num1` like this: -`num1 = int(input("Enter an integer: "))` - -.. Ask the user to input a second integer, using a second prompt, and store this second result into a variable named `num2`. - -.. Add the values of `num1` and `num2` and print a string that says: `The sum of the two numbers is: [result here]` - -[NOTE] -==== - -There are different data types in Python. Some of the built in types include: - -* Integer (int) -* Float (float) -* string (str) -* types can include list, tuple, range -* Mapping data type (dict) -* Boolean type (bool) - -Numeric - -. int - holds signed integers of non-limited length. -. long- holds long integers (exists in Python 2.x, deprecated in Python 3.x). -. float- holds floating precision numbers and it is accurate up to 15 decimal places. -. complex- holds complex numbers. - -String - a sequence of characters, generally strings are represented by single or double-quotes - -List - ordered sequence of data written using square brackets *[]* and commas *(,)*. - -Tuple - similar to a list but immutable. Data is written using a parenthesis *()* and commas *(,)*. - -Dictionary - an unordered sequence of key-value pairs. -==== - - -=== Question 4 (2 pts) - -Read this StackOverflow page: - -https://stackoverflow.com/questions/74969278/write-a-program-to-store-seven-fruits-in-a-list-entered-by-the-user - -.. Now declare a list named `myfruits` and make a loop that asks the user for the names of 5 fruits, and adds each fruit to the list. - -.. Print the list of the 5 fruits. (Any format of output is OK, as long as you print all 5 fruits.) - -[NOTE] -==== -When you have a `range` in Python, the last number in the range is not used. 
So, for instance, if you use `for i in range(1,5):` then Python will only ask for 4 fruits. In Python, you would need to use `for i in range(1,6):` since the last number in the range is ignored! -==== - -[TIP] -==== -We made a https://the-examples-book.com/programming-languages/python/inputtingdata[video about inputting data into Python in a Jupyter Lab session] -==== - - -=== Question 5 (2 pts) - -.. Read the "/anvil/projects/tdm/data/forest/ENTIRE_COUNTY.csv" dataset into Python and assign it to a variable named `forest` -.. Use different methods to display the dataset information, for instance: -... info, info(), shape, size, columns, len - -[TIP] -==== -We added a https://the-examples-book.com/programming-languages/python/pandas-shape[video about the shape of Pandas Data Frames] -==== - -[TIP] -==== -It is OK to use Google to find webpages to help with our projects. Please document any webpages that you use for help, when you are working on the project. You need to list all such webpages in the project, either at the start of the template or within the question directly. For instance, when Dr Ward used Google to help with this question, this webpage was useful: - -https://note.nkmk.me/en/python-pandas-len-shape-size/ -==== - - -[TIP] -==== -To import the dataset for this question, this code should work: - -[source,python] ----- -import pandas as pd -forest = pd.read_csv("/anvil/projects/tdm/data/forest/ENTIRE_COUNTY.csv") ----- -==== - -[IMPORTANT] -==== -Submit your completed Project 1: one Jupyter notebook and one Python script file. - -Now that you are done with the project, please note that, for this course, we will turn in a variety of files, depending on the project. - -We will *always* require a Jupyter Notebook file built from the template described above. Jupyter Notebook files always end in an extension `.ipynb`. This file is our "source of truth", and it is what the graders will look at first, when grading the projects. - -If we are working in Python, we will also need you to build a Python file (ending with a `.py` extension too). Please see the note below. -==== - -[TIP] -==== -We added a video about https://the-examples-book.com/projects/current-projects/submissions#how-to-make-a-python-file[How to make a Python file] -==== - -[NOTE] -==== -An `.ipynb` file is generated by first running every cell in the notebook, and then clicking the "Download" button from menu:File[Download]. - -In addition to the `.ipynb`, if a project uses Python code, you will also need to submit a Python script. A Python script is just a text file with the extension `.py`. - -Let's practice. Take the Python code from this project and copy and paste it into a text file with the `.py` extension. Call it `firstname-lastname-project01.py`. Download your `.ipynb` file -- making sure that the output from all of your code is present and in the notebook. (The `.ipynb` file will also be referred to as "your notebook" or "Jupyter notebook".) - -Once complete, submit your notebook, and submit your Python script. You need to submit them to Gradescope together, as one submission, at the same time, because Gradescope only keeps track of the last submission to each project. -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in Gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output, when in fact it does not.
- -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== - -Project 01 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments, and output for the assignment - ** `firstname-lastname-project01.ipynb`. -* Python file for the assignment - ** `firstname-lastname-project01.py`. -* Submit your files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project02.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project02.adoc deleted file mode 100644 index 4c0a2b22b..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project02.adoc +++ /dev/null @@ -1,133 +0,0 @@ -= TDM 10200: Project 2 -- 2024 - -**Motivation:** Pandas will enable us to work with data in Python. If you were enrolled in The Data Mine in the fall semester, you will recognize some similarities to data frames from R. If you are new to The Data Mine, you will likely find that Pandas makes it easy to work with data. Matplotlib is a widely-used Python library for creating visualizations. - -**Context:** This is our second project, and we will continue to introduce some basic data types and basic operations using pandas and matplotlib. - -**Scope:** tuples, lists, pandas, matplotlib - -.Learning Objectives -**** -- Familiarity with Python data types -- Basic pandas operations -- Basic matplotlib operations -**** - -== Dataset(s) - -You will use the following dataset(s) for the questions in this project: - -- `/anvil/projects/tdm/data/craigslist/vehicles.csv` - -== Readings and Resources - -* Make sure to read about, and use the template found xref:templates.adoc#option-1[here], and the important information about project submissions xref:submissions.adoc[here]. - -* Please review the following Examples Book pages before you start the project, and be sure to try some of these examples! These will help you be prepared for the project questions below. -- https://the-examples-book.com/programming-languages/python/tuples[Python tuples] -- https://the-examples-book.com/programming-languages/python/lists[Python lists] -- https://the-examples-book.com/programming-languages/python/pandas-dataframes[Python pandas DataFrames] -- https://the-examples-book.com/programming-languages/python/matplotlib[matplotlib] - - -== Questions - -=== Question 1 (2 pts) - -[loweralpha] -.. Create a list called `mydata` that contains 6 tuples. Each tuple should have a student's first name, age and major. (You may make up the students' information.) -.. Use a https://the-examples-book.com/programming-languages/python/pandas-dataframes#dataframe-constructor[DataFrame Constructor] to convert `mydata` into a DataFrame named `studentDF`. -..
Use "iloc[]" to extract and display the second student's information in the DataFrame - -[TIP] -==== -You may get more information about "iloc[]" https://www.w3schools.com/python/pandas/ref_df_iloc.asp[here] -==== - -[TIP] -==== -We added a https://the-examples-book.com/programming-languages/python/pandas-dataframe-constructor[video about using the Pandas Data Frame Constructor] -==== - -=== Question 2 (2 pts) - -[WARNING] -==== -For question 2, when you run: -[source,python] ----- -import pandas as pd -myDF = pd.read_csv("/anvil/projects/tdm/data/craigslist/vehicles.csv") ----- -You need to use 3 cores in your Jupyter Lab session. If you started your Jupyter Lab session with only 1 core, just close your Jupyter Lab session and start a new session that uses 3 cores. Otherwise, your kernel will crash when you load the data. - -We added a video about https://the-examples-book.com/starter-guides/anvil/starting-an-anvil-session[starting an anvil session with more cores] -==== - -[loweralpha] - -.. Read in the dataset `/anvil/projects/tdm/data/craigslist/vehicles.csv` into a `pandas` DataFrame called `myDF`. (Optional: If you want to, you can use the first column `id` as the DataFrame's index, but this is not required.) -.. Display the first and last five rows of the `myDF` DataFrame. - -[TIP] -==== -[source,python] ----- -.head() -.tail() ----- -==== - - -=== Question 3 (2 pts) - -[loweralpha] - -.. Display how many rows and columns there are in the entire DataFrame `myDF`. -.. Display a list of all the column names in the DataFrame `myDF`. - -[TIP] -==== -You can revisit the functions given in Project 1, Question 5, to help with both parts of this question. -==== - -=== Question 4 (2 pts) - -Use the data from `myDF` to answer the following questions: - -[loweralpha] -.. How many vehicles have a price that is strictly larger than $6000? -.. How many vehicles are from Indiana? How many are from Texas? -.. Display all of the regions listed in the data frame. You can use the `unique()` method on the `region` column of `myDF`. How many different regions appear altogether (counting each region just once)? - -[TIP] -==== -We added a https://the-examples-book.com/programming-languages/python/pandas-breweries-examples#breweries-per-state[video about counting the number of entries per state] (This is a different data set than the vehicles data, but it should help guide you about how to solve Question 4, because we are still counting items per state, just using breweries instead of vehicles, but the method is the same.) -==== - -=== Question 5 (2 pts) - -[loweralpha] -.. Plot a bar chart that illustrates the number of vehicles in each state, whose price is strictly lower than $6000. The bar chart should show the number of each of these vehicles in each state. - -[TIP] -==== -We added a two part video about making such a bar chart. See the https://the-examples-book.com/programming-languages/python/pandas-breweries-examples#part-1-video[part 1 video] and the https://the-examples-book.com/programming-languages/python/pandas-breweries-examples#part-2-video[part 2 video]. Note: The example videos are about the number of reviews per user (instead of the number of vehicles per state), but the method is the same, and these videos should help to guide your work on Question 5. -==== - -Project 02 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project02.ipynb`. 
-* Python file with code and comments for the assignment - ** `firstname-lastname-project02.py` - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project03.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project03.adoc deleted file mode 100644 index 6bee9922a..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project03.adoc +++ /dev/null @@ -1,164 +0,0 @@ -= TDM 10200: Project 3 -- 2024 - -**Motivation:** Learning about Big Data. When working with large data sets, it is important to know how we can use control flow to find our information, a little bit at a time, without reading in all of the files at once. Control flow is the order that your code runs. - - -**Scope:** Python, Control Flow, if statements, for loops - -== Dataset(s) - -/anvil/projects/tdm/data/noaa - -== Readings and Resources - -[NOTE] -==== - -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -- Please review https://the-examples-book.com/programming-languages/python/control-flow[this] Examples Book content about Control Flow -==== - -[IMPORTANT] -==== -We added https://the-examples-book.com/programming-languages/python/looping-through-files[some videos] to help you with Project 3. -==== - -== Questions - -=== Question 1 (2 pts) - -[loweralpha] - -.. Explore the files in the provided data set directory. Find out how many years are included in the data set. Briefly describe the contents of the files. -.. Import `pandas` and `pathlib` using: `import pandas as pd` and also `from pathlib import Path` -.. Create a list named `myfiles`, to hold `Path` objects from `1880.csv` to `1883.csv` in the data set folder using `list comprehension`. You can start with the following sample code (below), but *you need to modify this for loop*, to use `list comprehension`. -+ -[TIP] -==== -Following is the sample code that will return a "Path" object for the file `1750.csv`. -[source,python] -Path("/anvil/projects/tdm/data/noaa/1750.csv") - -You can start with a for loop, to create a list of Path objects, as follows, BUT we want you to modify this example, to use `list comprehension`. -[source,python] -myfiles=[] -for year in range (1880, 1884): - file= Path(f'/anvil/projects/tdm/data/noaa/{year}.csv') - myfiles.append(file) -print(myfiles) -==== - -=== Question 2 (2 pts) - -.. Calculate how many records are in the file `1880.csv`. (Each line is one record.) -+ -[TIP] -==== -The following is the sample code to calculate records in one sample file object named `file`: -[source, python] -with open(file,"r") as f: - mycount = 0 - for line in f: - mycount += 1 -print(f'There are {mycount} records in the file called {file}') - -There are 370779 records in the file called /anvil/projects/tdm/data/noaa/1880.csv -==== -.. Calculate how many records there are (altogether) in the 4 files from `1880.csv` to `1883.csv`. 
Use the list `myfiles` that you created in Question 1. Your output should give the total number of records altogether, so it should say something like: - -There are [put your number of records here] records in the 4 files altogether. - -[TIP] -==== -- You may use a for loop to iterate over the `myfiles` object, like this: -[source,python] -for file in myfiles: - ...# body of the for loop -==== - - -=== Question 3 (2 pts) - -.. Run the following statement, to read in the first file from the list `myfiles` into a DataFrame using `myDF = pd.read_csv(myfiles[0])`. Display the column names for `myDF`. Look at the head and tail of `myDF`. Do you see anything unexpected? -.. Please modify your work from Question 3a, to correct the problem that you found. What are the column names now? Hint: Using the `header=None` argument will be useful. -.. Now let us add these 8 column names: `id`, `date`, `element_code`, `value`, `mflag`, `qflag`, `sflag`, and `obstime` to the data frame. You can do this using: `pd.read_csv(myfiles[0],names = ["id","date","element_code","value","mflag","qflag","sflag","obstime"])` -.. Make a list called `mydataframes` (of length 4) that contains 4 data frames, one for each year, from `1880.csv` to `1883.csv`. Starting with the sample code (above) for reading in the first file, modify our example, so that you have a "for" loop that reads in all 4 files. Test your work with a `for` loop that displays the column names of each of the four data frames in `mydataframes`. You can show the column names of `myDF` using `myDF.columns`. - -=== Question 4 (2 pts) - -Let's look at the column `element_code`. Use a loop to answer the following questions for all 4 DataFrames: - -.. Print out the (unique) elements of the column `element_code` (i.e., show each element just one time). -.. Find the number of times that `SNOW` occurs in the `element_code` column. - -[TIP] -==== -- The method `unique()` will be useful to calculate unique values. -- You may use different methods to find the number of times that `SNOW` occurs, for instance, `len()`, `value_counts()`, `sum()`, etc. -==== - - -=== Question 5 (2 pts) - -Now let us practice using the `chunksize` feature for big data. You may refer to https://www.geeksforgeeks.org/how-to-load-a-massive-file-as-small-chunks-in-pandas/[this document], to get more information about `chunksize`. - -.. Try to run the following 2 programs, to find the number of times that `SNOW` occurs in the `element_code` column, from the year 1880 to the year 1883. Explain your understanding of `chunksize`.
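[TIP]
====
Before running the two programs below, it may help to see what `chunksize` does on a single file. The sketch below is just our own illustration of the idea (it is not a required part of the question): `pd.read_csv(..., chunksize=10000)` returns an iterator of smaller DataFrames instead of one big DataFrame, so only one chunk of 10,000 rows needs to be in memory at a time.

[source,python]
----
import pandas as pd

mycolumns = ["id","date","element_code","value","mflag","qflag","sflag","obstime"]

total_rows = 0
snow_rows = 0
# each myDF below is a DataFrame with (at most) 10000 rows from 1880.csv
for myDF in pd.read_csv("/anvil/projects/tdm/data/noaa/1880.csv", names=mycolumns, chunksize=10000):
    total_rows += len(myDF)
    snow_rows += (myDF["element_code"] == "SNOW").sum()

print(f"{snow_rows} SNOW records out of {total_rows} total records in 1880.csv")
----
====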
- -Pre-work for the programs: - -[source, python] ----- -import pandas as pd -from pathlib import Path -myfiles=[] -for year in range (1880, 1884): - file= Path(f'/anvil/projects/tdm/data/noaa/{year}.csv') - myfiles.append(file) ----- - -Version 1 of the program: - -[source, python] ----- -count = 0 -for file in myfiles: - for myDF in pd.read_csv(file,names=["id","date","element_code","value","mflag","qflag","sflag","obstime"],chunksize =10000): - count += len(myDF[myDF['element_code'] == 'SNOW']) - -print(count) ----- - -Version 2 of the program: - -[source,python] ----- -count = 0 -for file in myfiles: - for myDF in pd.read_csv(file,names=["id","date","element_code","value","mflag","qflag","sflag","obstime"],chunksize =10000): - for index, row in myDF.iterrows(): - if row['element_code'] == 'SNOW': - count += 1 - -print(count) ----- - - - -Project 03 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project03.ipynb`. -* Python file with code and comments for the assignment - ** `firstname-lastname-project03.py` - -* Submit files through Gradescope -==== - - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project04.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project04.adoc deleted file mode 100644 index cff034d19..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project04.adoc +++ /dev/null @@ -1,79 +0,0 @@ -= TDM 10200: Project 4 -- 2024 - -**Motivation:** In the last project, we began exploring control flow, including if statements and for loops. In this project, we will delve deeper into the concept of loops. There are three main types of loops that we will focus on: for loops, while loops, and nested loops. - -**Context:** We will continue to work on some basic data types and will discuss some similar control flow concepts. - -**Scope:** loops, basic data structures such as tuples, lists, loops, dict - -== Dataset - -All questions in this project will use the data sets in this directory: - -`/anvil/projects/tdm/data/noaa/` - -Just as we did in Project 3, please remember to use `header=None` when you read in the data set, and remember to use: - -`names=["id","date","element_code","value","mflag","qflag","sflag","obstime"]` - -to create the column names. - -== Readings and Resources - -[NOTE] -==== -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -==== - -[IMPORTANT] -==== -We added https://the-examples-book.com/programming-languages/python/some-examples-for-TDM-10200-project-4[three new videos] to help you with Project 4. -==== - -== Questions - -=== Question 1 (2 points) - -[loweralpha] -.. Use Pandas iterrows to loop over the `1880.csv` data set. For any row that has information about precipitation (in the element code), if the value column in that row is more than 1200, print the date for that row. 
(Hint: Ten rows meet this condition, so you should print a total of 10 dates.) -.. Same question, also using the `1880.csv` data set, but now we would like you to use indexing to answer the same question from question 1a. -.. Which of these two approaches is faster? Which one do you understand more? - -=== Question 2 (2 points) - -[loweralpha] -.. Write a for loop that displays the average precipitation per year, for each year from 1800 to 1850. On each line of your output, print the year and the average precipitation for that year. -.. Change your for loop to a while loop, which prints the average precipitation, for each year from 1800 to 1850 BUT stops printing after the first year with average precipitation 22 or higher. (Hint: You will see that, because of the behavior of your while loop, it should print the average precipitation for the years 1800 to 1813.) - -=== Question 3 (2 points) - -[loweralpha] -.. For the `1880.csv` data, find the average precipitation for each `id`. Which `id` has the largest average precipitation? (Hint: The average precipitation for this `id` is 610; which `id` has that largest average precipitation?) -.. What is the average precipitation for the `id` `USC00288878`? - -=== Question 4 (2 points) - -[loweralpha] -.. Change the results from question 3a into a dictionary. (Hint: Depending on how you solved question 3a, if you did it like Dr Ward did it, you probably got a series in question 3a, and you can probably use the `to_dict()` method to convert the series into a dictionary.) - -=== Question 5 (2 points) - -[loweralpha] -.. You have used while loops and for loops on this project. Please choose any two data sets that we have posted on Anvil (they do not have to be NOAA data sets from `/anvil/projects/tdm/data/noaa/`. It is totally OK to use any data set in `/anvil/projects/tdm/data/`) and use a while loop or a for loop and explore the data. Show us something cool that you learned about these two data sets that you picked. Anything is OK; you have some freedom to explore! - -Project 04 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project04.ipynb`. -* Python file with code and comments for the assignment - ** `firstname-lastname-project04.py` - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project05.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project05.adoc deleted file mode 100644 index 05a391eaf..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project05.adoc +++ /dev/null @@ -1,82 +0,0 @@ -= TDM 10200: Project 5 -- 2024 - -**Motivation:** Once we have some data analysis working in Python, we often want to wrap it into a function. Dr Ward usually tests anything that he wrote (usually 5 times), to make sure it works, before wrapping it into a function. Once we are sure our analysis works, if we wrap it into a function, it can usually be easier to use. 
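[TIP]
====
Here is a minimal sketch of that workflow (the file path, column names, and function name below are only an illustration; they are not the ones required in the questions that follow). First test a line of analysis directly, a few times, and only then wrap it into a function with a docstring:

[source,python]
----
import pandas as pd

mycolumns = ["id","date","element_code","value","mflag","qflag","sflag","obstime"]

# step 1: test the analysis directly, a few times, on one file
myDF = pd.read_csv("/anvil/projects/tdm/data/noaa/1880.csv", names=mycolumns)
print(myDF[myDF["element_code"] == "SNOW"]["value"].mean())

# step 2: once it works, wrap the same analysis into a function
def average_value(file_location, element_code="SNOW"):
    """Return the average `value` for rows with the given element_code."""
    myDF = pd.read_csv(file_location, names=mycolumns)
    return myDF[myDF["element_code"] == element_code]["value"].mean()

# step 3: verify that the function gives the same answer as step 1
print(average_value("/anvil/projects/tdm/data/noaa/1880.csv"))
----
====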
- - -**Context:** Functions also help us to put our work into bite-size pieces that are easier to understand. The basic idea is similar to functions from R or from other languages and tools. - -**Scope:** functions - -== Datasets - -`/anvil/projects/tdm/data/noaa/{year}.csv` - -== Reading and Resources - -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -- We strongly encourage you to (please) look at our many examples of https://the-examples-book.com/programming-languages/python/writing-functions[Python Functions] -- Dr Ward added 4 additional videos that are https://the-examples-book.com/programming-languages/python/writing-functions-page-2[specific to Project 5] - -== Questions - -=== Question 1 (2 points) - -[loweralpha] - -.. Write a function called `avg_aggreg_temp` that takes 5 parameters: the `file_location` as a string, the `column_title_list` as a list of column titles, the `start_date` as an integer, the `end_date` as an integer, and the temperature `element_code` as a string, with default value `"TAVG"`. Your function should output the average `value` for rows with that `element_code`. For instance, in the default case (where `element_code` is `"TAVG"`), your function should output the average `value` which is the average of the average temperatures (as a decimal number). -.. Run the function for on the data set `2018.csv`, using `start_date` 20180101 and `end_date` 20180115, and with `"TAVG"` as the `element_code`. - -[NOTE] -==== -- Remember that the NOAA data does not include column titles, so use `column_title_list=["id","date","element_code","value","mflag","qflag","sflag","obstime"]` -- The data set is approximately 2.2 GB. It is OK to use only 1 core for your Jupyter Lab session, if you set chunksize = 10000 as you read in data. -- Input the `start_date` and the `end_date` as integers in the form `yyyymmdd` -- You will need to include a docstring that clearly explains how the function is defined. -==== - - -=== Question 2 (2 points) - -.. Create a function that takes a list of years (or, if you prefer, a list of file locations), as a list of column names, and an `element_code` as input, and returns a dictionary with one entry per year. In the dictionary, for each year, it should have the year as the key and the average value of the specified `element_code` as the value for that year. -.. Test your function for the `element_code` `"TAVG"` and for the range of four years 1880 to 1883 (inclusive), i.e., `range(1880,1884)`. - -[TIP] -==== -- You can EITHER use a list of years or a list of file locations for the input, but please explain all of your work, and be sure to provide documentation about how a person uses your function. Documentation is really important! -- Include column titles `["id","date","element_code","value","mflag","qflag","sflag","obstime"]` -==== - -=== Question 3 (2 points) - -.. Modify the function that you created in Question 2, to include an extra parameter for the month. This function should have the same behavior as in Question 2, but for each year, the function should only use the data from that month of the year. -.. Test your function for the `element_code` `"TAVG"` and for the range of four years 1880 to 1883 (inclusive), i.e., `range(1880,1884)`, and for the month August in each year. - -=== Question 4 (2 points) - -.. 
Create a function that takes a list of years as input, and identifies the year that has the most `qflags` of the type that the user specified. -.. Run the function for years in the range 1880 to 1883, and test it with some various `qflag` values, such as D, G, I, K, L, N, O, S, X. - - -=== Question 5 (2 points) - -.. Explore the dataset files from the `noaa` directory, and create a function of your own design, about something that interests you. Make sure to include a docstring that explains the function's definition. -.. Run your function, and explain the inputs and outputs, so that a user can understand how it works. - - -Project 05 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project05.ipynb`. -* Python file with code and comments for the assignment - ** `firstname-lastname-project05.py` - -* Submit files through Gradescope -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project06.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project06.adoc deleted file mode 100644 index 9390cf059..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project06.adoc +++ /dev/null @@ -1,111 +0,0 @@ -= TDM 10200: Project 6 -- 2024 -**Motivation:** Once we have some data analysis working in Python, we often want to wrap it into a function. Dr Ward usually tests anything that he wrote (usually 5 times), to make sure it works, before wrapping it into a function. Once we are sure our analysis works, if we wrap it into a function, it can usually be easier to use. - - -**Context:** Functions also help us to put our work into bite-size pieces that are easier to understand. The basic idea is similar to functions from R or from other languages and tools. - -**Scope:** functions - -== Reading and Resources - -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -- https://the-examples-book.com/programming-languages/python/writing-functions[Python Functions] - -- https://realpython.com/sort-python-dictionary[sorting a dictionary in Python] - - -== Datasets -/anvil/projects/tdm/data/election/ - -[IMPORTANT] -==== -We added https://the-examples-book.com/programming-languages/python/some-examples-for-TDM-10200-project-6[seven new videos] to help you with Project 6. -==== - -== Questions - -=== Question 1 (2 points) - -[NOTE] -==== -When you first read the project, you might think, WOW!, this project is long, but it isn't actually very long at all. Dr Ward just wanted to make sure that students understand the way to think about functions, namely: You need to *first* make sure that you understand what to do, and then check it a few times, and then (afterwards) write the function, and (finally) verify the function. [It is usually too hard to write the function without (first) understanding the work itself.] 
- -Dr Ward broke all of this down into steps, and the sentences look long, but don't be scared. Dr Ward is just breaking things down into little steps for you, so that the work is easier to do. -==== - -Functions should have simple inputs and should create helpful outputs. That way, a person can use the function to get good things done, without having to remember the details. For the election data, it is hard to remember where the files are located, and what the column names should be, etc. This question will create a useful function that only requires the user to start with 1 year as input, and it returns a dataframe as the output. That dataframe contains all of the election data for that year. - -[loweralpha] -.. First (without a function!) start with a variable called `myyear`. It can be an election year, and it helps if you try this with a few of the early 1980's elections years, so that they data is not-too-big. For instance, try `myyear` with each of the years 1980, 1984, and 1988 (one at a time, not a list). Once you set the value of `myyear`, then make a variable for the path to the data from that year, for instance, `/anvil/projects/tdm/data/election/itcont1984.txt` -.. Read in the data from the year stored in `myyear`. Please note that the data does not have column headers built-in, so you need to add the column headers, as in the "tip" below. Please also note that the elements in the data set has `|` between the data elements, not commas (this is why the file name ends in `.txt` not `.csv` in this case). So you need to use the parameter `delimiter='|'` in your `read_csv` function. -.. Look at the head of the data set, after you read it in. Does it look correct? Does everything make sense? Try this for each of the years, 1980, then 1984, then 1988, one at a time, storing each of these values into the variable `myyear` and then read in the data for that year, and check the head of the data each time. This is good practice for making sure that your work is designed properly. -.. Now that you are sure that your work is OK, make a function called `read_election_year` that takes one parameter called `myyear` as the input, and returns a data frame that contains the data from that election year. Make sure to document your function with a docstring to explain how it works. -+ -[TIP] -==== -This might be helpful for reading in your data! - -[source, python] ----- -mycolumnnames=["CMTE_ID","AMNDT_IND","RPT_TP","TRANSACTION_PGI","IMAGE_NUM","TRANSACTION_TP","ENTITY_TP","NAME","CITY","STATE","ZIP_CODE","EMPLOYER","OCCUPATION","TRANSACTION_DT","TRANSACTION_AMT","OTHER_ID","TRAN_ID","FILE_NUM","MEMO_CD","MEMO_TEXT","SUB_ID"] - -mydictionarytypes = {"CMTE_ID": str, "AMNDT_IND": str, "RPT_TP": str, "TRANSACTION_PGI": str, "IMAGE_NUM": str, "TRANSACTION_TP": str, "ENTITY_TP": str, "NAME": str, "CITY": str, "STATE": str, "ZIP_CODE": str, "EMPLOYER": str, "OCCUPATION": str, "TRANSACTION_DT": str, "TRANSACTION_AMT": float, "OTHER_ID": str, "TRAN_ID": str, "FILE_NUM": str, "MEMO_CD": str, "MEMO_TEXT": str, "SUB_ID": int} - -myDF = pd.read_csv("/anvil/projects/tdm/data/election/itcont1980.txt", delimiter='|', names=mycolumnnames, dtype=mydictionarytypes) - -myDF['TRANSACTION_DT'] = pd.to_datetime(myDF['TRANSACTION_DT'], format="%m%d%Y") ----- - -It might also help to have 2 cores for this project. You might be able to do it with 1 core, but it is probably easier for you with 2 cores. -==== - -=== Question 2 (2 points) - -.. First (without a function!) 
start with a variable called `myyear`, such as 1980, and find the number of (unique) committees that appear in the `CMTE_ID` column in that year. Then do the same for the year 1984, and then do this again for 1988. Print your results for each of these three years in separate cells. -.. Now that you have part 2a working well, put your work from question 2a into a function. Namely, create a function called `committees_function` that accepts a year as input, and returns the number of (unique) committees that appear in the `CMTE_ID` column in that year. Use the function designed in Question 1 to help you accomplish this work. -.. Test your function for each of the years 1980, 1984, and 1988. How many (unique) committees appear in each of these 3 individual years? The output from this question should show, for each year, how many (unique) committees appear in the data for each of those 3 years. The output for each of these 3 years should agree with your output from question 2a. - -[WARNING] -==== -Do not add the results from the three years. Instead, use three separate cells to show the three separate years of output. -==== - - -=== Question 3 (2 points) - -The goal of this question is to find the top 5 states in a given year, according to the total (sum) of the values in the `TRANSACTION_AMT` column. - -.. First (without a function!) start with a variable called `myyear`, such as 1980, and find the total (sum) of the values from the `TRANSACTION_AMT` column for each state in the data set. You only need to print the top 5 results (i.e., the top 5 states and the total of the transaction amounts from those states) for 1980. Then do this again for 1984, and then do this again for 1988. -.. Now that you have your work from Question 3a working well, build a function called `top_five_states`. This function should take 1 year as input, and should return the top 5 states and the total (sum) of the values for each of the 5 states, from the `TRANSACTION_AMT` column (for that state). - - -=== Question 4 (2 points) - -The goal of this question is to identify the top 5 employers, according to the total (sum) of the values from the `TRANSACTION_AMT` column for each employer. - -.. First find the top 5 employers in each year 1980, 1984, and 1988, and print the top 5 for each of those years. Do this *before* you make a function. -.. Once that is working, then build a function called `top_employers` that returns the top 5 employers in each year 1980, 1984, and 1988. Your results from question 4b should agree with your results from question 4a. - -=== Question 5 (2 points) - -.. Experiment with the election data for the same 3 years as above (1980, 1984, 1988). Identify something that you find interesting each of those 3 years *before* you build a function. -.. Wrap your interesting working into a function, and make sure that it matches your work from question 5a, for each of the 3 years. - -Project 06 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project06.ipynb`. -* Python file with code and comments for the assignment - ** `firstname-lastname-project06.py` - -* Submit files through Gradescope -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project07.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project07.adoc deleted file mode 100644 index f83fd21f7..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project07.adoc +++ /dev/null @@ -1,148 +0,0 @@ -= TDM 10200: Project 7 -- 2024 - -**Motivation:** Pandas is a powerful tool to manipulate and analysis data in Python. We can use Pandas to extract information about one or more variables on data frames, sometimes we can group the data according to one variable and summarizing another variable within each of those groups. - -**Context:** Understanding how to use Pandas and be able to develop functions allows for a systematic approach to analyzing data. - -**Scope:** Pandas and functions - -== Reading and Resources - -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -- https://the-examples-book.com/programming-languages/python/writing-functions[Python Functions] -- https://the-examples-book.com/programming-languages/python/pandas-series[pandas Series] -- https://the-examples-book.com/programming-languages/python/pandas-aggregate-functions[pandas aggregation functions] - -== Datasets - -`/anvil/projects/tdm/data/noaa` - -Just as we did in Project 3, please remember to use `header=None` when you read in the data set, and remember to use: - -`names=["id","date","element_code","value","mflag","qflag","sflag","obstime"]` - -[IMPORTANT] -==== -We added https://the-examples-book.com/programming-languages/python/some-examples-for-TDM-10200-project-7[six new videos] to help you with Project 7. -==== - -== Questions - -[TIP] ----- -For all questions in this project, consider only the rows from the NOAA data that are for US records, namely, in which the first field starts with the letters `US`. ----- - -[TIP] -==== -Dr Ward used this code in the videos, which might help you during your work: - -[source,python] ----- -def get_noaa_data (myyear: int) -> pd.DataFrame: - """ - This function accepts a 4-digit year as input, and returns a data frame that contains the NOAA data for that year - - Args: - myyear (int): This is a 4-digit year for which we will load a data frame to be returned. - - Returns: - myDF (pd.DataFrame): This is the data frame that contains the NOAA data for that year - """ - myfilepath = f'/anvil/projects/tdm/data/noaa/{myyear}.csv' - mycolumnnames=["id","date","element_code","value","mflag","qflag","sflag","obstime"] - myDF = pd.read_csv(myfilepath, names=mycolumnnames) - myDF['date'] = pd.to_datetime(myDF['date'], format="%Y%m%d") - return myDF ----- -==== - - -=== Question 1 (2 points) - - -[loweralpha] -.. For the year 1880, find how many rows are for US records. Then do this again (in separate cells) for the years 1881, 1882, 1883. -.. Now make a dictionary that has these years as the four keys, and has the counts (that you discovered in question 1a) as the values. -.. 
Now that you know how to do the work from questions 1a and 1b, wrap your work into a function that accepts a list of years as input, and returns a dictionary as the output. -.. Run the function for the years in the range from 1880 to 1883 (inclusive), and make sure that the results agree with your results from question 1a and 1b. - -[TIP] -==== -The following sample code can be used to get US records - -[source,python] ----- -us_records = df[df['id'].str.startswith('US')] ----- - -The shape of a data frame contains the number of rows and the number of columns. - -The range of years to test is `range(1880,1884)`; remember that this will include 1880 to 1883 and will not include the year 1884 (Python always drops the last year in a range. -==== - - -=== Question 2 (2 points) - -.. Revise your work from question 1b (before you built your function!), so that the dictionary is in reverse order. The code from the tip might help. Test your work on the years 1880 through 1883, and make sure that the resulting dictionary is in descending order. -.. Now that you know how to do the work from questions 2a, wrap your work from question 2a into a function that accepts a list of years as input, and returns a dictionary in descending order as the output. -.. Run the function for the years in the range from 1880 to 1883 (inclusive), and make sure that the results agree with your results from question 2a. - - -[TIP] -==== - -This code takes a dictionary called `mydict` and puts it into descending sorted order. - -`mydescendingdict = dict([key, mydict[key]] for key in sorted(mydict, key=mydict.get, reverse=True))` - -==== - - -=== Question 3 (2 points) - -[loweralpha] -.. For the year 1880, find how many rows (which are for US records) are `SNOW` days with a positive amount of snowfall. In other words look for rows with three conditions: The first field starts with `US`, and the `element_code` is `SNOW`, and the `value` is strictly positive. Then do this again (in separate cells) for the years 1881, 1882, 1883. -.. Now make a dictionary that has these years as the four keys, and has the counts (that you discovered in question 3a) as the values. -.. Now that you know how to do the work from questions 3a and 3b, wrap your work into a function that accepts a list of years as input, and returns a dictionary as the output. -.. Run the function for the years in the range from 1880 to 1883 (inclusive), and make sure that the results agree with your results from question 3a and 3a. - - -=== Question 4 (2 points) - -[loweralpha] -.. For the year 1880, consider only the types of rows from question 3a (which are for US records, with `element_code` as `SNOW`, and with `value` as strictly positive). Group those rows according to the `id`, and determine which `id` has the largest number of snowfall days. Then do this again (in separate cells) for the years 1881, 1882, 1883. -.. Now make a dictionary that has these years as the four keys, and has the `id` values with the largest number of snowfall days in each of these individual years. -.. Now that you know how to do the work from questions 4a and 4b, wrap your work into a function that accepts a list of years as input, and returns a dictionary as the output. -.. Run the function for the years in the range from 1880 to 1883 (inclusive), and make sure that the results agree with your results from question 4a and 4a. - - -=== Question 5 (2 points) - -[loweralpha] -.. 
For the year 1880, consider only the types of rows from question 3a/4a (again, which are for US records, with `element_code` as `SNOW`, and with `value` as strictly positive). Group those rows according to the `id`, and determine which `id` has the largest *amount* of snowfall (in other words, `sum` the snowfall amounts for each `id`). Then do this again (in separate cells) for the years 1881, 1882, 1883. -.. Now make a dictionary that has these years as the four keys, and has the `id` values with the largest *amount* of snowfall in each of these individual years. -.. Now that you know how to do the work from questions 5a and 5b, wrap your work into a function that accepts a list of years as input, and returns a dictionary as the output. -.. Run the function for the years in the range from 1880 to 1883 (inclusive), and make sure that the results agree with your results from question 5a and 5b. - - - - - -Project 07 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project07.ipynb`. -* Python file with code and comments for the assignment - ** `firstname-lastname-project07.py` -* Submit files through Gradescope -==== - - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project08.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project08.adoc deleted file mode 100644 index c3f664e9b..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project08.adoc +++ /dev/null @@ -1,97 +0,0 @@ -= TDM 10200: Project 8 -- 2024 - -**Motivation:** We will continue to introduce functions and visualization - -**Context:** Write functions with visualizations - -**Scope:** python, functions, pandas, matplotlib, Parquet columnar storage file format - - -== Reading and Resources - -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -- https://the-examples-book.com/programming-languages/python/writing-functions[Python Functions] -- https://the-examples-book.com/programming-languages/python/pandas-series[pandas Series] -- https://the-examples-book.com/programming-languages/python/pandas-aggregate-functions[pandas aggregation functions] - - -== Datasets - -`/anvil/projects/tdm/data/whin/weather.parquet` - -[IMPORTANT] -==== -We added https://the-examples-book.com/programming-languages/python/some-examples-for-TDM-10200-project-8[eleven new videos] to help you with Project 8. -==== - -[WARNING] -==== -You need to use 3 cores for your Jupyter Lab session for Project 8 this week. -==== - -[TIP] -==== -You can use `pd.read_parquet` to read in a `parquet` file (very similarly to how you use `pd.read_csv` to read in a `csv` file!) - -You can use `pd.set_option('display.max_columns', None)` if you want to see all of the columns in a very wide data frame. -==== - -== Questions - - -=== Question 1 (2 points) - -Read the file into a DataFrame called `myDF`. - -.. 
Convert the `observation_time` column to into a `datetime` type. -.. Create 3 new columns for the `year`, `month` and `day`, based on the column `observation_time`. -.. For a given `station_id`, calculate the average month-and-year-pair temperatures (from the column `temperature`) for that `station_id`. Try this for a few different `station_id` values. -.. Now write a function called `get_avg_temp` that takes one `station_id` as input and returns the average month-and-year-pair temperatures (associated with that specific `station_id`). Make sure that the results of your function match with your work from question 1c. - -=== Question 2 (2 points) - -For this function, be sure to `import matplotlib.pyplot`. - -We will use the function from question 1d to make some line plots. - -.. For a given `station_id`, create a line plot, with one line for each year. Try this for a few different `station_id` values. -.. Now that you are sure your analysis from 2a works well, wrap your work from question 2a into a function that takes a `station_id` as input, and creates a line plot, with one line for each year (for the average month-and-year-pair temperatures from that `station_id`). - -=== Question 3 (2 points) - -.. Revisit the function from question 1d, to find the maximum temperature (instead of the average temperature) in each month-and-year-pair, for a given station. As before, you should test this for several examples before you build the function, and then make sure your function matches your examples. -.. Revisit the function from question 2b, to make a function that takes one `station_id` as input and it creates a bar plot (instead of a line plot), depicting the maximum temperature in each month-and-year-pair (instead of the average temperature). - -[TIP] -==== -Your work from question 3b can utilize the function you build in question 3a. -==== - -=== Question 4 (2 points) - -.. For a given `station_id`, create a box plot that shows the month-by-month wind speeds in 2020 for that specified `station_id`. Try this for a few different `station_id` values. -.. Write a function that takes a `year` (not necessarily 2020) and a `station_id` as inputs, and the function creates a box plot about the month-by-month wind speeds in that specific year (not necessarily 2020), at the specified `station_id`. - - -=== Question 5 (2 points) - -.. Explore the dateset and find something interesting, like (for instance) something about the wind speed, pressure, soil temperature, etc., and do some analysis. -.. Make a visualization that shows one or more plots about your analysis. -.. Wrap the work 5a and 5b into a function that can be used to create the visualizations in a systematic way, and test the function with the same inputs used in 5a and 5b. - -Project 08 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project08.ipynb`. -* Python file with code and comments for the assignment - ** `firstname-lastname-project08.py` - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project09.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project09.adoc deleted file mode 100644 index 32f587790..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project09.adoc +++ /dev/null @@ -1,111 +0,0 @@ -= TDM 10200: Project 9 -- Spring 2024 - - -**Motivation:** Working with pandas can be fun! Learning how to manipulate and clean up data in pandas is a helpful tool to have in your tool belt! Dive into pandas to transform and analyze data, equipping you with key data science skills. - -**Context:** Hopefully, most students are feeling pretty comfortable now, with building functions and using pandas. In this project, we will continue working with pandas. We want to get better at analyzing big datasets and solving problems with data. - -**Scope:** python, pandas - -== Reading and Resources - -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here] -- https://www.digitalocean.com/community/tutorials/pandas-dropna-drop-null-na-values-from-dataframe[pandas drop Null from DataFrame] - -== Dataset - -`/anvil/projects/tdm/data/whin/weather.csv` - - -[IMPORTANT] -==== -We added https://the-examples-book.com/programming-languages/python/hints-for-TDM-10200-project-9[five new videos] to help you with Project 9. -==== - -[WARNING] -==== -You need to use 2 cores for your Jupyter Lab session for Project 9 this week. -==== - -[TIP] -==== -You can use `pd.set_option('display.max_columns', None)` if you want to see all of the columns in a very wide data frame. - -This data is similar to last week's data, from Project 8, but we are using the csv file instead of the parquet file this week. - -Please take a look at the -`.head()`, `.shape`, `.dtypes`, `.info()`, `.describe()`, etc., to remind yourself about this data. Since we have explored the WHIN weather data set a little bit already, we will dive right into some explorations. - -We will assume that you read the data into a data frame called `myDF`. -==== - - - -== Questions - -=== Question 1 (2 points) - -.. Use the method `value_counts()` to get the number of records for each station. -.. Use the method `groupby()` to get the number of records for each station. -.. Explain the difference between the methodologies of using these two methods. We love getting your feedback. Which method do you prefer? - - -=== Question 2 (2 points) - -.. Find out how many null records exist in `myDF`, within each individual column. (Your answer should specify, for each column, how many null records are in that column.) -.. Now count the total number of null values in the entire data frame `myDF`. (In other words, add up the values from all of the counts in part 2a.) -.. Drop rows with any null values in `myDF`. Save the resulting cleaned data set into a new DataFrame called `myDF_cleaned`. -.. Just to make sure that you did this properly, check `myDF_cleaned` carefully: Are there any null values remaining in `myDF_cleaned`? (There should not be.) How many rows and columns are in `myDF_cleaned`? - -[TIP] -==== -There are a variety of ways to approach this question. - -- The `isnull()` method is useful to find null records. -- The `sum()` and `dropna()` methods might be useful on this question too. -==== - - -=== Question 3 (2 points) - -.. Go back to the original data frame `myDF`. 
Create a new data frame, by removing all rows from `myDF` in which the column `temperature` is a null value.
-.. Combine the columns `latitude` and `longitude` into a new column called `location`, with '_' in between.
-.. Group the data by the `location` column, find the average temperature for each location, and print your results: each line that you print should have 1 location and 1 average temperature for that location.
-
-[TIP]
-====
-- You may refer to https://www.statology.org/pandas-combine-two-columns/[combine columns]
-- The data in the new `location` column should be strings.
-====
-
-[TIP]
-====
-- Drop only the rows from the original data frame `myDF` that have a null value in the `temperature` column.
-====
-
-=== Question 4 (2 points)
-
-.. Wrap your work from Question 3 into a function. This function should take a data frame as a parameter, and should drop records with a null value in the data frame's `temperature` column. The function should also create a new `location` column, and should calculate the average temperature (grouped by location). The function should return the Series of average temperatures (grouped by location).
-
-=== Question 5 (2 points)
-
-.. Now considering the column called `wind_gust_speed_mph` (instead of temperature), do the work from questions 3 and 4 again.
-.. Which location has the largest average value of `wind_gust_speed_mph`?
-
-
-Project 09 Assignment Checklist
-====
-* Jupyter Lab notebook with your code, comments and output for the assignment
- ** `firstname-lastname-project09.ipynb`.
-* Python file with code and comments for the assignment
- ** `firstname-lastname-project09.py`
-
-* Submit files through Gradescope
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project10.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project10.adoc
deleted file mode 100644
index 2e62a61b3..000000000
--- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project10.adoc
+++ /dev/null
@@ -1,155 +0,0 @@
-= TDM 10200: Project 10 -- Spring 2024
-
-**Motivation:** NumPy is the foundation that Pandas is built on. Mastering NumPy's numerical operations will deepen your understanding of numerical computing and data analysis.
-
-**Context:** Hopefully you have a solid foundation and understanding of data analysis in Python, and a good introduction to Pandas. In this project we will delve into NumPy, enhancing your skill set for high-performance computing, in situations where you do not use Pandas.
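-
-As a tiny preview (using a small made-up DataFrame here, not the flight data used below), here is roughly how a pandas column gets handed off to NumPy; the questions below use `to_numpy()`, `nan_to_num()`, and `mean()` in exactly this way.
-
-[source,python]
-----
-import numpy as np
-import pandas as pd
-
-# A small made-up DataFrame, just to illustrate the pandas-to-NumPy hand-off.
-myDF = pd.DataFrame({'DepDelay': [5.0, np.nan, 32.0, -3.0]})
-
-mydelays = myDF['DepDelay'].to_numpy()   # pandas Series -> NumPy array
-mydelays = np.nan_to_num(mydelays)       # replace the missing value with 0
-print(mydelays.mean())                   # plain NumPy aggregation, no DataFrame needed
-----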
- -**Scope:** Python, pandas, NumPy - -== Reading and Resources - -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here] -- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html[Pandas read_csv] -- https://numpy.org/devdocs/user/index.html[NumPy user guide] -- https://numpy.org/devdocs/reference/index.html[Numpy reference] - -== Dataset - -`/anvil/projects/tdm/data/flights/2014.csv` - -[WARNING] -==== -You need to use 2 cores for your Jupyter Lab session for Project 10 this week. -==== - -[TIP] -==== -You can use `pd.set_option('display.max_columns', None)` if you want to see all of the columns in a very wide data frame. -==== - -[IMPORTANT] -==== -We added https://the-examples-book.com/programming-languages/python/hints-for-TDM-10200-project-10[six new videos] to help you with Project 10. BUT the example videos are about a data set with beer reviews. You need to (instead) work on the flight data given here: `/anvil/projects/tdm/data/flights/2014.csv` -==== - - -== Questions - -=== Question 1 (2 points) - -[loweralpha] -.. The dataset is 2.5 G. For this project, we will only need the following columns, so let us create a DataFrame with only those columns with corresponding data types. - -[source,python] ----- -cols = [ - 'DepDelay', 'ArrDelay', 'Distance', - 'CarrierDelay', 'WeatherDelay', - 'DepTime', 'ArrTime', 'Diverted', 'AirTime' -] - -col_types = { - 'DepDelay': 'float64', - 'ArrDelay': 'float64', - 'Distance': 'float64', - 'CarrierDelay': 'float64', - 'WeatherDelay': 'float64', - 'DepTime': 'float64', - 'ArrTime': 'float64', - 'Diverted': 'int64', - 'AirTime': 'float64' -} ----- -[TIP] -==== -- You may refer to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html[pandas.read_csv] to know more about read only specific columns -==== - -=== Question 2 (2 points) -.. Use `to_numpy()` to create a numpy array called `mydelays`, containing the information from the column `DepDelay`. -.. Display the shape and data type in `mydelays`. -.. Use `nan_to_num()` to replace all null values in `mydelays` to 0. -.. It can be helpful to know how to manipulate the values in an array! Find the average time in `mydelays` by calculating the numpy `mean()` of this array. Afterwards, add 15 minutes to all of the departure delay times stored in `mydelays`. Finally, use the numpy `mean()` method again, to calculate and display the average of the updated values in `mydelays`. How do these two averages compare? - -[NOTE] -==== -- The output should look something like this: - -.output ----- -The average Departure Delay before adding 15 minutes is: ....... - -The average Departure Delay after adding 15 minutes is: ....... ----- -==== - -=== Question 3 (2 points) - -.. Calculate and display the maximum arrival delay and the minimum arrival delay. - -[NOTE] -==== -- The output should look something like this: - -.output ----- -Max Arrival Delay: ...... minutes -Min Arrival Delay: ...... minutes ----- -==== - - -=== Question 4 (2 points) - -The motivation for questions 4 and 5 is to compare the times needed for calculations in pandas vs. numpy. - -In this question, first solve the following 3 questions using pandas (only). - -.. Create a data frame named `delayed_flights` that contains the information about the flights that satisfy the condition "departure delay > 60 minutes or arrival delay > 60 minutes". -.. 
Calculate the average distance for the flights that you found in question 4a, by taking a mean of the `Distance` column from the pandas data frame. -.. Display the time needed to calculate the time used for the calculation. - -[TIP] -==== -- You may import the `time()` library to calculate the time used, as follows: - -[source,python] ----- -import time -start_time = time.time() -.... -#your program here -.... -end_time = time.time() -print(f"Used time is {end_time - start_time}") ----- -==== - -=== Question 5 (2 points) - -Please using `numpy` methods to re-create your work from Question 4, as follows: - -.. Create 3 numpy arrays for the `DepDelay`, `ArrDelay`, and `Distance` data. -.. Filter the numpy array with the `Distance` stored in it, so that you have only the Distances that satisfy the condition that 'departure delay > 60 minutes or arrival delay > 60 minutes' -.. Use numpy `mean()` to calculate the average distances from question 5b. (Your solution should be the same as the average you obtained in question 4b.) -.. How long does the program take to get the average? -.. Please state your understanding of pandas vs. numpy from Question 4 and 5 in one or two sentences. - - -Project 10 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project10.ipynb`. -* Python file with code and comments for the assignment - ** `firstname-lastname-project10.py` - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project11.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project11.adoc deleted file mode 100644 index e49c19e1b..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project11.adoc +++ /dev/null @@ -1,95 +0,0 @@ -= TDM 10200: Project 11 -- Spring 2024 - - -**Motivation:** Learning classes in Python - -**Scope:** Object Oriented Python - -**Scope:** Python, python class, pandas - -== Reading and Resources - -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here] -- https://the-examples-book.com/programming-languages/python/classes[python classes] -- https://www.programiz.com/python-programming/class[python objects and classes] -- https://docs.python.org/3/library/datetime.html[python datetime] - -== Dataset - -`/anvil/projects/tdm/data/flights/2014.csv` - -[WARNING] -==== -You need to use 2 cores for your Jupyter Lab session for Project 11 this week. -==== - -[TIP] -==== -You can use `pd.set_option('display.max_columns', None)` if you want to see all of the columns in a very wide data frame. -==== - -[IMPORTANT] -==== -We added https://the-examples-book.com/programming-languages/python/hints-for-TDM-10200-project-11[six new videos] to help you with Project 11. BUT the example videos are about a data set with beer reviews. 
You need to (instead) work on the flight data given here: `/anvil/projects/tdm/data/flights/2014.csv` -==== - - -== Questions - -=== Question 1 (2 points) - -[loweralpha] - -.. Create a class named `Flight`, which contains attributes for the flight number, origin airport ID, destination airport ID, departure time, arrival time, departure delay, and arrival delay. -.. Add a function called `get_arrdelay()` to the class, which gets the arrival delay time. - -=== Question 2 (2 points) - -.. Create a DataFrame named `myDF`, to store data from the `2014.csv` data set. It suffices to import (only) the columns listed below, and to (only) read in the first 100 rows. Although we provide the `columns_to_read`, please make (and use) a dictionary of `col_types` like we did in Question 1 of Project 10. -.. Load the data from `myDF` into the Flight class instances. (When you are finished, you should have a list of 100 Flight instances.) - -[source,python] ----- -columns_to_read = [ - 'DepDelay', 'ArrDelay', 'Flight_Number_Reporting_Airline', 'Distance', - 'CarrierDelay', 'WeatherDelay', - 'DepTime', 'ArrTime', 'Origin', - 'Dest', 'AirTime' -] ----- - - -=== Question 3 (2 points) - -.. Create an empty dictionary named `delays_dest`. Then use a for loop to assign values to `delays_dest` from the 100 Flight objects. -.. Calculate the average arrival delay time for each destination airport, and save the result to a dictionary named `average_delays` - -=== Question 4 (2 points) - -.. Create a function called `arr_avg_delays` based on the steps from Question 3. This function should have a collection of Flight objects as the input. The function should output a dictionary containing the average arrival delays for each destination airport. -.. Run the function using the 100 Flight instances from Question 2 as input. - -=== Question 5 (2 points) - -.. Update the class `Flight` to add a method named `get_depdelay()` to the class. -.. Create a function called `dep_avg_delays`, similar to the `arr_avg_delays`. This function should have a collection of Flight objects as the input. It should use the average departure delay (instead of the average arrival delays), and it should do this for each origin airport (instead of each destination airport). -.. Run the function using the 100 Flight instances from Question 2 as input. - - - -Project 11 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project11.ipynb`. -* Python file with code and comments for the assignment - ** `firstname-lastname-project11.py` - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-====
diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project12-teachingprograming_backup.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project12-teachingprograming_backup.adoc
deleted file mode 100644
index eb14a92d1..000000000
--- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project12-teachingprograming_backup.adoc
+++ /dev/null
@@ -1,59 +0,0 @@
-= TDM 10200: Project 12 -- Spring 2024
-
-
-**Motivation:** Data analysis with Pandas and NumPy
-
-**Scope:** python, pandas, numpy, apply, lambda, pandas query
-
-Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-
-== Dataset(s)
-
-`/anvil/projects/tdm/data/amazon`
-
-== Questions
-
-=== Question 1 (2 points)
-
-[loweralpha]
-.. Read in the `twoReviews.csv` file from the last project and name it `two_dfs`. Use `rename()`, with a dictionary, to rename the column names as follows. (You may refer to the pandas documentation for `rename()` and for Python dictionaries.)
-... From "HelpfulnessNumerator" to "HelpfulnessNum"
-... From "HelpfulnessDenominator" to "HelpfulnessDen"
-
-=== Question 2 (2 points)
-
-.. Use `rename()` and a lambda function to rename the two columns again: make the names all lowercase and remove "fulness" from each. (You may refer to the pandas documentation for `apply()` and for lambda functions.)
-... From "HelpfulnessNum" to "helpfulNum"
-... From "HelpfulnessDen" to "helpfulDen"
-
-=== Question 3 (2 points)
-
-.. Use `apply()` and a lambda function to create a new data field `text_len` that contains the text length for each review, and also find the average score for each product.
-.. Use `apply()` and a lambda function to create a new data field `help_ratio`, calculated from `helpfulNum` and `helpfulDen`.
-.. Create a new DataFrame that only contains rows with `help_ratio` greater than 0.8 and less than 0.9. How many records do you get?
-
-=== Question 4 (2 points)
-
-.. Create a `year` field that only contains the year from the `new_time` field.
-.. Use a pandas query to select only the year 2012 from `two_dfs`, assign the result to `df_2012`, and then drop the column `year`.
-.. Use a pandas query to select only the rows with score less than 3 from `df_2012`, and assign the result to `df_2012_3u`.
-
-Project 12 Assignment Checklist
-====
-* Jupyter Lab notebook with your code, comments and output for the assignment
- ** `firstname-lastname-project12.ipynb`.
-* Python file with code and comments for the assignment
- ** `firstname-lastname-project12.py`
-
-* Submit files through Gradescope
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project12.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project12.adoc
deleted file mode 100644
index f34a98e32..000000000
--- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project12.adoc
+++ /dev/null
@@ -1,91 +0,0 @@
-= TDM 10200: Project 12 -- Spring 2024
-
-
-**Motivation:** Learning classes in Python
-
-**Context:** Object Oriented Python
-
-**Scope:** Python, python class, pandas
-
-== Reading and Resources
-
-- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]
-- https://the-examples-book.com/programming-languages/python/classes[python classes]
-- https://www.programiz.com/python-programming/class[python objects and classes]
-- https://docs.python.org/3/library/datetime.html[python datetime]
-
-== Dataset
-
-`/anvil/projects/tdm/data/flights/2014.csv`
-
-[WARNING]
-====
-You need to use 2 cores for your Jupyter Lab session for Project 12 this week.
-====
-
-[TIP]
-====
-You can use `pd.set_option('display.max_columns', None)` if you want to see all of the columns in a very wide data frame.
-====
-
-== Questions
-
-=== Question 1 (2 points)
-
-[loweralpha]
-
-.. In the previous project, you created a class named `Flight`, which contains attributes for the flight number, origin airport ID, destination airport ID, departure time, arrival time, departure delay, and arrival delay. Now let us use this class as a base class. Create a new subclass called `ScheduledFlight`. Add 2 more attributes to this new subclass: `CRSDepTime` and `CRSArrTime`.
-.. Add a method called `is_ontime()` to the class, which returns a boolean value that indicates if the flight departs on time and arrives on time.
-
-=== Question 2 (2 points)
-
-.. Create a DataFrame named `myDF`, to store data from the `2014.csv` data set. It suffices to import (only) the columns listed below, and to (only) read in the first 100 rows. Although we provide the `columns_to_read`, please make (and use) a dictionary of `col_types` like we did in Question 1 of Project 10.
-.. Load the data from `myDF` into `ScheduledFlight` class instances. (When you are finished, you should have a list of 100 `ScheduledFlight` instances.)
-
-[source,python]
-----
-columns_to_read = [
-    'DepDelay', 'ArrDelay', 'Flight_Number_Reporting_Airline', 'Distance',
-    'CarrierDelay', 'WeatherDelay', 'CRSDepTime', 'CRSArrTime',
-    'DepTime', 'ArrTime', 'Origin',
-    'Dest', 'AirTime'
-]
-----
-
-
-=== Question 3 (2 points)
-
-.. Create an empty dictionary named `ontime_count`. Then use a for loop to assign values to `ontime_count` from the 100 `ScheduledFlight` objects.
-.. Calculate the total number of flights that were on time, for each destination airport.
-
-=== Question 4 (2 points)
-
-.. Add a method called `is_delayed()` to the class that indicates if the flight was delayed (either had a departure delay or an arrival delay).
-.. Calculate the total number of delayed flights, for each destination airport.
-
-=== Question 5 (2 points)
-
-.. Create a subclass of your own, with at least one method, and then use the dataset to get some meaningful information that uses this subclass.
-
-
-Project 12 Assignment Checklist
-====
-* Jupyter Lab notebook with your code, comments and output for the assignment
- ** `firstname-lastname-project12.ipynb`.
-* Python file with code and comments for the assignment - ** `firstname-lastname-project12.py` - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project13.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project13.adoc deleted file mode 100644 index 7b7f840e2..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project13.adoc +++ /dev/null @@ -1,166 +0,0 @@ -= TDM 10200: Project 13 -- Spring 2024 - -**Motivation:** Flask is a web application framework. It provides tools, libraries, and technologies to build a web application. - -**Context:** Create simple conceptual webpage using `Flask` - -**Scope:** Python, Flask, Visual Studio Code - -.Learning Objectives -**** -- Create a development environment to building a web application on Anvil -- Develop skills and techniques to create a webpage using `Flask` (and also using `Visual Studio Code`) -**** - -== Readings and Resources - -[NOTE] -==== -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -- https://the-examples-book.com/starter-guides/tools-and-standards/vscode[Visual Studio Code Starter Guide] -- https://code.visualstudio.com/docs/introvideos/codeediting[Code Editing in Visual Studio Code] -- https://flask.palletsprojects.com/en/3.0.x/tutorial/[flask tutorial] -==== - -== Questions - -=== Question 1 (2 points) - -[loweralpha] - -.. Following the instructions on https://the-examples-book.com/starter-guides/tools-and-standards/vscode[Visual Studio Code Starter Guide], launch VS Code on Anvil -.. Following the instructions on https://the-examples-book.com/starter-guides/tools-and-standards/vscode[Visual Studio Code Starter Guide], including the Initial Configuration of VS Code, and install the Python extension for Visual Studio -.. Following the instructions on https://the-examples-book.com/starter-guides/tools-and-standards/vscode[Visual Studio Code Starter Guide], select the Python interpreter path to `/anvil/projects/tdm/apps/lmodbin/python-seminar/python3` -.. Open a new terminal in Visual Studio Code. Show the current path for the Python interpreter by typing `which python3` (it should show `/anvil/projects/tdm/apps/lmodbin/python-seminar/python3` for the output). - -(You can stop reading when you get to the section on "Debugging Python Code" in the starter guide.) - -[TIP] -==== -For Question 1a, follow the sections of "How can I launch VS Code on Anvil?" Choose the following options: - - - Allocation: "cis220051" - - CPU Cores: 1 - - Starting Directory: "Home Directory" - -If you get "Do you trust the authors of the files in this folder" window pop up, click "Yes, I trust the authors" -==== - -[TIP] -==== -Refer to https://code.visualstudio.com/docs/introvideos/codeediting[Code Editing in Visual Studio Code] to see how to edit and run code -==== - -=== Question 2 (2 points) - -.. 
Create a python file in Visual Studio Code called `helloWorld.py`, simply to create and display a greeting sentence -.. Run the file in the terminal using: `python3 helloWorld.py` - - -=== Question 3 (2 points) - -.. After you setup `Visual Studio Code`, create a simple flask application that only defines a Flask instance; call it `app`. It should look like this (make sure that you get the indenting correct: -+ -[source] ----- -from flask import Flask - -app = Flask(__name__) - -if __name__ == '__main__': - app.run(host='0.0.0.0', port=8050, debug=True) ----- -+ -.. Run the python file - - -[NOTE] -==== -- A popup window will show the message: "Your application running on port 8050 is available...". Click the "Open in Browser" button to open a browser. -- You might need to change the port (say, to 8051, or 8052, etc., depending on how many other students are trying it at the same time!). -- There will be an error displayed, when you click the "Open in Browser", as discussed in Question 4. -==== - -=== Question 4 (2 points) - -.. You should get a 404 page - "Not Found page" for your Question 3, since you have not (yet) defined a route for the home page. The browser likely will say "The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again." Now update the code to add a route for the home page, as shown in the tip below. -+ -[TIP] -==== -- Use a decorator for the home ("/") route, followed by a function you would like to run. It is ok if you just output a greeting message like "Hello World!" This should be added to your file, just before the if statement. - -[source,python] ----- -@app.route('/') -def hello(): - return "Hello World!" ----- -==== -+ -.. Run the python file - -[TIP] -==== -If you already ran something in the Python terminal and you want it to stop running, you can click into the terminal and type Control-C. -==== - -=== Question 5 (2 points) - -.. You should get your message in a webpage in Question 4. Now let us (separately) add an html file, instead of putting the content for the message inside the python file. Create a folder named `templates`, in the same location where your python file is located, and then create a simple html file named `index.html` in the folder, to hold the greeting message. (Or you may create a fancy html page, but you do not need to make it fancy!) -.. Update your python file root decorator to render the information to the webpage from `index.html` -+ -[TIP] -==== -- Flask's function called `render_template` is useful. You can import it by the following statement -[source,python] ----- -from flask import Flask -from flask import render_template ----- -and you can change your definition of `hello` like this: -[source,python] ----- -def hello(name=None): - return render_template('index.html', name=name) ----- -==== -+ -.. Run the python file - -[TIP] -==== -Your `index.html` needs to be inside a `templates` directory, and that `templates` directory needs to be in the same place where your application is saved. For instance, Dr Ward's file is located here: -`/home/x-mdw/templates/index.html` - -For instance, Dr Ward's html page looks like this: - -[source] ----- - - -The head of my example page. - - -Greetings from Dr Ward and Cookie Monster! 
- - ----- -==== - - - -Project 13 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project13.ipynb` -* Python file with code and comments for the assignment - ** `firstname-lastname-project13.py` - -* Submit files through Gradescope -==== -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project14-teachingprograming.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project14-teachingprograming.adoc deleted file mode 100644 index 2b487d4e6..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project14-teachingprograming.adoc +++ /dev/null @@ -1,55 +0,0 @@ -= TDM 10200: Project 14 -- Spring 2024 - -**Motivation:** We covered a _lot_ this year! When dealing with data driven projects, it is crucial to thoroughly explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance. As this is our final project for the semester, its primary purpose is survey based. You will answer a few questions mostly by revisiting the projects you have completed. - -**Context:** We are on the last project where we will revisit our previous work to consolidate our learning and insights. This reflection also help us to set our expectations for the upcoming semester - -**Scope:** python, Jupyter Lab, Anvil - - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - - -=== Question 1 (1 pt) - -[loweralpha] - -.. Reflecting on your experience working with different datasets, which one did you find most enjoyable, and why? Discuss how this dataset's features influenced your analysis and visualization strategies. Illustrate your explanation with an example from one question that you worked on, using the dataset. - -=== Question 2 (1 pt) - -.. Reflecting on your experience working with different commands, functions, and packages, which one is your favorite, and why do you enjoy learning about it? Please provide an example from one question that you worked on, using this command, function, or package. - - -=== Question 3 (1 pt) - -.. Reflecting on data visualization questions that you have done, which one do you consider most appealing? Please provide an example from one question that you completed. You may refer to the question, and screenshot your graph. - -=== Question 4 (2 pts) - -.. While working on the projects, including statistics and testing, what steps did you take to ensure that the results were right? Please illustrate your approach using an example from one problem that you addressed this semester. - -=== Question 5 (1 pt) - -.. Reflecting on the projects that you completed, which question(s) did you feel were most confusing, and how could they be made clearer? Please use a specific question to illustrate your points. - -=== Question 6 (2 pts) - -.. 
Please identify 3 skills or topics related to the Python language that you want to learn. For each, please provide an example that illustrates your interests, and the reason that you think they would be beneficial.
-
-
-Project 14 Assignment Checklist
-====
-* Jupyter Lab notebook with your answers and examples. You may just use markdown format for all questions.
- ** `firstname-lastname-project14.ipynb`
-* Submit files through Gradescope
-====
-
-[WARNING]
-====
-_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted.
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project14.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project14.adoc
deleted file mode 100644
index ffecce555..000000000
--- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-project14.adoc
+++ /dev/null
@@ -1,45 +0,0 @@
-= TDM 10200: Project 14 -- Spring 2024
-
-**Motivation:** We covered a _lot_ this year! When dealing with data-driven projects, it is crucial to thoroughly explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance. As this is our final project for the semester, its primary purpose is survey-based. You will answer a few questions, mostly by revisiting the projects you have completed.
-
-**Context:** We are on the last project, where we will revisit our previous work to consolidate our learning and insights. This reflection also helps us to set our expectations for the upcoming semester.
-
-
-== Questions
-
-
-=== Question 1 (2 pts)
-
-.. Reflecting on your experience working with different datasets, which one did you find most enjoyable, and why? Discuss how this dataset's features influenced your analysis and visualization strategies. Illustrate your explanation with an example from one question that you worked on, using the dataset.
-
-=== Question 2 (2 pts)
-
-.. Reflecting on your experience working with different commands, functions, and packages, which one is your favorite, and why do you enjoy learning about it? Please provide an example from one question that you worked on, using this command, function, or package.
-
-
-=== Question 3 (2 pts)
-
-.. While working on the projects, including loops, functions, and classes, what steps did you take to ensure that the results were right? Please illustrate your approach using an example from one problem that you addressed this semester.
-
-=== Question 4 (2 pts)
-
-.. Reflecting on the projects that you completed, which question(s) did you feel were most confusing, and how could they be made clearer? Please use a specific question to illustrate your points.
-
-=== Question 5 (2 pts)
-
-.. Please identify 3 skills or topics related to the Python language that you want to learn in the future. For each, please provide an example that illustrates your interests, and the reason that you think they would be beneficial.
-
-
-Project 14 Assignment Checklist
-====
-* Jupyter Lab notebook with your answers and examples. You may just use markdown format for all questions.
- ** `firstname-lastname-project14.ipynb` -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-projects.adoc b/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-projects.adoc deleted file mode 100644 index d7803ae84..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/10200/10200-2024-projects.adoc +++ /dev/null @@ -1,49 +0,0 @@ -= TDM 10200 - -== Project links - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$10200-2024-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:55pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -[NOTE] -==== -Piazza links remain the same from Fall 2023 to Spring 2024. -==== - -=== Sign up -// Need to replace Piazza Links (Maybe?) - -https://piazza.com/purdue/fall2022/tdm10100[https://piazza.com/purdue/fall2022/tdm10100] - -=== Link -// Need to replace Piazza Links (Maybe?) - -https://piazza.com/purdue/fall2022/tdm10100/home[https://piazza.com/purdue/fall2022/tdm10100/home] - - -== Syllabus - -Navigate to the xref:spring2024/logistics/syllabus.adoc[syllabus]. diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project01.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project01.adoc deleted file mode 100644 index 5300ba579..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project01.adoc +++ /dev/null @@ -1,117 +0,0 @@ -= TDM 20200: Project 1 -- 2024 - -**Motivation:** Extensible Markup Language or XML is a very important file format for storing structured data. Even though formats like JSON, and csv tend to be more prevalent, many, many legacy systems still use XML, and it remains an appropriate format for storing complex data. In fact, JSON and csv are quickly becoming less relevant as new formats and serialization methods like https://arrow.apache.org/faq/[parquet] and https://developers.google.com/protocol-buffers[protobufs] are becoming more common. 
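As a quick, hedged illustration of the kind of structured XML data this project works with, here is a minimal sketch that parses a small, made-up XML snippet with the `lxml` package (the tag names below are invented for illustration only; the project itself uses the `otc` dataset files described later):

[source,python]
----
# Minimal sketch only: the XML string and its tag names are made up for illustration.
import lxml.etree

xml_string = """<catalog>
  <book id="1"><title>First Title</title></book>
  <book id="2"><title>Second Title</title></book>
</catalog>"""

root = lxml.etree.fromstring(xml_string)

print(root.tag)                           # name of the root element's tag

for title in root.xpath("//book/title"):  # xpath returns a list of matching elements
    print(title.text)
----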
- - -**Context:** In this project we will use the `lxml` package in Python. This is a first project focusing on web scraping - -**Scope:** python, XML - -.Learning objectives -**** -- Review and summarize the differences between XML and HTML/CSV. -- Match XML terms to sections of XML demonstrating working knowledge. -**** - -== Readings and Resources - -[NOTE] -==== -https://lxml.de[This] link will show you more information about lxml - -Check out https://thedatamine.github.io/the-examples-book/projects.html#p01-290[this old project] that uses a different dataset -- you may find it useful for this project. - -Please checkout https://the-examples-book.com/programming-languages/python/lxml[this] Example Book lxml introduction. -==== - -[IMPORTANT] -==== -We also added some specific examples about using `lxml` to parse the XML files in the `otc` data set. - -https://the-examples-book.com/programming-languages/python/lxml-otc-examples[`lxml` examples with the `otc` data set] - -These examples can be especially helpful for you as you work on Project 1. - -We also added https://the-examples-book.com/programming-languages/python/lxml-examples-TDM-20200-Project1-Spring-2024[5 more videos for you, which are specifically about Project 1] - -==== - - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/otc/valu.xml` - -== Questions - -=== Question 1 (2 points) - -[loweralpha] -.. Please use the `lxml` package to find and print the name of the root element's tag for the XML file of valu.xml - -=== Question 2 (2 points) - -.. Please use 'xpath' with namespace to find the 'title' element. - -[TIP] -==== -There are 13 title elements altogether. Please show the contents of the very first title element, since this is the title of the page. -==== - -[IMPORTANT] -==== -You will need to use namespace for xpath; otherwise, you won't get the element. -https://lxml.de/xpathxslt.html[This link] will show you more information using about XPath with lxml -==== - -=== Question 3 (2 points) - -.. Please use 'xpath' with namespace to find and list all child elements directly under the 'document' element in the xml file. - -=== Question 4 (2 points) - -.. Please get and list all author elements, including their child elements and attributes. - - -[TIP] -==== -To print an `Element` object, you may refer to the following sample code. You will need to modify the example code to fit to your code. - -[source,python] ----- -print(etree.tostring(my_element, pretty_print=True).decode('utf-8')) ----- -==== - -=== Question 5 (2 points) - -.. Please list all codeSystem attribute values from the file. -.. Please list the codeSystem value for which the 'displayName' attribute contains the string 'DOSAGE'. - -[TIP] -==== -You can use the `.attrib` attribute to access the attributes of an `Element`. It is a dict-like object, so you can access the attributes similarly to how you would access the values in a dictionary. -==== - -[TIP] -==== -https://stackoverflow.com/questions/6895023/how-to-select-xml-element-based-on-its-attribute-value-start-with-heading-in-x/6895629[This] link may help you when figuring out how to select the right elements -==== - -Project 01 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project01.ipynb`. 
-* Python file with code and comments for the assignment - ** `firstname-lastname-project01.py` - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project02.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project02.adoc deleted file mode 100644 index 2073dfd14..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project02.adoc +++ /dev/null @@ -1,158 +0,0 @@ -= TDM 20200: Project 02 -- 2024 - -**Motivation:** Web scraping is the process of taking content off of the internet. Typically this goes hand-in-hand with parsing or processing the data. In general, scraping data from websites has always been a popular topic in The Data Mine. We will use the website of "books.toscrape.com" to practice scraping skills - -**Context:** In the previous project we gently introduced XML and XPath to parse a XML document. In this project, we will introduce web scraping. We will learn some basic web scraping skills using BeautifulSoup. - -**Scope:** Python, web scraping, BeautifulSoup - -.Learning Objectives -**** -- Understand webpage structures -- Use BeautifulSoup to scrape data from web pages -**** - -== Readings and Resources - -[NOTE] -==== -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -- https://www.dataquest.io/blog/tutorial-an-introduction-to-python-requests-library/[This link] will provide you more information about python requests library -- https://www.crummy.com/software/BeautifulSoup/bs4/doc/[This link] will provide you more information about BeautifulSoap -==== - -[IMPORTANT] -==== -Dr Ward created 9 videos to help with this project. You might want to take a look at these videos here: - -https://the-examples-book.com/programming-languages/python/books-to-scrape-examples -==== - -== Questions - -=== Question 1 (2 points) - -[loweralpha] -.. Please use BeautifulSoup to get and display the website's HTML source code https://books.toscrape.com[https://books.toscrape.com] -.. Review the website's HTML source code. What is the title for that webpage? - -[TIP] -==== -You may refer to the following to import libraries, modify the code to fit into yours -[source,python] -import requests -from bs4 import BeautifulSoup -...# define url -response = requests.get(url) -soup = BeautifulSoup(response.content,'html.parser') -==== - -=== Question 2 (2 points) - -.. Please use the BeautifulSoup library to get and display all categories' names from the homepage of the website. - -[TIP] -==== -- Review the page source code, to find "categories" located at the sidebar, under a `div` tag with class `nav-list`. The BeautifulSoup `select` method is useful to get such category, like this: - -[source,python] -soup.select('.nav-list li a') -==== - - -=== Question 3 (2 points) - -.. Now, instead of only getting the names of the categories, get all of the category links from the homepage as well. 
-+ -[TIP] -==== -- Review the homepage source code, to explore where are the category links located. You may use `find` to get the `div` tags. - -[source,python] -soup.find('div',class_ = 'side_categories') - -- Under a `div` tag, the links can be found in the "a" tag. You may use `find_all` to get all category links. -[source,python] -find_all('a') - -- You may refer to the following code, to exclude "books" from the category list, since they are not part of the categories. -- Assume `link` holds the link information for a category. - -[source,python] -link.get('href').startswith("catalogue/category/books/") - -- The format of the output for this question should look like: - ----- -Category: Travel,link: catalogue/category/books/travel_2/index.html -Category: Mystery,link: catalogue/category/books/mystery_3/index.html -.... ----- -==== - -.. Update the code from question 3a to get (only) the links for books with the category "Romance". - -[TIP] -==== -- Output will be like ----- -romance_url is https://books.toscrape.com/catalogue/category/books/romance_8/index.html ----- -==== - -=== Question 4 (2 points) - -.. Use the "Romance" link https://books.toscrape.com/catalogue/category/books/romance_8/index.html[https://books.toscrape.com/catalogue/category/books/romance_8/index.html] from Question 3b to get the webpage source code for the Romance category web page. -.. Display all book titles in the first page of the romance category. - - - -=== Question 5 (2 points) - -If you look at this page: - -http://books.toscrape.com/catalogue/category/books/romance_8/index.html[http://books.toscrape.com/catalogue/category/books/romance_8/index.html] - -you can see, in the lower-right-hand corner, that the link to the second page is: - -https://books.toscrape.com/catalogue/category/books/romance_8/page-2.html[https://books.toscrape.com/catalogue/category/books/romance_8/page-2.html] - -Now temporarily forget that you know this fact! We want you to try to find this page-2 link in the Romance book page. - - -.. Starting with http://books.toscrape.com/catalogue/category/books/romance_8/index.html[http://books.toscrape.com/catalogue/category/books/romance_8/index.html], please find the page 2 link the from Romance category web page using BeautifulSoup. -+ -[TIP] -==== -The following is some sample code, for your reference. - -[source,python] ----- -# need to remove last part from basic url -url= "http://books.toscrape.com/catalogue/category/books/romance_8/index.html" -url=url.rsplit('/',1)[0] -# Assume you get next hyperlink ""category/page-2.html" as the next page, you need to only keep the last part -next_link = next.split("/")[-1] -#Combine -next_url=url+"/"+next_link ----- -==== -.. List the titles of all of the books from the second page of the "Romance" category. - - - -Project 02 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project02.ipynb` -* Python file with code and comments for the assignment - ** `firstname-lastname-project02.py` -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. 
-
-In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project.
-====
\ No newline at end of file
diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project03.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project03.adoc
deleted file mode 100644
index bb28cbbab..000000000
--- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project03.adoc
+++ /dev/null
@@ -1,142 +0,0 @@
-= TDM 20200: Project 03 -- 2024
-
-**Motivation:** Web scraping is the process of taking content off of the internet. Typically this goes hand-in-hand with parsing or processing the data. In general, scraping data from websites has always been a popular topic in The Data Mine. We will continue to use the website "books.toscrape.com" to practice scraping skills.
-
-**Context:** In the previous projects we gently introduced XML and XPath to parse an XML document, and also introduced some basic web scraping skills using `BeautifulSoup`. In this project, we will learn some basic skills to scrape with another library, `selenium`.
-
-**Scope:** Python, web scraping, BeautifulSoup, selenium
-
-.Learning Objectives
-****
-- Understand webpage structures
-- Use selenium to scrape data from web pages
-****
-
-== Readings and Resources
-
-[NOTE]
-====
-- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here].
-- Please check out https://the-examples-book.com/programming-languages/python/selenium[this page from The Examples Book] about selenium.
-====
-
-[IMPORTANT]
-====
-Dr Ward created 3 videos to help with this project.
-
-https://the-examples-book.com/programming-languages/python/more-books-to-scrape-examples
-
-Project 3 is very similar to Project 2 BUT we are using Selenium in this project instead of BeautifulSoup.
-====
-
-== Questions
-
-In this project, we will re-create the results from Project 2, BUT we will use Selenium instead of BeautifulSoup.
-
-=== Question 1 (2 points)
-
-[loweralpha]
-.. Please use selenium to get and display the HTML source code of the website https://books.toscrape.com[https://books.toscrape.com]
-.. Review the website's HTML source code. What is the title of that webpage?
-
-[TIP]
-====
-- You may import the following libraries/modules. For this project, we assume that you are using the Firefox browser. (If you are not using Firefox, please download and use Firefox for this project.)
-[source,python]
-from selenium import webdriver
-from selenium.webdriver.common.by import By
-from selenium.webdriver.firefox.service import Service
-from webdriver_manager.firefox import GeckoDriverManager
-from selenium.webdriver.firefox.options import Options
-====
-[TIP]
-====
-- You may use the following options to configure a Firefox instance so that it runs efficiently.
-[source,python]
-firefox_options = Options()
-firefox_options.add_argument("--headless")
-firefox_options.add_argument("--disable-extensions")
-firefox_options.add_argument("--no-sandbox")
-firefox_options.add_argument("--disable-dev-shm-usage")
-====
-[TIP]
-====
-- Use this code to create a selenium driver:
-[source,python]
-driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()),options=firefox_options)
-====
-
-=== Question 2 (2 points)
-
-.. Please use the selenium library to get and display all categories' names from the homepage of the website.
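If it helps to see how the pieces fit together, here is one possible sketch (not the only approach); it assumes the imports and the `driver` object from Question 1 and borrows the CSS selector suggested in the tip that follows:

[source,python]
----
# One possible sketch, reusing the imports and `driver` from Question 1.
# The CSS selector comes from the tip below; adjust it if the page markup differs.
driver.get("https://books.toscrape.com")

# each matching <a> element in the sidebar corresponds to one category
for link in driver.find_elements(By.CSS_SELECTOR, ".nav-list>li>ul>li>a"):
    print(link.text.strip())   # print just the category name
----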
- -[TIP] -==== -- Similar to project 2, you will need to explore the webpage source code, to find the information that you need. -- You may use find_elements() with By.CSS_SELECTOR, to explore the page source, the category information located under ".nav-list>li>ul>li>a", like (for instance) the following sample code, or you may also use other ways to get information. Selenium gives you several ways to find the content that you need. -[source,python] -driver.find_elements(By.CSS_SELECTOR,".nav-list>li>ul>li>a" ) - -- You may use a for loop to iterate over all of the categories, and print only the category names. -==== - -=== Question 3 (2 points) - -.. Now, instead of only getting the names of the categories, get all of the category links from the homepage as well. - -.. Update the code from question 3a to get (only) the links for books with the category `Romance`. - -[TIP] -==== -- The output will be like: ----- -romance_url is https://books.toscrape.com/catalogue/category/books/romance_8/index.html ----- -- Continuing the work from Question 2, get additional information about the links, including (of course) the `.get_attribute('href')` method may be useful -- Use an `if` statement to check whether the category name `Romance`, and if so, then get the `Romance` category link -==== - -=== Question 4 (2 points) - -.. Starting from the homepage of Romance category "https://books.toscrape.com/catalogue/category/books/romance_8/index.html", please get the titles of all of the books from the `Romance` category's first webpage. -.. Find the next pagination link from the `Romance` category of the first webpage. Next, get all of the book titles, from the second page of the `Romance` category. - -[TIP] -==== -- For next page link, look for an html element with a class usually like 'pager', 'pagination' or similar for a pagination, in the Romance webpage. It will be -
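One possible way to combine the hints in the tip below is sketched here; it reuses the `driver` and the category elements from Question 2, and the variable names are only suggestions:

[source,python]
----
# Sketch for Question 3: print every category name with its link (3a),
# and keep only the Romance link (3b). Variable names are just suggestions.
romance_url = None
for link in driver.find_elements(By.CSS_SELECTOR, ".nav-list>li>ul>li>a"):
    name = link.text.strip()
    url = link.get_attribute("href")        # Selenium returns the full (absolute) URL
    print(f"Category: {name}, link: {url}")
    if name == "Romance":
        romance_url = url

print(f"romance_url is {romance_url}")
----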
- -You will need to extract the `href` attribute of a tag -==== -[TIP] -==== -- Selenium has a "click()" method, to allow you to dynamically click a link in a webpage. -==== - -=== Question 5 (2 points) - -.. In project 2 and Project 3 we used two different libraries, `BeautifulSoup` and `Selenium`, to accomplish the very similar tasks. Please briefly outline the similarities and differences between these two libraries (BeautifulSoup versus Selenium). - - - -Project 03 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project03.ipynb` -* Python file with code and comments for the assignment - ** `firstname-lastname-project03.py` -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project04.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project04.adoc deleted file mode 100644 index 7449a243b..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project04.adoc +++ /dev/null @@ -1,188 +0,0 @@ -= TDM 20200: Project 4 -- 2024 - -**Motivation:** It is worthwhile to learn how to parse hundreds of thousands of files systematically. We practice this skill, step by step. - -**Context:** We return to the over-the-counter medications from Project 1, aiming to extract the ingredient substances from each medication, and creating a tally of all of the ingredient substances. - -**Scope:** Python, XML - -.Learning Objectives -**** -- Extract XML content from a large number of files and summarize some of the content in the files. -**** - - -== Dataset(s) - -The following questions will use the following dataset(s): - -- `/anvil/projects/tdm/data/otc/archive1` - -through - -- `/anvil/projects/tdm/data/otc/archive10` - - -[NOTE] -==== -- It is worthwhile to review the skills learned in Project 1, in which we analyzed the files: - -`/anvil/projects/tdm/data/otc/valu.xml` - -and - -`/anvil/projects/tdm/data/otc/hawaii.xml` - -- We already downloaded and uncompressed the 10 zip files from the website https://dailymed.nlm.nih.gov/dailymed/spl-resources-all-drug-labels.cfm for the `Human OTC Labels` data. We only stored the `.xml` files and removed the `.jpg` files (we do not need the `.jpg` files for this project). - -- The uncompressed files are stored in the directories: - -- `/anvil/projects/tdm/data/otc/archive1` - -through - -- `/anvil/projects/tdm/data/otc/archive10` - - -==== - -[NOTE] -==== -When building the dictionary, you will see that Dr Ward writes: - -[source,python] ----- -if mystring not in mydict: - mydict[mystring] = 1 -else: - mydict[mystring] += 1 ----- -Some of you will not have worked with dictionaries too much in the past. Dictionaries start out empty, and you need to add the words as you go. - -Alternatively, if you want to, you can just write: -[source,python] ----- -mydict[mystring] = mydict.get(mystring, 0) + 1 ----- -This approach is a little bit cleaner, but I didn't know if you would understand it. 
If `mystring` is missing from the dictionary, then `mydict.get(mystring, 0)` will return 0, so that `mydict[mystring]` is created with a value of 1. Otherwise, if `mystring` is already in the dictionary, then `mydict.get(mystring, 0)` will return the previous number of occurrences, so that `mydict[mystring]` gets bumped up by 1, to account for the new occurrence of `mystring`. - -Either approach is OK, and you might have another Pythonic way that you want to handle this step in creating the dictionary. -==== - -[IMPORTANT] -==== -Dr Ward created 8 videos to help with this project. - -https://the-examples-book.com/programming-languages/python/analyzing-otc-ingredient-substances -==== - -== Questions - -=== Question 1 (2 points) - -Run the lines: - -[source,python] ----- -import pandas as pd -import lxml.etree -import glob ----- - -[loweralpha] -.. Remind yourself how to extract the ingredient substances from each of these two files: `/anvil/projects/tdm/data/otc/valu.xml` and `/anvil/projects/tdm/data/otc/hawaii.xml` - -For each of these two files, print a list of all ingredient substances (it is OK if some are repeated; also, do not worry about which ingredient that the ingredient substances come from). For instance, if you extract the ingredient substances from the file - -`/anvil/projects/tdm/data/otc/valu.xml` - -you should get these ingredient substances: - -[source,bash] ----- -HYPROMELLOSES -MINERAL OIL -POLYETHYLENE GLYCOL, UNSPECIFIED -POLYSORBATE 80 -POVIDONE, UNSPECIFIED -... blah blah blah ... -STARCH, CORN -SODIUM STARCH GLYCOLATE TYPE A CORN -STEARIC ACID -TITANIUM DIOXIDE -ACETAMINOPHEN ----- - -or if you extract the ingredient substances from the file - -`/anvil/projects/tdm/data/otc/hawaii.xml` - -you should get these ingredient substances: - -[source,bash] ----- -DIBASIC CALCIUM PHOSPHATE DIHYDRATE -WATER -SORBITOL -SODIUM LAURYL SULFATE -CARBOXYMETHYLCELLULOSE SODIUM, UNSPECIFIED FORM -... blah blah blah ... -WHITE WAX -MANGIFERA INDICA SEED BUTTER -ROSEMARY OIL -TOCOPHEROL -ZINC OXIDE ----- - -=== Question 2 (2 points) - -.. Use this Python code: `for myfile in glob.glob("/anvil/projects/tdm/data/otc/archive1/*.xml")[0:11]` and use also this code: `tree = lxml.etree.parse(myfile)` to loop over the first eleven files in the `archive1` directory. Print all of the ingredient substances from these first eleven files. -.. Make a Python *dictionary* (called a `dict` in Python) from the ingredient substances, keeping track of the number of times that each ingredient substance occurs. - -=== Question 3 (2 points) - -.. Convert the dictionary from question 2b to a data frame. -.. Sort the dataframe according to the counts, and print the 5 most popular ingredient substances from those 10 files, and the number of times that each of these 5 most popular ingredient substances occurs. Your output should contain: - -[source,bash] ----- -COCAMIDOPROPYL BETAINE 60 -FD&C BLUE NO. 1 70 -CITRIC ACID MONOHYDRATE 87 -GLYCERIN 93 -WATER 114 ----- - -=== Question 4 (2 points) - -.. Now analyze the first 1000 files from the `archive1` directory, and print the output that shows the 5 most popular ingredient substances from those 1000 files, and the number of times that each of these 5 most popular ingredient substances occurs. -.. Now try to analyze all of the files from the `archive1` directory. Likely, *your work will fail*, because there is at least one enormous file that needs a little bit fancier parsing method! 
So you can add these lines: - -[source,python] ----- -from lxml.etree import XMLParser, parse -p = XMLParser(huge_tree=True) ----- - -and then add the parameter `parser=p` to your `parse` statement. Now you can analyze all of the files from the `archive1` directory. Print output that shows the 5 most popular ingredient substances from all of the files (altogether) in the `archive1` directory, and the number of times that each of these 5 most popular ingredient substances occurs. - - -=== Question 5 (2 points) - -.. Now analyze all of the files in all 10 directories `archive1` through `archive10`, and print output that shows the 5 most popular ingredient substances from all of the files (altogether) in these 10 directories, and the number of times that each of these 5 most popular ingredient substances occurs. - -Project 04 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project04.ipynb` -* Python file with code and comments for the assignment - ** `firstname-lastname-project04.py` -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project05.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project05.adoc deleted file mode 100644 index f103823d8..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project05.adoc +++ /dev/null @@ -1,220 +0,0 @@ -= TDM 20200: Project 5 -- 2024 - -**Motivation:** We practice the skill of scraping information about books from a real publisher's website. (We only do this for academic purposes and not for commercial purposes.) - -**Context:** We use the Selenium skills that we learned in Project 3 to extract information about books from the publisher O'Reilly's website. - -**Scope:** Python, XML, Selenium - -.Learning Objectives -**** -- Extract XML content from a real-world website using Selenium. -**** - - -== Dataset(s) - -The following questions will examine the No Starch Press books that are available from this website: - -`https://www.oreilly.com/search/` - -[NOTE] -==== -No Starch Press is one of the Publishers available from the dropdown menu on the left-hand-side of the website. At the moment, there are 350 books available from No Starch Press. (This might change during the project, if more books are published in the next week or two.) -==== - -[IMPORTANT] -==== -The order of the books are dynamic. For this reason, if you look at the books in a browser on your computer, at the same time that you are scraping books from the O'Reilly website, the order of the books is likely to change. 
-==== - -[TIP] -==== -Since there are 350 No Starch Press books available at present, if we load 100 books per webpage, we can see all 350 books by scraping these 4 webpages: - -`https://www.oreilly.com/search/?publishers=No%20Starch%20Press&type=book&rows=100&page=1` - -`https://www.oreilly.com/search/?publishers=No%20Starch%20Press&type=book&rows=100&page=2` - -`https://www.oreilly.com/search/?publishers=No%20Starch%20Press&type=book&rows=100&page=3` - -`https://www.oreilly.com/search/?publishers=No%20Starch%20Press&type=book&rows=100&page=4` -==== - - -[TIP] -==== -We prepare to work with Selenium in Python as follows: - -[source,python] ----- -from selenium import webdriver -from selenium.webdriver.common.by import By -from selenium.webdriver.firefox.service import Service -from webdriver_manager.firefox import GeckoDriverManager -from selenium.webdriver.firefox.options import Options - -firefox_options = Options() -firefox_options.add_argument("--headless") -firefox_options.add_argument("--disable-extensions") -firefox_options.add_argument("--no-sandbox") -firefox_options.add_argument("--disable-dev-shm-usage") - -driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()),options=firefox_options) ----- -==== - -and we can load the pages, for instance, here is the second page, like this: - -`driver.get("https://www.oreilly.com/search/?publishers=No%20Starch%20Press&type=book&rows=100&page=2")` - - -[IMPORTANT] -==== - -Each book entry is wrapped in the following (example) XML code. Some XML entries have been removed, so that there might be other siblings and/or children that are not shown here. This example is from the XML for the book `Rust for Rustaceans` by Jon Gjengset. - -[source,none] ----- - ----- -==== - -[WARNING] -==== -It is necessary to give each page a few seconds to load. Otherwise, the query for a page might end up blank. Therefore, it is advisable to SLOWLY load one cell at a time, when you are checking your work, waiting a few seconds in between each cell. This allows the O'Reilly pages to load in the browser. -==== - -[IMPORTANT] -==== -Dr Ward created 9 videos to help with this project. - -https://the-examples-book.com/programming-languages/python/scraping-no-starch-press-books -==== - -== Questions - -=== Question 1 (2 points) - -[loweralpha] -.. Load the formats for all 350 entries into a list of length 350, and make sure that each entry says `Format: Book`. -.. Load the publisher for all 350 entries into a list of length 350, and make sure that each entry says `Publisher: No Starch Press`. - -[NOTE] -==== -For question 1a, you can use the XPath + -`//article/div/div/span` + -in Selenium and extract the text. - -For question 1b, you can use the XPath + -`//article/div/section/div/div/a[contains(@href, '/publisher/')]` + -in Selenium and extract the text. -==== - -=== Question 2 (2 points) - -From the first page of 100 entries: - -[loweralpha] -.. Extract a list of the 100 URLs -.. Extract a list of the 100 titles -.. Extract a list of the 100 authors (it is OK if the word "By" is included in each author result) -.. Extract a list of the 100 dates -.. Extract a list of the 100 pages - -[NOTE] -==== -For the URLs, use XPath + -`//article/div/section/a[@tabindex='-1']` + -and extract the attribute `href`. - -For the titles, use XPath + -`//article/div/section/a/h3` + -and extract the text. - -For the authors, use XPath + -`//article/div/section/div/div[contains(@data-testid, 'search-card-authors')]` + -and extract the text. 
- -For the dates, use XPath + -`//article/div/section/div/div[contains(@data-testid, 'publication-details')]/span[not(@class)]` + -and extract the text. - -For the pages, use XPath + -`//article/div/section/footer` + -and extract the text. -==== - -=== Question 3 (2 points) - -Extract the content from pages 2, 3, and 4 (i.e., from the next 250 entries), and add this content to the lists from question 2, so that you have altogether: - -[loweralpha] -.. A list of the 350 URLs -.. A list of the 350 titles -.. A list of the 350 authors (it is OK if the word "By" is included in each author result) -.. A list of the 350 dates -.. A list of the 350 pages - -[NOTE] -==== -You might want to use a for loop, but if you do, it is worthwhile to `import time` and to `time.sleep(10)` after loading a new driver page, before extracting information from it. It is also worthwhile to `extend` the elements of one list onto another list. -==== - - -=== Question 4 (2 points) - -.. For the list of pages, remove the phrase " pages" (including the space) and the remove the commas, and then convert from strings to integers. -.. Now make a data frame of the URLs, titles, authors, dates, and (the new numeric) pages. - -=== Question 5 (2 points) - -.. If you drop the duplicates from your data frame in Question 4b, you will likely not (yet) have 350 distinct No Starch Press books. Repeat the steps above, building (say) one or two more data frames, until you have all 350 distinct titles. -.. Once you have all 350 distinct titles in a data frame, sort the results by the date column, and find which month-and-year pair had the largest number of pages written. - -[NOTE] -==== -You should find that, in June 2021, there were a total of 3096 pages written, in these 7 books: - -[source,none] ----- -https://learning.oreilly.com/library/view/-/9781098128999/ How Cybersecurity Really Works By Sam Grubb June 2021 216 -https://learning.oreilly.com/library/view/-/9781098129019/ Deep Learning By Andrew Glassner June 2021 768 -https://learning.oreilly.com/library/view/-/9781098129033/ Learn to Code by Solving Problems By Daniel Zingaro June 2021 336 -https://learning.oreilly.com/library/view/-/9781098128982/ The Art of WebAssembly By Rick Battagline June 2021 304 -https://learning.oreilly.com/library/view/-/9781098128975/ Arduino Workshop, 2nd Edition By John Boxall June 2021 440 -https://learning.oreilly.com/library/view/-/9781098129002/ Hardcore Programming for Mechanical Engineers By Angel Sola Orbaiceta June 2021 600 -https://learning.oreilly.com/library/view/-/9781098129026/ The Big Book of Small Python Projects By Al Sweigart June 2021 432 ----- - -==== - - -Project 04 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project05.ipynb` -* Python file with code and comments for the assignment - ** `firstname-lastname-project05.py` -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project06.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project06.adoc deleted file mode 100644 index 1e604613e..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project06.adoc +++ /dev/null @@ -1,120 +0,0 @@ -= TDM 20200: Project 6 -- 2024 - -**Motivation:** Being able to analyze and create good visualizations is a skill that is invaluable in many fields. It can be pretty fun too! In this project, we are going to dive into plotting using `plotly`, a very popular open source graphing library that can interact graphs online. - -**Context:** Create good visualizations about housing data using `plotly`. In the next project, we will continue to learn about and become comfortable using `plotly`. - -**Scope:** Python, visualizations, plotly - -.Learning Objectives -**** -- Demonstrate the ability to create basic graphs with default settings. -- Demonstrate the ability to modify axes labels and titles. -- Demonstrate the ability to customize a plot (color, shape/linetype). -**** - - -== Dataset(s) - -The following questions will use the following dataset: - -- `/anvil/projects/tdm/data/zillow/Metro_time_series.csv` - - -== Readings and Resources - -[NOTE] -==== -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -- You may refer to https://plot.ly/python[plot introduction] -- Read about the plotly express functions on https://plotly.com/python/plotly-express/[this] page. -==== - -== Questions - -[IMPORTANT] -==== -Dr Ward created 4 videos to help with this project. - -https://the-examples-book.com/programming-languages/python/plotly-examples -==== - -=== Question 1 (2 points) - -[loweralpha] -.. Read in the data and called it `myDF`. -.. Update the column `Date` to the `datetime` data type. -.. For each `Date`, sum the values in the column `InventoryRaw_AllHomes`. -.. Now plot a line chart that shows the overall housing inventory (i.e., the results from question 1c), for the date in the years 2010 through 2015. - - -[TIP] -==== -- You might choose to use `to_datetime()` to convert the data type in part b. -==== - -[WARNING] -==== -Some students turned in the previous project without executing their code in their Jupyter Lab notebook! Please remember that you need to run all of your cells before submitting your Jupyter Lab notebook. -==== - - -=== Question 2 (2 points) - -For this question, `United_States` is one of the regions, but we want you to ignore that region in this question. - -.. Using the column `InventoryRaw_AllHomes`, group the data according to the `RegionName`, and find the `mean` of this column for each `RegionName`. -.. Similar question: Using the column `AgeOfInventory`, group the data according to the `RegionName`, and find the `mean` of this column for each `RegionName`. -.. Now make a bar chart from your results in part 2a, to visualize the top 5 regions with the highest inventory of homes (on average, in those regions). -.. Similarly, use the results from 2b, to make a bar chart to visualize the top 5 regions with the oldest inventory of homes (on average, in those regions). - -[TIP] -==== -- `groupby()` and `mean()` are useful to get the average values -- `sort_values()` is useful to sort data in order -==== - -=== Question 3 (2 points) - -.. 
Convert your work from question 2d into a box plot, so that you can visualize the top 5 regions with the oldest inventory of homes. - -[TIP] -==== -- You may refer to the box plot session in the https://plot.ly/python[plot introduction] -==== - -=== Question 4 (2 points) - -.. Extract the data from the columns `Date` and `ZHVI_SingleFamilyResidence`, for the `RegionName` 28580. -.. Use this data to demonstrate how the Zillow Home Value Index for Single Family Homes has fluctuated in the `RegionName` 28580 during the available time period in the data set. - -[TIP] -==== -- Be certain to document your work, i.e., how did you make the plot, and what did you learn from the plot? -==== - - -=== Question 5 (2 points) - -.. Now make a data visualization of your own, using the data set. Be sure to explain why you decided to choose the data that you analyzed from this data set, and be sure to explain your reasoning. Also be sure to explain the type of visualization method that you chose. Your work should be well documented throughout this question. - -[WARNING] -==== -Just (gently) repeating the earlier warning: Some students turned in the previous project without executing their code in their Jupyter Lab notebook! Please remember that you need to run all of your cells before submitting your Jupyter Lab notebook. -==== - -Project 06 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project06.ipynb` -* Python file with code and comments for the assignment - ** `firstname-lastname-project06.py` -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project07.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project07.adoc deleted file mode 100644 index 24b5ef46d..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project07.adoc +++ /dev/null @@ -1,150 +0,0 @@ -= TDM 20200: Project 7 -- 2024 - -**Motivation:** Dash is a web application framework. It can create web based visualizations. It provides a simple way to create interactive user interfaces and dashboards for data visualization. Dash uses Plotly for its plotting capabilities. - -**Context:** Create good data visualizations using `plotly` and `Dash` - -**Scope:** Python, visualizations, plotly, Dash, DashBoards, Visual Studio Code - -.Learning Objectives -**** -- Create a development environment to building a dashboard on Anvil -- Develop skills and techniques to create a dashboard using `plotly`, `dash`, `Visual Studio Code` -**** - -== Dataset(s) - -The following questions will use the following dataset: - -`/anvil/projects/tdm/data/zillow/Metro_time_series.csv` - - -== Readings and Resources - -[NOTE] -==== -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
-- You may refer to https://plot.ly/python[plot introduction] -- Read to the plotly express functions on https://plotly.com/python/plotly-express/[this page]. -- https://dash.plotly.com/tutorial["dash tutorial"] (in Python) -- https://dash.plotly.com/layout["Dash Layout"] -- https://the-examples-book.com/starter-guides/tools-and-standards/vscode[Visual Studio Code Starter Guide] -- https://code.visualstudio.com/docs/introvideos/codeediting[Code Editing in Visual Studio Code] -==== - -[IMPORTANT] -==== -Dr Ward created 5 videos to help with this project. - -https://the-examples-book.com/programming-languages/python/making-VS-Code-work-in-Anvil-and-Using-Dash-apps -==== - -== Questions - -=== Question 1 (2 points) - -[loweralpha] -.. Following the instructions on https://the-examples-book.com/starter-guides/tools-and-standards/vscode[Visual Studio Code Starter Guide], launch VS Code on Anvil -.. Following the instructions on https://the-examples-book.com/starter-guides/tools-and-standards/vscode[Visual Studio Code Starter Guide], including the Initial Configuration of VS Code, and install the Python extension for Visual Studio -.. Following the instructions on https://the-examples-book.com/starter-guides/tools-and-standards/vscode[Visual Studio Code Starter Guide], select the Python interpreter path to `/anvil/projects/tdm/apps/lmodbin/python-seminar/python3` -.. Open a new terminal in Visual Studio Code. Show the current path for the Python interpreter. - -[TIP] -==== -For Question 1a, follow the sections of "How can I launch VS Code on Anvil?" Choose the following options: - - - Allocation: "cis220051" - - CPU Cores: 1 - - Starting Directory: "Home Directory" - -If you get "Do you trust the authors of the files in this folder" window pop up, click "Yes, I trust the authors" -==== - -[TIP] -==== -Refer to the "Initial Configuration of VS Code" section to install the Python extension - -Refer to https://code.visualstudio.com/docs/introvideos/codeediting[Code Editing in Visual Studio Code] to see how to edit and run code - -Use the following command to find the path for the Python interpreter. - -[source,python] ----- -which python3 ----- -.output -/anvil/projects/tdm/apps/lmodbin/python-seminar/python3 -==== - -=== Question 2 (2 points) - -.. Create a python file in Visual Studio Code, name it "who_am_I.py", simply to create and display a paragraph of your self-introduction that you may use it at a job interview -.. Run the file through terminal - -[TIP] -==== -The following command may be used to run a python file: - -[source,python] -python3 who_am_I.py -==== - -=== Question 3 (2 points) - -[NOTE] -==== -After you setup `Visual Studio Code`, we can use it to create our Dash app. Use the provided code to find an available port and a suitable prefix for our Dash app. - -[source,python] ----- -import os -from dash import Dash, html, dcc - -vscprefix = os.getenv("VSCODE_PROXY_URI") -vscprefix = vscprefix.replace("https://ondemand.anvil.rcac.purdue.edu", "") -vscprefix = vscprefix.replace("{{port}}/", "") - -# Try many ports in this range until we find one that is free -for port in range(8050, 8500): - print(f"Trying port {port}...") - prefix=vscprefix+str(port)+'/' - - app = Dash("mytest", requests_pathname_prefix=prefix) - - app.layout = html.Div([ - html.Div(children='Hello World') - ]) - - try: - app.run(port=port, debug=False) - #If we get here, the app ran so we can exit - break - except: - continue # If we get here, the port was busy, let's try the next port ----- -==== -.. 
Run the code above and make sure that you can see your Dash app appear in a window on your Firefox browser. - - -=== Question 4 (2 points) - -.. Close the Dash app window that you opened in Question 3. Copy the file from Question 3 into a new file for Question 4. Now add a few more lines of text output to the Dash app, which display your self-introduction online (anvil local hosted webpage). - -=== Question 5 (2 points) - -.. Close the Dash app window that you opened in Question 4. Copy the file from Question 4 into a new file for Question 5. Now please create a dash app to do Project 6 question 2d: "make a bar chart to visualize the top 5 regions with the oldest inventory of homes (on average, in those regions)". - - -Project 07 Assignment Checklist -==== -* 4 Python files: one file for each of Question 2, Question 3, Question 4, Question 5 -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== - diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project08.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project08.adoc deleted file mode 100644 index 0862a985e..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project08.adoc +++ /dev/null @@ -1,118 +0,0 @@ -= TDM 20200: Project 8 -- 2024 - -**Motivation:** Spark uses a distributed computing model to process data, which means that data is processed in parallel across a cluster of machines. PySpark is a Spark API that allows you to interact with Spark through the Python shell, or in Jupyter Lab, or in DataBricks, etc. PySpark provides a way to access Spark's framework using Python. It combines the simplicity of Python with the power of Apache Spark. - -**Context:** Understand components of Spark's ecosystem that PySpark can use - -**Scope:** Python, Spark SQL, Spark Streaming, MLib, GraphX - -.Learning Objectives -**** -- Develop skills and techniques to use PySpark to read a dataset, perform transformations like filtering, mapping and execute actions like count, collect -- Understand how to use PySpark SQL to run SQL queries -**** - -== Dataset(s) - -The following questions will use the following dataset: - -- `/anvil/projects/tdm/data/whin/weather.parquet` - - -== Readings and Resources - -[NOTE] -==== -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -- https://the-examples-book.com/starter-guides/data-engineering/containers/pyspark[PySpark] -- https://spark.apache.org/docs/latest/[Apache Spark] -- https://sparkbyexamples.com/[Spark Examples] -- https://www.analyticsvidhya.com/blog/2022/10/most-important-pyspark-functions-with-example/[PySpark Examples] -==== - -[WARNING] -==== -You need to use 2 cores for your Jupyter Lab session for Project 8 this week. -==== - - -[IMPORTANT] -==== -Dr Ward created 5 videos to help with this project. - -https://the-examples-book.com/programming-languages/python/introductiontoSparkSQL -==== - -== Questions - -=== Question 1 (2 points) - -.. 
Run the example from https://the-examples-book.com/starter-guides/data-engineering/containers/pyspark[Example book - PySpark], in Pandas and then in Spark. Make sure to show the time used by each of these methods. - -.. Comment on the different speeds needed, to process data using Pandas vs PySpark. - - -=== Question 2 (2 points) - -.. Run the following code to initiating a PySpark application. The name of our PySpark session is `sp` but you may use a different name. This is the entry point to using Spark's functionality with a DataFrame and with the SQL API. -.. Read the file `/anvil/projects/tdm/data/whin/weather.parquet` into a PySpark DataFrame called `myDF` -.. Show the first 5 rows of the resulting PySpark DataFrame. - -[source,python] ----- -import pyspark -from pyspark.sql import SparkSession -from pyspark.sql.functions import * -sp = SparkSession.builder.appName('TDM_S').config("spark.driver.memory", "2g").getOrCreate() ----- - -=== Question 3 (2 points) - -.. List the DataFrame's column names and data types -.. How many rows are in the DataFrame? -.. How many unique `station_id` values are in the data set? - -[TIP] -==== -- The `printSchema()` function is useful to explore a DataFrame's structure. -- You may use the `select()` function to select the column, and you can use the `distinct()` and `count()` functions to get the distinct values. -==== - -=== Question 4 (2 points) -.. Create a Temporary View called `weather` from the PySpark DataFrame `myDF`. -.. Run a SQL Query to get the total number of records for each station. -.. Run a SQL Query to get the maximum wind speed recorded by each station. -.. Run a SQL Query to the average temperature recorded by each station. - -[TIP] -==== -- `createOrReplaceTempView()` is useful for part 4a. -==== - -[TIP] -==== -- You may refer to `sp.sql()` -- Use `GROUP BY station_id` to group together the data from each `station_id` before performing the Spark SQL query. -==== - -=== Question 5 (2 points) - -.. Explore the DataFrame, and run 2 SQL Queries of your own choosing. Explain the meaning of your two queries and what you learn from the queries. - - -Project 08 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and outputs for the assignment - ** `firstname-lastname-project08.ipynb` -* Python file with code and comments for the assignment - ** `firstname-lastname-project08.py` - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project09.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project09.adoc deleted file mode 100644 index ba89d8147..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project09.adoc +++ /dev/null @@ -1,204 +0,0 @@ -= TDM 20200: Project 9 -- 2024 - -**Motivation:** Spark uses a distributed computing model to process data, which means that data is processed in parallel across a cluster of machines. 
PySpark is a Spark API that allows you to interact with Spark through the Python shell, or in Jupyter Lab, or in DataBricks, etc. PySpark provides a way to access Spark's framework using Python. It combines the simplicity of Python with the power of Apache Spark. - -**Context:** This is the second project in which we will continue to understand components of Spark's ecosystem that PySpark can use - -**Scope:** Python, Spark SQL, Spark Streaming - -.Learning Objectives -**** -- Develop skills and techniques to use PySpark to read a dataset, perform transformations like filtering, mapping and execute actions like count, collect -- Understand how to use Spark Streaming -**** - -== Dataset(s) - -The following questions will use the following dataset: - -`/anvil/projects/tdm/data/amazon/amazon_fine_food_reviews.csv` - - -== Readings and Resources - -[NOTE] -==== -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -- https://the-examples-book.com/starter-guides/data-engineering/containers/pyspark[PySpark] -- https://spark.apache.org/docs/latest/[Apache Spark] -- https://sparkbyexamples.com/[Spark Examples] -- https://www.analyticsvidhya.com/blog/2022/10/most-important-pyspark-functions-with-example/[PySpark Examples] -- https://spark.apache.org/docs/3.1.3/api/python/index.html[PySpark Documentation] -==== - -[IMPORTANT] -==== -We added https://the-examples-book.com/programming-languages/python/hints-for-TDM-20200-project-9[five new videos] to help you with Project 9. -==== - -[WARNING] -==== -You need to use 2 cores for your Jupyter Lab session for Project 9 this week. -==== - - -== Questions - -=== Question 1 (2 points) - -.. Create a PySpark session, and then load the dataset using PySpark. -.. Calculate the average `Score` for the reviews, grouped by `ProductId`. (There are 74258 `ProductId` values, so you do not need to display them all. If you `show()` the results, only 20 of the 74258 `ProductId` values and their average `Score` values will appear. That is OK for the purposes of this question.) -.. Save the output for all 74258 `ProductId` values and their average `Score` values to a file named `averageScores.csv`. - -[TIP] -==== -You may import the following modules: - -[source, python] ----- -from pyspark.sql import SparkSession -from pyspark.sql.functions import avg ----- - -While reading the csv file into a data frame, you may need to specify the option that tells PySpark that there are headers. Otherwise, the header will be treated as part of the data itself. -[source,python] ----- -read.option("header","true") ----- - -You may use the following option to make the column names accessible as DataFrame attributes. -[source,python] ----- -option("inferSchema","true") ----- - -After all the operations are complete, you may need to close the SparkSession. -[source,python] ----- -spark.stop() ----- - -A PySpark DataFrame's `write()` method is useful to write the results into a file. Here we give sample code that describes how to write a csv file to the current directory. - -[source,python] ----- -someDF.write.csv("file.csv",header= True) ----- -==== - -[TIP] -==== -It is not necessary to submit the file with the project solutions. -==== - - -=== Question 2 (2 points) - -.. Use PySpark SQL to calculate the average helpfulness ratio (HelpfulnessNumerator/HelpfulnessDenominator) for each product. -.. 
Save the output for all 74258 `ProductId` values and their average helpfulness ratio values to a file named `averageHelpfulness.csv`. - -[TIP] -==== -- You may need to use `filter()` to exclude rows with zeros in the column `HelpfulnessDenominator`, as follows: - -[source,python] ----- -filteredDF = myDF.filter(col("HelpfulnessDenominator")>0) ----- - -The `withColumn()` is useful for adding a new column to a DataFrame. For instance, in this example, the first argument is the new column, and the second argument specifies how the values of the new column value are to be created. - -[source,python] ----- -filteredDF.withColumn("HelpfulnessRatio",col("HelpfulnessNumerator") / col("HelpfulnessDenominator")) ----- - -A few more notes: - -- `groupBy('ProductId')` will perform the aggregation for each product -- `agg()` is useful for performing aggregation operations on the grouped data. It can take different kinds of aggregations as its argument, for instance, `avg`, `max`, `min` etc. -- Refer to https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.withColumn.html[.withColumn] -==== - -[TIP] -==== -It is not necessary to submit the file with the project solutions. -==== - -=== Question 3 (2 points) - -In questions 1 and 2, we used the batch processing mode to do the data processing. In other words, the dataset is processed in one go. Alternatively, we can use Spark Streaming concepts. This technique would allow us to even work on a data set in which the data is being provided in a real-time data stream. (In this project, we are just working with a fixed data set, but we still want students to know how to work with streaming data.) - -.. Please count the number of reviews for each `ProductId`, in a streaming style (simulating a real-time data monitoring and analytics). -.. Display the results from 20 rows of the output. - -[TIP] -==== -- To simplify the data processing, we will use the directory `/anvil/projects/tdm/data/amazon/spark` (which has a copy of the csv file in this directory) -- You may refer to the following statements to get the source directory for the dataset - -[source,python] ----- -import os -from pyspark.sql import SparkSession -from pyspark.sql.functions import count - -# Create a PySpark session -spark = SparkSession.builder.appName("Amazon Fine Food Reviews Streaming").getOrCreate() - -data_path = "/anvil/projects/tdm/data/amazon/spark/" -myschema = spark.read.option("header", "true").option("inferSchema", "true").csv(data_path) -streamingDF = spark.readStream.schema(myschema.schema).option("header", "true").csv(data_path) ----- - -You may use a `start()` method on the query to start the streaming computation. You may also an `awaitTermination()` method, to keep the application running indefinitely (until manually stopped, or until an error occurs). This will allow Spark to continuously process incoming data. -==== - -[IMPORTANT] -==== -- You may need to restart the kernel if you make a new Spark session. -==== - - -=== Question 4 (2 points) - -Use a streaming session like you did in Question 3. - -.. Display the `ProductId` values and `Score` values for the first 20 rows in which the `Score` is strictly larger than 3. Output these values to the screen as the new data arrives in the streaming session. - - -[TIP] -==== -Filtering streaming data for reviews with a score strictly greater than 3 is a straightforward operation. 
You may use a filter condition on the streaming DataFrame, for instance, like this - -[source,python] ----- -.select("ProductId","Score").where("Score > 3") ----- - -It is also necessary to remove the `.outputMode("complete")` because we are no longer aggregating results from a complete stream. Instead, we are just outputting first 20 results that satisfy the given criteria that the `Score` is strictly larger than 3. -==== - - - -=== Question 5 (2 points) - -.. Please state your understanding of PySpark streaming concepts in 2 or more sentences. - - -Project 09 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and outputs for the assignment - ** `firstname-lastname-project09.ipynb` -* Python file with code and comments for the assignment - ** `firstname-lastname-project09.py` - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project10.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project10.adoc deleted file mode 100644 index 77f268ed0..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project10.adoc +++ /dev/null @@ -1,211 +0,0 @@ -= TDM 20200: Project 10 -- 2024 - -**Motivation:** Machine learning and AI are huge buzzwords in industry. In this project, we will delve into an introduction of some machine learning related libraries in Python, like `tensorflow` and `scikit-learn`. We aim to understand some basic machine learning workflow concepts. - -**Context:** The purpose of these projects is to give you exposure to machine learning tools, some basic functionality, and to show _why_ they are useful, without needing any special math or statistics background. We will try to build a model to predict the arrival delay (ArrDelay) of flights, based on features like departure delay, distance of the flight, departure time, arrival time, etc. - -**Scope:** Python, tensorflow, scikit-learn - -== Dataset - -`/anvil/projects/tdm/data/flights/2014.csv` - -== Readings and Resources - -[NOTE] -==== -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html[Pandas read_csv] -- https://scikit-learn.org/stable/documentation.html[scikit-learn documentation] -- https://scikit-learn.org/stable/tutorial/index.html[scikit-learn tutorial] -- https://www.tensorflow.org/tutorials[tensorflow tutorial] -- https://www.youtube.com/tensorflow[youtube for tensorflow] - -==== - -[WARNING] -==== -You need to use 2 cores for your Jupyter Lab session for Project 10 this week. -==== - -[TIP] -==== -You can use `pd.set_option('display.max_columns', None)` if you want to see all of the columns in a very wide data frame. -==== - -[IMPORTANT] -==== -We added https://the-examples-book.com/programming-languages/python/hints-for-TDM-20200-project-10[five new videos] to help you with Project 10. 
BUT the example videos are about a data set with beer reviews. You need to (instead) work on the flight data given here: `/anvil/projects/tdm/data/flights/2014.csv` -==== - -== Questions - -=== Question 1 (2 points) - -For this project, we will only need these rows of the data set: - -[source, python] ----- -mycols = [ - 'DepDelay', 'ArrDelay', 'Distance', - 'CarrierDelay', 'WeatherDelay', - 'DepTime', 'ArrTime', 'Diverted', 'AirTime' -] ----- - -[loweralpha] -.. Load just a few rows of the data set. Explore the dataset columns, and figure out the data types for the following specific columns. Based on your exploration, define a dictionary variable called `my_col_types` that hold the column names and the types of each of the columns listed in `mycols`. -.. Now load the first 10,000 rows of the data set (but only the columns specified in `mycols`) into a data frame called `myDF`. - -[TIP] -==== -- You may refer to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html[pandas read_csv] to know how to read partial data. -- You may use the parameter `nrows=10000` in the `read_csv()` method. -==== - -=== Question 2 (2 points) - -.. Import the following libraries -+ -[source,python] ----- -import tensorflow as tf -from sklearn.model_selection import train_test_split -from sklearn.preprocessing import StandardScaler -import pandas as pd -import numpy as np -import time ----- - -.. For each column, fill in the missing values within that column, using the median value of that column. - -(Note: This works in this situation, because all of our columns contain numerical data. In the future, if you want to fill in missing values in a column that contains an `object` data type, we need to use a different procedure.) - -(Another note: We are filling in missing values because a machine learning model can be confused by missing values. Some machine learning models depend on every item, to make a decision.) - - - -=== Question 3 (2 points) - -Now let's look into how to prepare our features and labels for the machine learning model. You may use the following example code. - -[source,python] ----- -# Splitting features and labels -features = myDF.drop('ArrDelay', axis=1) -labels = myDF['ArrDelay'] ----- - -.. What is the difference between features and labels? -.. Considering the following example code, why do we need to have our data split into training and testing sets? - -[source,python] ----- -# Split -X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42) ----- - -[TIP] -==== -- In machine learning, `features` are the information used to make predictions, and `labels` are the outcomes for predictions. For instance, if we predict whether it will rain, the features might be wind speed, pressure, humidity, etc., and label would be rain or no rain. -- In the example code at the start of question 3, we are setting things up in this way: The features include all variables except the `ArrDelay`, and the only label is the variable `ArrDelay`. -==== - -[NOTE] -==== -- You may need to understand what are training and testing sets. The training set is used to train the model. The testing set will validate the model's predictive power. -- When we use `test_size=0.2`, we are specifying that 80% of the data will be used as training data, and 20% of the data will be used to test the model's performance. 
-- When we use `random_state=42` we are ensuring that the random number generator's sequence of numbers is reproducible across multiple runs of the code. Thus, we ensure that we get the same split of data into training and testing sets. This is basically seeding a random number generator, so that (by using the same seed each time) we get the same split each time. The value 42 is often used by convention; in other words, many people just use the value 42 to seed the random number generator. -==== - -=== Question 4 (2 points) - -.. Now let us standardize our data, using this example code. -+ -[source,python] ----- -scaler = StandardScaler() -X_train_scaled = scaler.fit_transform(X_train).astype(np.float32) -X_test_scaled = scaler.transform(X_test).astype(np.float32) ----- -+ -[NOTE] -==== -This is what scaling does to the data, and the reason why we need it for machine learning models: - -- Machine learning models usually assume all features are on a similar scale. So data need to be standardized to be in a common scale -.. Standardizing is like to translate and rescale every point on a graph to fit within a new frame, so the machine learning model can understand better -.. StandardScaler() is a function used to pre-process data before feeding it into a machine learning model -.. The StandardScaler adjusts data features so they have a mean of 0 and a standard deviation of 1, making models like neural networks perform better because they're sensitive to the scale of input data. -==== -.. Now let us slice our data, using this example code. -+ -[source,python] ----- -train_dataset = tf.data.Dataset.from_tensor_slices((X_train_scaled, y_train)).batch(14) -test_dataset = tf.data.Dataset.from_tensor_slices((X_test_scaled, y_test)).batch(14) ----- -+ -[NOTE] -==== -This is a brief description about how TensorFlow slices data: - -- `from_tensor_slices()` is a function that takes tuples of arrays (or tensors) as input, and outputs a dataset. Each element is a slice from these arrays in tuples format. Each element is a tuple of one row from `X` (features), and a corresponding row from `Y` (labels). This technique allows the model to match each input with a corresponding output. -- `batch(14)` divides the dataset into batches of 14 elements. Instead of feeding all of the data to the model at one time, the data then (instead) be processed iteratively, so that the computation is not too memory-intensive. -.. We can choose how many pieces of data are used at a time. For instance, we can use 14 slices at a time. The number of slices can impact the model's performance and how long it takes the model to learn. You may need to try different numbers, to figure out which works best. -==== - -=== Question 5 (2 points) - -.. Now (finally!) we will build a machine learning model, we will train it, and we will evaluate it using TensorFlow. The following example code defines a model architecture, compiles the model, trains the model on a dataset, and evaluates it on a separate dataset, to ensure the model's effectiveness. Please create and run the whole program, named: load the dataset, clean the data, specify the features and labels, specify the training and testing data, define the model, compile and train the model, and clean things up, after building the model. 
-+ -[source,python] ----- -# Define model -model = tf.keras.Sequential([ - tf.keras.layers.Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)), - tf.keras.layers.Dropout(0.2), - tf.keras.layers.Dense(1) -]) - -# Compile -model.compile(optimizer='adam', - loss='mean_squared_error', - metrics=['mean_absolute_error']) - -# Train -history = model.fit(train_dataset, epochs=10, validation_data=test_dataset) - -# Cleanup -del X_train_scaled, X_test_scaled, train_dataset, test_dataset - ----- - -In the next project, we will reflect on what we learned during this project. We will continue to explore! - -[NOTE] -==== -- Building a model includes defining the model structure, training it on data, and testing its performance. -- The example code defines a simple neural network model with layers, to find patterns in the dataset. -.. `tf.keras.Sequential()` defines the structure of the model and how it will learn from the data. It sets up the sequence of steps/layers. The model will process the layers to get patterns, and will learn from patterns, to make predictions. -.. `model.compile` sets up the model's learning method: using the `adam` algorithm to do adjustments, the `mean_squared_error` to measure the accuracy of the model's prediction, and the `mean_absolute_error` to average out how much the predictions differ from the real values. -.. `model.fit()` is the function that starts the learning process, using training data, and then checks the performance, with testing data. -.. `Epoch` is one complete pass through the entire training dataset. The model is set to go through 10 epochs. -==== - -Project 10 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and outputs for the assignment - ** `firstname-lastname-project10.ipynb` -* Python file with code and comments for the assignment - ** `firstname-lastname-project10.py` - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project11.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project11.adoc deleted file mode 100644 index 3afdd3a92..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project11.adoc +++ /dev/null @@ -1,104 +0,0 @@ -= TDM 20200: Project 11 -- 2024 - -**Motivation:** Machine learning and AI are huge buzzwords in industry, in this project, we will continue to learn more TensorFlow features. - -**Context:** The purpose of these projects is to give you exposure to machine learning tools, some basic functionality, and a conceptual workflow to create and use a model *without needing any special math or statistics background*. - -**Scope:** Python, tensorflow, scikit-learn, numpy - -== Dataset - -`/anvil/projects/tdm/data/flights/2019.csv` - -== Readings and Resources - -[NOTE] -==== -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. 
-- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html[Pandas read_csv] -- https://scikit-learn.org/stable/documentation.html[scikit-learn documentation] -- https://scikit-learn.org/stable/tutorial/index.html[scikit-learn tutorial] -- https://www.tensorflow.org/tutorials[tensorflow tutorial] -- https://www.youtube.com/tensorflow[youtube for tensorflow] -- https://joblib.readthedocs.io/en/latest/why.html[joblib dump() load()] -- https://proclusacademy.com/blog/explainer/regression-metrics-you-must-know/[metrics] -==== - -[WARNING] -==== -You need to use 2 cores for your Jupyter Lab session for Project 11 this week. -==== -[TIP] -==== -You can use `pd.set_option('display.max_columns', None)` if you want to see all of the columns in a very wide data frame. -==== - -[IMPORTANT] -==== -We added a video (below) to help you with Project 11. BUT the example video is about a data set with beer reviews. You need to (instead) work on the flight data given here: `/anvil/projects/tdm/data/flights/2014.csv` and also here `/anvil/projects/tdm/data/flights/2019.csv` -==== - -++++ - -++++ - - -== Questions - -=== Question 1 (2 points) - -[loweralpha] - -In the previous project you created a tensorflow model with limited data. Since it would need large data in order to create a meaningful tensorflow model, your model may not work well! Nonetheless, we can still learn how we can create (and use!) the model, and how to check the performance power of the model. - -.. First update your program from Project 10, building the `model` with more data. Please use `nrows = 100000` from the data set `2014.csv`. The test/training split should again be defined using: - -`X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)` - -and using `epochs=10` when training the model. - -[NOTE] -==== -- In question 1, we are using 2014 data to train the model. In question 2, we will use 2019 data to test the model. Please name your variables in a way that will enable you and the graders to both understand which variables are which. -==== - - -=== Question 2 (2 points) - -.. Read in 100000 lines of data from the `2019.csv` file. -.. Save the predicted arrival delays as `predicted_arrival_delays_100k_2019` (or something similar) -.. Save the actual arrival delays as `actual_arrival_delays_100k_2019` (or something similar) - - -=== Question 3 (2 points) - -Solve questions 1 and 2 again, this time using 500000 rows from the 2014 data and 500000 rows from the 2019 data. Be sure to change all of your variable names accordingly. - -=== Question 4 (2 points) - -Use the data from question 2 (with 100000 rows of data), to study the predicted arrival delays for 2019 versus the actual arrival delays for 2019. Please comment on what you find. - -Be sure to (please) provide some explanation about what you learn, and (likely) some visualizations to justify your work. - - -=== Question 5 (2 points) - -Now use the data from question 3 (with 500000 rows of data), to study the predicted arrival delays for 2019 versus the actual arrival delays for 2019. Please comment on what you find. Be sure to compare the effectiveness of using 100000 rows of data versus using 500000 rows of data. 
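[TIP]
====
The comparison method for Questions 4 and 5 is up to you. If you are unsure where to start, here is one possible sketch (not the required approach). It assumes that you already saved arrays with names like those suggested in Question 2, for example `predicted_arrival_delays_100k_2019` and `actual_arrival_delays_100k_2019`, and it uses `matplotlib` and scikit-learn's `mean_absolute_error`, neither of which is required by the project, to produce one summary number and one plot.

[source,python]
----
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error

# Hypothetical variable names from Question 2; substitute whatever names you used.
# (This assumes missing values were already filled in, as in Project 10 Question 2.)
y_true = np.asarray(actual_arrival_delays_100k_2019).ravel()
y_pred = np.asarray(predicted_arrival_delays_100k_2019).ravel()

# One summary number: on average, how many minutes off are the predictions?
print(f"MAE (model trained on 100000 rows of 2014 data): {mean_absolute_error(y_true, y_pred):.2f} minutes")

# One picture: predicted versus actual arrival delays, with the y = x line for reference.
plt.scatter(y_true, y_pred, s=2, alpha=0.3)
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
plt.plot(lims, lims, color="red")
plt.xlabel("Actual arrival delay (2019)")
plt.ylabel("Predicted arrival delay (2019)")
plt.title("Predicted vs actual arrival delays (2019)")
plt.show()
----

Repeating the same two steps with the 500000-row variables from Question 3 gives a direct, like-for-like comparison for Question 5.
====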
- - -Project 11 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and outputs for the assignment - ** `firstname-lastname-project11.ipynb` -* Python file with code and comments for the assignment - ** `firstname-lastname-project11.py` - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project12.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project12.adoc deleted file mode 100644 index 6c2e3a01a..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project12.adoc +++ /dev/null @@ -1,235 +0,0 @@ -= TDM 20200: Project 12 -- 2024 - -**Motivation:** Containers are everywhere and a very popular method of packaging an application with all of the requisite dependencies. This project we will learn some basics of containerization in a virtual environment using Alpine Linux. We first will start a virtual machine on Anvil, then create a simple container in the virtual machine. You may find more information about container and relationship between virtual machine and container here: https://www.redhat.com/en/topics/containers/whats-a-linux-container - -**Context:** The project is to provide very foundational knowledge about containers and virtualization, focusing on theoretical understanding and basic system interactions. - -**Scope:** Python, containers, UNIX - -.Learning Objectives -**** -- Improve your mental model of what a container is and why it is useful. -- Use UNIX tools to effectively create a container. -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - -=== Question 1 (2 pts) - -[loweralpha] - -.. Logon to Anvil and use a bash command to find an available port you may use later - -[TIP] -==== -Example code, you may modify or create your own code if needed -[source, bash] ----- - -#! /bin/bash - -for port in {1025..65535}; do - python3 -c "import socket; s=socket.socket(socket.AF_INET, socket.SOCK_STREAM); result=s.connect_ex(('127.0.0.1', $port)); s.close(); exit(result)" - if [ $? -ne 0 ]; then - echo "port $port open" - break - fi -done - ----- -==== - -=== Question 2 (2 pts) - -.. Launch a virtual machine (VM) on Anvil. (Note that Docker is already pre-installed on Anvil.) Submit the output showing the job id and process id, after you start a virtual machine; it should look like this, for example: - -[source,bash] ----- -.output -[1] 3152048 ----- - -[NOTE] -==== -The most popular containerization tool at the time of writing is likely Docker. We will Launch a virtual machine on Anvil (which already has Docker pre-installed). - -Open up a terminal on Anvil. You may do it from within Jupyter Lab. Run the following code, to ensure that the SLURM environment variables don't alter or effect our SLURM job. 
- -[source,bash] ----- -for i in $(env | awk -F= '/SLURM/ {print $1}'); do unset $i; done; ----- - -Next, let's make a copy of a pre-made operating system image. This image has Alpine Linux and a few basic tools installed, including: nano, vim, emacs, and Docker. - -[source,bash] ----- -cp /anvil/projects/tdm/apps/qemu/images/builder.qcow2 $SCRATCH ----- - -Next, we need to make `qemu` available to our shell. Open a terminal and run the following code - -[source,bash] ----- -module load qemu -# check the module loaded -module list ----- - -Next, let's launch our virtual machine with about 8GB of memory and 4 cores. Replace the "1025" with the port number that you got from question 1. - -[source,bash] ----- -qemu-system-x86_64 -vnc none,ipv4=on -hda $SCRATCH/builder.qcow2 -m 8G -smp 4 -enable-kvm -net nic -net user,hostfwd=tcp::1025-:22 & - ----- - -[IMPORTANT] -==== -- `1025` is an example port number; this needs to be replaced with your port number! -==== - -Next, it is time to connect to our virtual machine. We will use `ssh` to do this. - -[source,bash] ----- -ssh -p 1025 tdm@localhost -o StrictHostKeyChecking=no ----- - -If the command fails, try waiting a minute and rerunning the command -- it may take a minute for the virtual machine to boot up. - -When prompted for a password, enter `purdue`. Your username is `tdm` and password is `purdue`. - -Finally, now that you have a shell in your virtual machine, you can do anything you want! You have superuser permissions within your virtual machine! -For this question, submit a screenshot showing the output of `hostname` from within your virtual machine! - -==== - - -=== Question 3 (2 pts) - -.. Use `df -h` to check the disk space. -.. Use `printenv` to get the user's environment variables. Choose 2 of those environment variables and explain their meanings. -.. Use `ls` to list the files in the current directory -+ -[TIP] -==== -- You may refer to the following sample code (or create your own approach) -- If `ls` does not return anything, use `ls -la` -[source, bash] ----- -df -h -printenv -ls ----- -==== - -=== Question 4 (2 pts) -.. Write and execute a shell script that calculates the number of files in the current directory and displays the result in a formatted message. - -[TIP] -==== -[source, bash] ----- -echo 'echo "There are $(ls | wc -l) files in the current directory."' > countFiles.sh ----- - -- run the shell script - -[source, bash] ----- -chmod +x countFiles.sh -./countFiles.sh ----- -==== - - -=== Question 5 (2 pts) - -After you complete the previous questions, you can see that you can use the virtual machine just like your own computer. Now use the following steps, to use Docker within the virtual machine to create and manage a container. Run all the commands in your terminal. Copy the output to your Jupyter Lab cells. - -.. List the docker version inside the virtual machine -+ -[source, bash] ----- -docker --version ----- -+ -.. Pull the "ubuntu" image from Docker Hub -+ -[source, bash] ----- -docker pull ubuntu ----- -+ -..Run a container based on the "ubuntu" image -+ -[source, bash] ----- -docker run -it ubuntu bash ----- -+ -[NOTE] -==== -When the command runs, docker will create a container from the `ubuntu` image and run it. -==== -+ -.. Once inside the container shell, you should see the prompt changed to root@. Run the following command to install `cowsay` -+ -[source,bash] ----- -apt-get update && apt-get install -y cowsay ----- -+ -.. Now find the directory that `cowsay` locates. 
Go to that directory to run `cowsay` with following command -+ -[source,bash] ----- -./cowsay "Your greetings here :)" ----- -+ -.. Use `exit` to leave the container -+ -[source,bash] ----- -exit ----- -+ -.. List the container(s) with following command. It will provide you with a list of all of the containers that are currently running. -+ -[source, bash] ----- -docker ps -a ----- -+ -.. After you confirm that the container ran successfully, you may using following command to remove it. -+ -[source, bash] ----- -docker rm [Container_id] ----- -+ -[TIP] -==== -Replace [Container_id] with the id that you got from previous question. -==== - - - -Project 12 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project12.ipynb` -* bash file with code and comments for the assignment - ** `firstname-lastname-project12.sh` -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project13.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project13.adoc deleted file mode 100644 index b4bdd353a..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project13.adoc +++ /dev/null @@ -1,168 +0,0 @@ -= TDM 20200: Project 13 -- Spring 2024 - -**Motivation:** In project 12 we created a container by pulling the `ubuntu` image. In this project we will try to create one simple container with a Virtual machine (as a simple web application). - -**Context:** Create simple conceptual web app using `Flask` and create a container for the web app - -**Scope:** Python, Visual Studio Code, container, flask - -.Learning Objectives -**** -- Create a conceptual web application using flask on Anvil -- Create a container for the web app -- Develop skills and techniques to create a `container` -- Understand the `Flask` framework -**** - -== Readings and Resources - -[NOTE] -==== -- Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. -- https://docker-curriculum.com/[docker introduction] -- https://flask.palletsprojects.com/en/3.0.x/tutorial/[flask tutorial] -==== - -== Questions - -=== Question 1 (2 points) - -[loweralpha] -.. Start a QEMU virtual machine and connect to it via `ssh`. -+ -[TIP] -==== -- You may refer back to Project 12, where we learned the steps needed to start a virtual machine and connect to it. -==== - -=== Question 2 (2 points) - -.. Use the statement `docker pull python:3.10-slim` to pull a Python image from Docker Hub. (You do not need to run the docker image yet; just pull it for now.) -.. Create a new directory named `flaskP13App`. Inside that directory, using an editor (such `nano` or `vim`), create a simple file called `helloWorld.py` which we will later use to display a sentence in a webpage. The sentence may be simple, e.g., "Hello from Flask inside a Docker container!" 
-+ -[HINT] -==== -- This https://flask.palletsprojects.com/en/3.0.x/tutorial/[flask tutorial] should be helpful, for students who want to know more about making a Flask app. For our purposes, our `helloWorld.py` file can be as simple as the following: -[source, makefile] ----- -from flask import Flask - -app = Flask(__name__) - -@app.route('/') -def hello_world(): - return 'Hello from Flask inside a Docker container!' - -if __name__ == '__main__': - app.run(host='0.0.0.0', port=8050, debug=True) - ----- -==== -+ -[WARNING] -==== -Make SURE that your indentations look correct in your file! -==== -+ -.. Create a `requirements.txt` file; it is a list of packages or libraries needed to work on a project. For instance, your `requirements.txt` might look like this: -+ -[source, makefile] ----- -flask==2.1.1 -jinja2==3.0.2 -werkzeug==2.0.1 - ----- - -=== Question 3 (2 points) - -.. Create a file called `Dockerfile` inside the `flaskP13App` directory. -If you want a full introduction to Docker, you can read this https://docker-curriculum.com/[docker introduction], but for our purposes, the file called `Dockerfile` can just look like this: - ----- -# example of the contents of a dockerfile - -# Use an official Python runtime as a parent image -FROM python:3.8-slim - -# Set the working directory in the container -WORKDIR /home/tdm - -# Copy all of the files from the current directory into the Docker container -COPY . . - -# Install the packages from the requirements file -RUN pip install -r /home/tdm/requirements.txt - -# Install Flask -RUN pip install Flask - -# Make port 8050 available to the world outside this container -EXPOSE 8050 - -# Define environment variable -ENV NAME World - -# Run helloWorld.py when the container launches -CMD ["python", "helloWorld.py"] - ----- - - - -=== Question 4 (2 points) - -.. Build the Docker image using the command -[source,bash] ----- -docker build -t flaskapp . ----- - -(Don't forget the period at the end of the command! The period refers to the current directory.) - -There will be a couple of WARNING sentences written in red, but do not worry! - -=== Question 5 (2 points) - -.. Run the Flask App in a Docker Container using the command -[source,bash] ----- -docker run flaskapp ----- - -For the output from this question, you may choose to just take a screen shot of your terminal output, showing that everything is running successfully, for instance, like this: - -.output ----- -localhost:~/flaskP13App% docker run flaskapp - * Serving Flask app 'helloWorld' (lazy loading) - * Environment: production - WARNING: This is a development server. Do not use it in a production deployment. - Use a production WSGI server instead. - * Debug mode: on - * Running on all addresses. - WARNING: This is a development server. Do not use it in a production deployment. - * Running on http://172.17.0.2:8050/ (Press CTRL+C to quit) - * Restarting with stat - * Debugger is active! - * Debugger PIN: 967-939-308 - ----- - - -Project 13 Assignment Checklist -==== -* Jupyter Lab notebook with your code, comments and output for the assignment - ** `firstname-lastname-project13.ipynb` -* bash file with code and comments for the assignment - ** `firstname-lastname-project13.sh` - -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. 
If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project14-teachingprogramming.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project14-teachingprogramming.adoc deleted file mode 100644 index e9f635395..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project14-teachingprogramming.adoc +++ /dev/null @@ -1,60 +0,0 @@ -= TDM 20200: Project 14 -- Spring 2024 - -**Motivation:** We covered a _lot_ this year! When dealing with data driven projects, it is crucial to thoroughly explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance. As this is our final project for the semester, its primary purpose is survey based. You will answer a few questions mostly by revisiting the projects you have completed. - -**Context:** We are on the last project where we will revisit our previous work to consolidate our learning and insights. This reflection also help us to set our expectations for the upcoming semester - -**Scope:** Python, Jupyter Lab, Anvil - - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about projects submissions xref:submissions.adoc[here]. - -== Questions - - -=== Question 1 (1 pt) - -.. Reflecting on your experience working with different datasets, which one did you find most enjoyable, and why? Discuss how this dataset's features influenced your analysis and visualization strategies. Illustrate your explanation with an example from one question that you worked on, using the dataset. - -=== Question 2 (1 pt) - -.. Reflecting on your experience working with different commands, functions, modules, and packages, which one is your favorite, and why do you enjoy learning about it? Please provide an example from one question that you worked on, using this command, function, module, or package. - -=== Question 3 (2 pts) - -.. While working on the projects, including statistics and testing, what steps did you take to ensure that the results were right? Please illustrate your approach using an example from one problem that you addressed this semester. - -=== Question 4 (2 pts) - -.. Reflecting on the projects that you completed, which question(s) did you feel were most confusing, and how could they be made clearer? Please use a specific question to illustrate your points. - -=== Question 5 (2 pts) - -.. Please identify 3 skills or topics in data science areas you are interested in, you may choose from the following list or create your own list. Please briefly explain the reason you think the topics will be beneficial, with examples. - -- database optimization -- containerization -- machine learning -- generative AI -- deep learning -- cloud computing -- DevOps -- GPU computing -- data visualization -- time series and spatial statistics -- predictive analytics -- (if you have other topics that you want Dr Ward to add, please feel welcome to post in Piazza, and/or just add your own topics when you answer this question) - -Project 14 Assignment Checklist -==== -* Jupyter Lab notebook with your answers and examples. 
You may just use markdown format for all questions. - ** `firstname-lastname-project14.ipynb` -* Submit files through Gradescope -==== - -WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project14.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project14.adoc deleted file mode 100644 index 0c46ea68b..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-project14.adoc +++ /dev/null @@ -1,56 +0,0 @@ -= TDM 20200: Project 14 -- Spring 2024 - -**Motivation:** We covered a _lot_ this year! When dealing with data driven projects, it is crucial to thoroughly explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance. As this is our final project for the semester, its primary purpose is survey based. You will answer a few questions mostly by revisiting the projects you have completed. - -**Context:** We are on the last project where we will revisit our previous work to consolidate our learning and insights. This reflection also help us to set our expectations for the upcoming semester - - -== Questions - - -=== Question 1 (2 pts) - -.. Reflecting on your experience working with different datasets, which one did you find most enjoyable, and why? Discuss how this dataset's features influenced your analysis and visualization strategies. Illustrate your explanation with an example from one question that you worked on, using the dataset. - -=== Question 2 (2 pts) - -.. Reflecting on your experience working with different commands, functions, and packages, which one is your favorite, and why do you enjoy learning about it? Please provide an example from one question that you worked on, using this command, function, or package. - -=== Question 3 (2 pts) - -.. While working on the projects, including web scraping, data visualization, machine learning, and containerization, what steps did you take to ensure that the results were right? Please illustrate your approach using an example from one problem that you addressed this semester. - -=== Question 4 (2 pts) - -.. Reflecting on the projects that you completed, which question(s) did you feel were most confusing, and how could they be made clearer? Please use a specific question to illustrate your points. - -=== Question 5 (2 pts) - -.. Please identify 3 skills or topics in data science areas you are interested in. You may choose from the following list or create your own list. Please briefly explain the reason you think the topics will be beneficial, with examples. 
- -- database optimization -- containerization -- machine learning -- generative AI -- deep learning -- cloud computing -- DevOps -- GPU computing -- data visualization -- time series and spatial statistics -- predictive analytics -- (if you have other topics that you want Dr Ward to add, please feel welcome to post in Piazza, and/or just add your own topics when you answer this question) - -Project 14 Assignment Checklist -==== -* Jupyter Lab notebook with your answers and examples. You may just use markdown format for all questions. - ** `firstname-lastname-project14.ipynb` -* Submit files through Gradescope -==== - -[WARNING] -==== -_Please_ make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you _think_ you submitted, was what you _actually_ submitted. - -In addition, please review our xref:submissions.adoc[submission guidelines] before submitting your project. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-projects.adoc b/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-projects.adoc deleted file mode 100644 index 590cf94fb..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/20200/20200-2024-projects.adoc +++ /dev/null @@ -1,47 +0,0 @@ -= TDM 20200 - -== Project links - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -[%header,format=csv,stripes=even,%autowidth.stretch] -|=== -include::ROOT:example$20200-2024-projects.csv[] -|=== - -[WARNING] -==== -Projects are **released on Thursdays**, and are due 1 week and 1 day later on the following **Friday, by 11:55pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:current-projects:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== - -== Piazza - -[NOTE] -==== -Piazza links remain the same from Fall 2023 to Spring 2024. -==== - -=== Sign up - -https://piazza.com/purdue/fall2022/tdm10100[https://piazza.com/purdue/fall2022/tdm10100] - -=== Link - -https://piazza.com/purdue/fall2022/tdm10100/home[https://piazza.com/purdue/fall2022/tdm10100/home] - - -== Syllabus - -Navigate to the xref:spring2024/logistics/syllabus.adoc[syllabus]. 
diff --git a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/30200-2024-projects.adoc b/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/30200-2024-projects.adoc deleted file mode 100644 index f7bbdd880..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/30200-2024-projects.adoc +++ /dev/null @@ -1,30 +0,0 @@ -= TDM 30200 - -All of the projects are in a Jupyter notebook and contain the instructions and links to materials needed to be successful for that project. Each week we will release a folder that contains the notebook (as well as supplementary data or images). The general process to retrieve and work on each project is: - -0. Make a Seminar directory and get the fetch_projects_30200 notebook (Do This Once) -1. Use the fetch_projects Notebook to Retrieve Project Each Week -2. Finish the Project -3. Submit to Gradescope - -== 0. Make a Seminar directory and get the Fetch Project notebook (Do This Once) - -Run the following in a terminal on Anvil: - -[source,bash] ----- -mkdir -p $HOME/seminar/tdm_30200/ -cp /anvil/projects/tdm/seminar/tdm_30200/fetch_projects_30200.ipynb $HOME/seminar/ ----- - -== 1. Use the fetch_projects_30200 Notebook to Retrieve Project Each Week - -Each week you can open this notebook, go to the corresponding week, and run the bash command to retrieve the notebook. - -== 2. Finish the Project - -Once fetched, the project will appear in your home directory (in the `seminar` folder). - -== 3. Submit to Gradescope - -Go to Gradescope and submit the code like usual (see https://the-examples-book.com/projects/current-projects/submissions). diff --git a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/40200-2024-projects.adoc b/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/40200-2024-projects.adoc deleted file mode 100644 index c59fbdc07..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/40200-2024-projects.adoc +++ /dev/null @@ -1,30 +0,0 @@ -= TDM 40200 - -All of the projects are in a Jupyter notebook and contain the instructions and links to materials needed to be successful for that project. Each week we will release a folder that contains the notebook (as well as supplementary data or images). The general process to retrieve and work on each project is: - -0. Make a Seminar directory and get the fetch_projects_40200 notebook (Do This Once) -1. Use the fetch_projects Notebook to Retrieve Project Each Week -2. Finish the Project -3. Submit to Gradescope - -== 0. Make a Seminar directory and get the Fetch Project notebook (Do This Once) - -Run the following in a terminal on Anvil: - -[source,bash] ----- -mkdir -p $HOME/seminar/tdm_40200/ -cp /anvil/projects/tdm/seminar/tdm_40200/fetch_projects_40200.ipynb $HOME/seminar/ ----- - -== 1. Use the fetch_projects_40200 Notebook to Retrieve Project Each Week - -Each week you can open this notebook, go to the corresponding week, and run the bash command to retrieve the notebook. - -== 2. Finish the Project - -Once fetched, the project will appear in your home directory (in the `seminar` folder). - -== 3. Submit to Gradescope - -Go to Gradescope and submit the code like usual (see https://the-examples-book.com/projects/current-projects/submissions). 
diff --git a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/chatbot-teachingprogramming.adoc b/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/chatbot-teachingprogramming.adoc deleted file mode 100644 index 8cbc52efd..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/chatbot-teachingprogramming.adoc +++ /dev/null @@ -1,50 +0,0 @@ -= Chatbots - -Chatbots are applications that communicate in human languages. These applications vary from simple hardcoded textual responses to voice activated, neural network systems that dynamically respond to a situation and learn from the speakers (think Alexa). - -== Common Applications - -Chatbots are very common nowadays. From customer support systems, to front end recommender systems, websites and more, chatbots are also one technology that has found a use case in nearly every industry. Some of the most common industries that use chatbots include: - -- Retail -- Banking -- Customer Service -- Travel - -== A Brief History - -The Turing Test, presented in 1950, claimed that a machine would have intelligent behavior if, when asked a series of questions, it could respond like a human would. This amounts to a chatbot. In the 1960's, a chatbot called ELIZA seemed to meet this standard; although no one doubted that it was not conscious. Since then, we all can imagine the various chatbots that have found their way into our daily lives, from automated customer service systems, to simple self service phone agents, to Alexa, or even games. - -== Resources - -All resources are chosen by Data Mine staff to be of decent quality, and most if not all content is free. - -=== Videos - -https://www.youtube.com/watch?v=dvOnYLDg8_Y&list=PLQVvvaa0QuDdc2k5dwtDTyT9aCja0on8j[Creating a Chatbot with Deep Learning, Python and Tensorflow] - -https://www.youtube.com/watch?v=o9-ObGgfpEk[What is a Chatbot? 
(IBM, ~10 minutes)] - -https://www.youtube.com/watch?v=1lwddP0KUEg[Intelligent AI Chatbot in Python (~35 minutes)] - -https://www.youtube.com/watch?v=c_gXrw1RoKo[Build your own chatbot using Python | Python Tutorial for Beginners in 2022 (Great Learning, ~1 hour)] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170476157001081[Building Intelligent Chatbots Using AWS (2019)] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170475084401081[Building Chatbots Using Google Dialogflow (2019)] - -=== Books - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99169492106401081[Building chatbots with Python using natural language processing and machine learning (2019)] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/5imsd2/cdi_proquest_journals_1818658254[The return of the chatbots (2016)] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/5imsd2/cdi_askewsholts_vlebooks_9783030042998[Developing Enterprise Chatbots: Learning Linguistic Structures (2019)] - -=== Articles - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/5imsd2/cdi_doaj_primary_oai_doaj_org_article_a4a77aac8cc844c98f259227899d7659[Building a Chatbot System to Analyze Opinions of English Comments (2023)] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/5imsd2/cdi_arxiv_primary_2001_00100[Building chatbots from large scale domain-specific knowledge bases: challenges and opportunities (2019)] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/5imsd2/cdi_arxiv_primary_1710_00689[Building Chatbots from Forum Data: Model Selection Using Question Answering Metrics (2017)] diff --git a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/deploy-and-access-teachingprogramming.adoc b/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/deploy-and-access-teachingprogramming.adoc deleted file mode 100644 index 07199d2ee..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/deploy-and-access-teachingprogramming.adoc +++ /dev/null @@ -1,239 +0,0 @@ -= Prodigy Annotation Tool - -== Requesting Access for Your Team - -Prodigy is inherently best used for single-annotator and single-machine deployments. In other words, you install Prodigy locally to use it, alone, on annotating your data. The issue arises when you have one corpus of data to annotate, and >1 annotator. If each team member deploys their own instance of Prodigy, it requires more licenses, handling dependencies and installation many times, and distrubuting and allocating work becomes a nightmare to avoid duplicate data. - -Prodigy has attempted to address the multi-annotator concern with Prodigy Teams. Explosion AI (the makers of spaCy and Prodigy) describe this platform as the following: - -____ -Most annotation projects need to start with relatively few annotators, to make sure the annotation scheme and onboarding process allows high inter-annotator consistency. Once you have your annotation process running smoothly, there are a few options for scaling up your project for more annotators. One option we recommend is to divide up the annotation work so that each annotator only needs to deal with a small part of the annotation scheme. For instance, if you’re working with many labels, you would start a number of different Prodigy services, each specifying a different label, and each advertising to a different URL. 
Prodigy can be easily run under automation, for instance within a Kubernetes cluster, to make this approach more manageable. If you do want to have multiple annotators working on one feed, Prodigy has support for that as well via named multi-user sessions. You can create annotator-specific queues using query parameters, or use the query parameters to distinguish the work of different annotators so you can run inter-annotator consistency checks. -____ - -However, the product is not yet available outside of beta testing. Therefore, we found a scalable workaround for us to use in the meantime! That solution will be explained in detail in the following sections. - -*In short, to gain access to Prodigy, you must contact the Data Mine (datamine@purdue.edu). When contacting please include the following details to help ensure you have the right solution:* - -* Your name -* Brief description of your project (< 250 words) -* Data description: - ** What kind of data - ** Size of dataset - ** Data governance and privacy/security concerns -* How many annotators will need access and their Purdue email aliases (e.g., mine is `gould29`) -* The CLI command to run for your prodigy session _OR_ the recipe you'd like to use, and any other parameters for it (e.g., NER on a base English SpaCy model to label for x, y, z) - -This information will help me (Justin) determine how you can best leverage Prodigy. - -== How does Purdue's Multi-annotator Solution Work? - -The workaround developed at Purdue University leverages Docker and Kubernetes on the https://www.rcac.purdue.edu/compute/geddes[Geddes] research cluster. Below is a high-level breakdown of the annotation infrastructure and NLP ecosystem at Purdue: - -* Base NLP Docker Image (`geddes-registry.rcac.purdue.edu/tdm/tdm/nlp:latest`) - ** This image, on the Harbor Registry, runs Python 3.6.9 and includes standard NLP Python packages, such as Tensorflow, PyTorch, NLTK, SpaCy, Stanford CoreNLP, Prodigy, and more. It also includes Jupyter Lab for NLP development and integration with GitHub via CLI. - ** The Data Mine has its own project on the repository, called `tdm`. It is public, and you can access the base NLP image on the https://geddes-registry.rcac.purdue.edu/harbor/sign-in?redirect_url=%2Fharbor%2Fprojects[Harbor Registry]. - -* TDM Namespace on Kubernetes - ** If you don't know what Kubernetes is, please visit my https://the-examples-book.com/starter-guides/data-engineering/containers/kubernetes[example in The Examples Book], and/or follow along my https://github.com/TheDataMine/geddes-kubernetes-deployment[example on GitHub], where you walk through deploying a Python web app and REST endpoints. - ** This namespace allows The Data Mine to allocate resources, memory, and storage to running applications on the Geddes K8s cluster. We access this through the web interface on https://beta.geddes.rcac.purdue.edu/c/local/storage/persistent-volumes[Rancher]. - ** In short, each annotating team will share a K8s deployment of Prodigy (deployed and managed by Justin Gould, under the `tdm` namespace) and have its own SQLite database to store their annotated data. - ** You will receive an endpoint to your Prodigy deployment to access. Your data will be loaded and ready to annotate, and set up in a way in which no annotator will annotate the same data. Prodigy natively handles the distribution of unannotated data to active users on the instance. 
- -== Base NLP Docker Image - -Specifically, the packages included are: -``` -absl-py==0.13.0 -aiofiles==0.7.0 -anyio==3.3.0 -argon2-cffi==20.1.0 -asn1crypto==0.24.0 -astunparse==1.6.3 -async-generator==1.10 -attrs==21.2.0 -Babel==2.9.1 -backcall==0.2.0 -bleach==4.0.0 -blis==0.7.4 -cached-property==1.5.2 -cachetools==4.2.2 -catalogue==2.0.5 -certifi==2021.5.30 -cffi==1.14.6 -charset-normalizer==2.0.4 -clang==5.0 -click==7.1.2 -contextvars==2.4 -corenlp-protobuf==3.8.0 -cryptography==2.1.4 -cymem==2.0.5 -dataclasses==0.8 -decorator==5.0.9 -defusedxml==0.7.1 -en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0.tar.gz -entrypoints==0.3 -fastapi==0.68.0 -flatbuffers==1.12 -gast==0.4.0 -gensim==4.0.1 -google-auth==1.34.0 -google-auth-oauthlib==0.4.5 -google-pasta==0.2.0 -grpcio==1.39.0 -h11==0.12.0 -h5py==3.1.0 -idna==3.2 -immutables==0.16 -importlib-metadata==4.6.3 -ipykernel==5.5.5 -ipython==7.16.1 -ipython-genutils==0.2.0 -jedi==0.18.0 -Jinja2==3.0.1 -joblib==1.0.1 -json5==0.9.6 -jsonschema==3.2.0 -jupyter-client==6.1.12 -jupyter-core==4.7.1 -jupyter-server==1.10.2 -jupyterlab==3.1.7 -jupyterlab-pygments==0.1.2 -jupyterlab-server==2.7.0 -keras==2.6.0 -Keras-Preprocessing==1.1.2 -keyring==10.6.0 -keyrings.alt==3.0 -Markdown==3.3.4 -MarkupSafe==2.0.1 -mistune==0.8.4 -murmurhash==1.0.5 -nbclassic==0.3.1 -nbclient==0.5.4 -nbconvert==6.0.7 -nbformat==5.1.3 -nest-asyncio==1.5.1 -nltk==3.6.2 -notebook==6.4.3 -numpy==1.19.5 -oauthlib==3.1.1 -opt-einsum==3.3.0 -packaging==21.0 -pandas==1.1.5 -pandocfilters==1.4.3 -parso==0.8.2 -pathy==0.6.0 -peewee==3.14.4 -pexpect==4.8.0 -pickleshare==0.7.5 -plac==1.1.3 -preshed==3.0.5 -prodigy @ file:///workspace/prodigy-1.11.0-cp36-cp36m-linux_x86_64.whl -prometheus-client==0.11.0 -prompt-toolkit==3.0.19 -protobuf==3.17.3 -ptyprocess==0.7.0 -pyasn1==0.4.8 -pyasn1-modules==0.2.8 -pycorenlp==0.3.0 -pycparser==2.20 -pycrypto==2.6.1 -pydantic==1.8.2 -Pygments==2.10.0 -PyGObject==3.26.1 -PyJWT==2.1.0 -pyparsing==2.4.7 -pyrsistent==0.18.0 -python-apt==1.6.5+ubuntu0.6 -python-dateutil==2.8.2 -pytz==2021.1 -pyxdg==0.25 -pyzmq==22.2.1 -regex==2021.8.3 -requests==2.26.0 -requests-oauthlib==1.3.0 -requests-unixsocket==0.2.0 -rsa==4.7.2 -scipy==1.5.4 -SecretStorage==2.3.1 -Send2Trash==1.8.0 -six==1.15.0 -smart-open==5.1.0 -sniffio==1.2.0 -spacy==3.1.1 -spacy-legacy==3.0.8 -srsly==2.4.1 -stanford-corenlp==3.9.2 -starlette==0.14.2 -tensorboard==2.6.0 -tensorboard-data-server==0.6.1 -tensorboard-plugin-wit==1.8.0 -tensorflow==2.6.0 -tensorflow-estimator==2.6.0 -tensorflow-hub==0.12.0 -termcolor==1.1.0 -terminado==0.11.0 -testpath==0.5.0 -thinc==8.0.8 -toolz==0.11.1 -tornado==6.1 -tqdm==4.62.1 -traitlets==4.3.3 -typer==0.3.2 -typing-extensions==3.7.4.3 -urllib3==1.26.6 -uvicorn==0.13.4 -uvloop==0.14.0 -wasabi==0.8.2 -wcwidth==0.2.5 -webencodings==0.5.1 -websocket-client==1.2.1 -Werkzeug==2.0.1 -wrapt==1.12.1 -zipp==3.5.0 -``` - -To pull and use this image, use the following command: -```console -docker pull geddes-registry.rcac.purdue.edu/tdm/tdm/nlp@sha256:e018359afee1f9fb56b2924d27980483981680b38a64c69472c5f4838c0c6edc -``` - -This Docker image essentially sets the stage for NLP work at Purdue. It includes almost anything you need to get started. Users are more than welcome (and encouraged!) to use this as a starting point and reference it in your own project-specific Docker images. As it stands, this Docker image is configured to support Prodigy 1.11, as of August 2021. 
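-
-Once inside the container (for example, from its Jupyter Lab), a quick sanity check like the sketch below can confirm that the pinned packages resolved as expected. This is only an illustration; the printed versions should match the list above if you pulled the image unmodified.
-
-[source,python]
-----
-# quick environment check inside the base NLP image
-import nltk
-import spacy
-import tensorflow as tf
-
-print("spaCy:", spacy.__version__)        # expected 3.1.1 per the pinned list
-print("TensorFlow:", tf.__version__)      # expected 2.6.0 per the pinned list
-print("NLTK:", nltk.__version__)
-
-# the small English model ships with the image, so no download should be needed
-nlp = spacy.load("en_core_web_sm")
-print(nlp("The Data Mine uses Prodigy for annotation.").ents)
-----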
- -== Kubernetes Deployment - -As stated, each requesting team will have their own deployment of Prodigy. These will live under the Data Mine namespace on Rancher. Furthermore, each team will have their own SQLite database, to ensure security of data. - -.A few important changes to note on how I deploy these instances of Prodigy: -. I reference the NLP base Docker image in my workflow on Kubernetes -. Under `ENVIRONMENT VARIABLES`, I make the following changes: - * `PRODIGY_HOME=/workspace/.prodigy` - ** This sets the home location of Prodigy to The Data Mine's namespace's storage volume (i.e., this is where you can find the standard config file and SQLite databases; access is restricted to Data Mine staff only) - * `PRODIGY_ALLOWED_SESSIONS=alias,alias,alias,...` - ** Define comma-separated string names of multi-user session names that are allowed in the app. I will set this to the Purdue aliases. Only THESE individuals are permitted to access the annotator tool. - ** *NOTE: You must add `?session=YOUR_PURDUE_ALIAS` to the end of the provided endpoint. Failure to do so will result in an error and no access to data.* - *** For example, `http://172.21.160.164:9000/` becomes `http://172.21.160.164:9000?session=ALIAS` - * `PRODIGY_CONFIG_OVERRIDES= {"feed_overlap" : false,"port" : 9000, "host" : "0.0.0.0", "db_settings": {"sqlite": {"name": "team_name.db","path": "/workspace/.prodigy"}}}` - ** JSON object with overrides to apply to config. I use this to specify a new database for each team. By default, there is one SQLite database (`prodigy.db`). However, we want each team to have its own database; therefore, we must dynamically change the configuration file for each team, requiring an override. - ** Let's break down this config override... - *** `feed_overlap` as `false`: The `feed_overlap` setting in your prodigy.json or recipe config lets you configure how examples should be sent out across multiple sessions. If true, each example in the dataset will be sent out once for each session, so you’ll end up with overlapping annotations (e.g. one per example per annotator). Setting `feed_overlap` to false will send out each example in the data once to whoever is available. As a result, your data will have each example labelled only once in total. - **** TL;DR: Prevents duplicate annotation data - *** `port` AS `9000`: Changes the default port from `8080` to `9000`. - *** `host` AS `"0.0.0.0`: Prodigy sets the host as `localhost` by default. The default is normally the IP address assigned to the "loopback" or local-only interface. However, because we are deploying our instance to Kubernetes for orchestration, we need an agnostic IP address. In short, `0.0.0.0` means means "listen on every available network interface." This is required for deployment. - *** `"db_settings": {"sqlite": {"name": "team_name.db","path": "/workspace/.prodigy"}}` - **** In short, we want a new database for each team. I specify the path to where the database either exists or SHOULD exist (if does not currently). The name of the database will become the name of the team requesting the space. Should the database exist, it will simply point to it and connect users. If it does not exist, upon deployment of the K8s pod, it will be created. - -The way the multi-annotator approach here works is that a dataset will be saved in the SQLite database with `-ALIAS` after the name of the dataset specified in the launch command (handled by Justin). 
For example, let's imagine annotators kamstut, gould29, and srodenb are collaborating on an NER project for Purdue. I would use a command like below as the deployment command to launch the Prodigy instance on Geddes: - -```console -prodigy ner.manual purdue_ner_dataset blank:en /workspace/data/tdm/TEAM_DATA.jsonl --label PERSON,ORG,PROD,LOC -``` - -Where the dataset name is `purdue_ner_dataset`, the data to annotate are `/workspace/data/tdm/TEAM_DATA.jsonl` for the following labels: `PERSON,ORG,PROD,LOC`. - -This means that in our database, if all 3 annotators annotate data, we will have datasets that look like: - -* `purdue_ner_dataset-kamstut` -* `purdue_ner_dataset-srodenb` -* `purdue_ner_dataset-gould29` - -I have a https://github.com/TheDataMine/annotations-infrastructure[script available on GitHub] to comine multiple annotators' datasets into one, so you can leverage all the work with pre-built training recipes and commands. \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/jax-teachingprogramming.adoc b/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/jax-teachingprogramming.adoc deleted file mode 100644 index dc1630d26..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/jax-teachingprogramming.adoc +++ /dev/null @@ -1,129 +0,0 @@ -= JAX - -== Overview & References - -JAX is a Google research project built upon native Python and NumPy functions to improve machine research learning. The https://github.com/google/jax[official JAX page] describes the core of the project as "an extensible system for composable function transformations," which means that JAX takes the dynamic form of Python functions and converts them to JAX-based functions that work with gradients, backpropogation, just-in-time compiling, and other JAX augmentations. - -[NOTE] -==== -JAX deals with more complex ideas such as neural networks and XLA, which are based in linear algebra and compilers, topics that are more advanced than much of what we cover in projects. The following is a list of incredibly useful resources for learning the foundations of JAX. - -- https://github.com/google/jax[The GitHub JAX page]. We linked this earlier, but it's your best starting point. Everything you need to understand the project is here and you can branch into Autograd, XLA, neural networks, TPUs, and anything else you might want to understand. Here's a https://www.youtube.com/watch?v=0mVmRHMaOJ4[Google Cloud Tech YouTube video] that talks through the content of the GitHub page. - -- https://www.youtube.com/watch?v=WdTeDXsOSj4[This TensorFlow video] provides a slower, in-depth look at _how_ the important features of JAX operate. - -- There's an excellent video series on neural networks and deep learning from https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi[3Blue1Brown] that explains how linear algebra creates the foundation for neural networks which, for our purposes, explains why `grad` is so important when using JAX. - -- As always, https://jax.readthedocs.io/en/latest/index.html[library documentation] is integral to understanding the inner workings and specifics of your code. 
-==== - -{sp}+ - -=== Basic Deep Learning: JAX Edition - -Google lists the following code at the top of their JAX page: - -[source,python] ----- -import jax.numpy as jnp -from jax import grad, jit, vmap - -def predict(params, inputs): - for W, b in params: - outputs = jnp.dot(inputs, W) + b - inputs = jnp.tanh(outputs) # inputs to the next layer - return outputs # no activation on last layer - -def loss(params, inputs, targets): - preds = predict(params, inputs) - return jnp.sum((preds - targets)**2) - -grad_loss = jit(grad(loss)) # compiled gradient evaluation function -perex_grads = jit(vmap(grad_loss, in_axes=(None, 0, 0))) # fast per-example grads ----- - -This short example provides the two main functions of a deep learning algorithm, `predict` and `loss`, adapted for JAX functionality. We'll break down the code segment as an entry analysis of both JAX and deep learning: - -- `jax.numpy` is JAX's adapted version of the NumPy API, created to prevent standard NumPy functionality from breaking JAX functions when the two packages differ. Make sure to use `jax.numpy` functions instead of regular `numpy` functions. -- `jax` is the main library, from which important functions like `grad`, `jit`, `vmap`, and `pmap` are used. -- `predict` simulates the neural network's predictions based on the dot product of the weights and activation values added to the biases, all of which are given in the `params` parameter. The next layer of neurons is then calculated using the current layer, eventually returning the last layer when `params` is fully processed. -- `loss` uses standard mean-squared error loss calculation, using the current `predictions` and comparing them with `targets` that the user defines. - -This mirrors standard NumPy deep learning very closely, but JAX shortens the runtime in very important ways which we soon describe. - -{sp}+ - -=== Runtime Optimization - -==== `jit` - -Autograd and XLA are the two fundamental components of JAX, with XLA (accelerated linear algebra) handling the runtime and compiling aspects of JAX. Take the following example, adapted from the JAX page: - -[source,python] ----- -def slow_f(x): - # Element-wise ops see a large benefit from fusion - return x * x * x + x * 2.0 * x + x - -x = jnp.ones((2000, 2000)) -fast_f = jit(slow_f) -%timeit -n10 -r3 fast_f(x) -%timeit -n10 -r3 slow_f(x) ----- - ----- -3.97 ms ± 2.53 ms per loop (mean ± std. dev. of 3 runs, 10 loops each) -52.1 ms ± 1.83 ms per loop (mean ± std. dev. of 3 runs, 10 loops each) ----- - -JAX is designed to work with CPUs, GPUs, and TPUs, each a quicker processor than the last. THe example output comes from the most basic CPU setup, and JAX's `jit` function still ran significantly faster than the native Python function. - -The discussion around compile times and runtimes seems like an arbitrary conversation when we're dealing with small datasets -- who cares if my code executes in 5 milliseconds instead of 15? This optimization, however, is vital for neural networks. - -Consider a simple deep learning task of identifying a lowercase letter from an image with 36x36 pixel resolution. The input layer would have 36 * 36 = 1296 neurons and the output layer would have 26 neurons, one for every letter. Without any hidden layers, we're already over 33,000 connections, and in reality, we'd need hidden layers for determining tiny parts to letters, patterns, or some other method for transitioning between image and output. 
A program that might take an hour on a standard system might now take 30 seconds using TPUs and `jit` compiling -- now the conversation is not arbitrary. - -{sp}+ - -==== `vmap` - -`vmap` is a function that provides "auto-vectorization" for whatever batch you have. Batches are essentially variably-sized samples of your population of training data used in one iteration, after which the model is updated. Imagine the simple solution of looping through every image in your batch, resulting in a vector with the activation values of the image. This vector is then multiplied by the model matrix, resulting in a different matrix. This process works, but it is incredibly slow, as a different intermediate matrix is created with each iteration. - -By using `vmap`, loops are pushed to the most primitive level possible. This speeds up compilation time as iterating over simple elements is quicker than the same with complex elements. For our purposes, this means that the activation vectors are compiled as an activation matrix -- as Google puts it, "at every layer, we're doing matrix-matrix multiplication rather than matrix-vector multiplication." - -The code for this has a unique format. Pay close attention to the following implementation: - -[source,python] ----- -from jax import vmap -predictions = vmap(partial(predict, params))(input_batch) -# or, alternatively -predictions = vmap(predict, in_axes=(None, 0))(params, input_batch) ----- - -`vmap` wraps the `predict` function in parentheses, _then_ takes the parameters and/or input batch wrapped in another set of parentheses. - -{sp}+ - -=== Autodifferentiation - -If you recall the XLA-Autograd duo that composed JAX, autodifferentiation comes from Autograd and shares its API. JAX uses `grad` for calculating gradients, which allows for differentiation to any order. - -We'll recontextualize why this matters for machine learning. The goal of any good model is to reduce the error present -- we obviously want the model to be _good_ at predicting things, otherwise there's no point. The gradient of a function, in this case the error, will indicate the _direction to move_ to minimize the function. In other words, in any-dimensional space, the gradient will tell us which weights in the model need adjusting. - -Once you understand the importance of gradients, the function implementation becomes trivial -- it just takes a number as a parameter to evaluate the gradient at that point. Google gives the example of the hyperbolic tangent function, and we get the following results after using `grad`: - -[source,python] ----- -def tanh(x): # Define a function - y = jnp.exp(-2.0 * x) - return (1.0 - y) / (1.0 + y) - -grad_tanh = grad(tanh) -print(grad_tanh(2.0)) ----- - ----- -0.07065082 ----- - -And that's it! Combining all of the features we've shown will give you a great leap into your machine learning project, and it's all streamlined to make the code easier to follow. 
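-
-As a small follow-on sketch (our own illustration, assuming only a standard JAX install), `grad` can be nested for higher-order derivatives, and `jit(vmap(...))` evaluates the derivative over a whole batch of points at once, tying together the transformations covered above.
-
-[source,python]
-----
-import jax.numpy as jnp
-from jax import grad, jit, vmap
-
-def tanh(x):
-    y = jnp.exp(-2.0 * x)
-    return (1.0 - y) / (1.0 + y)
-
-d_tanh = grad(tanh)          # first derivative
-d2_tanh = grad(grad(tanh))   # grad nests, giving the second derivative
-
-# vectorize the derivative over a batch of points and compile it with XLA
-d_tanh_batch = jit(vmap(d_tanh))
-
-print(d_tanh(2.0))                              # ~0.07065082, as above
-print(d2_tanh(2.0))                             # second derivative at the same point
-print(d_tanh_batch(jnp.linspace(-3.0, 3.0, 7)))
-----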
\ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/neural-network-deep-learning-teachingprogramming.adoc b/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/neural-network-deep-learning-teachingprogramming.adoc deleted file mode 100644 index fcfac24b3..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/neural-network-deep-learning-teachingprogramming.adoc +++ /dev/null @@ -1,113 +0,0 @@ -= Neural Networks & Deep Learning - -Neural networks are a type of modeling technique that was inspired by neuronal transmission in animal brains. Neural networks are often deployed when the suspected solution is nonlinear; that is to say, a line probably won't represent the expected answer. Deep learning is neural networks with two or more layers. - -== Common Applications - -=== Industries - -- Healthcare -- Tech -- Retail -- Cybersecurity -- Any industry with lots of data, which is increasingly turning into every industry - -=== Problem Types - -- Computer Vision -- Classification Problems -- Speech & Natural Language Processing -- High Dimensional Data - -=== A Brief History - -One of the original inspirations for neural networks was classifying images, something that is relatively easy for humans to do, but turned out to be pretty hard for us to figure out how computers can do. Research on neural networks began in the 1950's, in forms very different from what we work with today; the so called "AI winter" soon began after, and other tools saw large usage and arguably were more powerful (and far simpler) at the time. Approximately in the late 80's, neural networks started to see renewed interest; this catapulted in the early 00's with tech companies like Google and Facebook (with incredible amounts of data) and reignited research that overcame some of the early hurdles. - -== Code Examples - -NOTE: All of the code examples are written in Python, unless otherwise noted. - -=== Containers - -TIP: These are code examples in the form of Jupyter notebooks running in a container that come with all the data, libraries, and code you'll need to run it. https://the-examples-book.com/starter-guides/data-engineering/containers/using-data-mine-containers[Click here to learn why you should be using containers, along with how to do so.] - -TIP: Quickstart: https://docs.docker.com/get-docker/[Download Docker], then run the commands below in a terminal. - -==== Neural Nets Intro: Handwritten Digit Image Classification - -This classic neural network introductory example uses computer vision to classify the handwritten digits. - -[source,bash] ----- -#pull container, only needs to be run once -docker pull ghcr.io/thedatamine/starter-guides:neural-nets-intro - -#run container -docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:neural-nets-intro ----- - -Need help implementing any of this code? Feel free to reach out to mailto:datamine-help@purdue.edu[datamine-help@purdue.edu] and we can help! - -== Resources - -The content here is hand selected by Data Mine staff, and all of it is free for Purdue students (including the book links); most of it should be free for National Data Mine students as well (check your school's digital library resources for the books). 
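-
-If you would like to see, in code, roughly what the handwritten-digit "hello world" from the container above looks like, the sketch below is a minimal Keras version. It is our own illustration and is not necessarily the exact notebook shipped in the container.
-
-[source,python]
-----
-import tensorflow as tf
-
-# the classic handwritten digit dataset: 28x28 grayscale images, labels 0-9
-(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
-x_train, x_test = x_train / 255.0, x_test / 255.0   # normalize pixel values to [0, 1]
-
-model = tf.keras.Sequential([
-    tf.keras.layers.Flatten(input_shape=(28, 28)),    # 784 input values per image
-    tf.keras.layers.Dense(128, activation="relu"),    # one small hidden layer
-    tf.keras.layers.Dense(10, activation="softmax"),  # one output per digit
-])
-
-model.compile(optimizer="adam",
-              loss="sparse_categorical_crossentropy",
-              metrics=["accuracy"])
-model.fit(x_train, y_train, epochs=3, validation_split=0.1)
-print(model.evaluate(x_test, y_test))
-----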
- -=== Videos - -https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/video-lecture[Google (~3 minutes)] - -https://www.youtube.com/watch?v=aircAruvnKk[3 Blue, 1 Brown (~20 minutes)] - -https://www.youtube.com/watch?v=jmmW0F0biz0[IBM (~5 minutes)] - -=== Websites - -https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/anatomy[Brief Google guide on the motivation for neural networks, with great visual representation] - -https://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html[Brief introduction from UW-Madison] - -https://www.mathworks.com/discovery/neural-network.html[Mathworks (MATLAB) introduction to neural networks] - -https://www.tensorflow.org/tutorials/[TensorFlow Tutorials] - -=== Books - -https://www.statlearning.com[Introduction to Statistical Learning (ISL)] - -Also known as the "machine learning bible", this book is very well known and highly recommended and very clear, *and has numerous code projects attached with each chapter*. Chapter 10 is the deep learning chapter. - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/kov9gv/alma99169839657501081[Introduction to Deep Learning and Neural Networks with Python] - -This book is a gentle introduction to neural networks with plenty of examples, and also includes documentation on how to get your coding environment set up too. - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/kov9gv/alma99169573376001081[Neural Networks and Statistical Learning] - -This book is great for people who have experience with neural networks, but want to get a better feel for the math/theory background. A calculus and linear algebra background is necessary to make sense of this book. - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/kov9gv/alma99169793279001081[Neural Networks] - -Gentle introduction; good for visual learners. 
- -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207647701081[Strengthening Deep Neural Networks] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170253257501081[Fundamentals of Deep Learning] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/ufs51j/alma99170208650601081[Deep Learning] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170491905401081[Generative Deep Learning] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207503001081[Deep Learning From Scratch] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207656001081[Deep Learning Cookbook] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170208550801081[Deep Learning For Coders] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207842401081[Grokking Deep Learning] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207842801081[Deep Learning and the Game of Go] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170208150901081[Tensorflow for Deep Learning] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207199401081[Learning TensorFlow] - -https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207722701081[Practical Deep Learning for Cloud, Mobile and Edge] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/preprocessing-teachingprogramming.adoc b/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/preprocessing-teachingprogramming.adoc deleted file mode 100644 index 2d6567bd1..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/preprocessing-teachingprogramming.adoc +++ /dev/null @@ -1,128 +0,0 @@ -= Preprocessing -:page-mathjax: true - -Preprocessing is getting data ready for analysis. xref:data-modeling/process/wrangling.adoc[We've gotten ahold of our data], xref:data-modeling/process/eda.adoc[looked at it to confirm its approximate properties and condition], xref:data-modeling/process/think-output.adoc[and thought about what our output should look like]. Sometimes, there is extensive preprocessing that has to be done; other times this step could almost be skipped over. Here, our goal is to get data ready so that when we tell our model to train, it has the data cleaned, in the right format, right shape, etc so it can train correctly. - -== Common Preprocessing Tasks - -Below is a list of possible actions you may need to do during preprocessing along with a brief description. - -=== Data Cleaning - -==== Missing Values - -During EDA we might discover missing/NaN/null data. Among many possible choices, we can remove rows which have missing data; we can remove whole columns that have little data; or we can *impute* the values, where we estimate what they likely will be depending on the values of the data we do have. Check out the https://pandas.pydata.org/docs/user_guide/missing_data.html[pandas guide on dealing with missing data], or https://scikit-learn.org/stable/modules/impute.html[Scikit-Learn documentation on imputation] to learn about ways you can deal with missing values. - -==== Data Formats - -Did we discover that our data format might cause some problems for our xref:data-modeling/process/think-output.adoc[output]? Now is the time to fix any data format issues. 
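-
-As a purely hypothetical illustration of the cleaning steps above, the sketch below drops or imputes missing values with pandas and scikit-learn, then repairs a column that was read in the wrong format. The file and column names are made up for the example.
-
-[source,python]
-----
-import pandas as pd
-from sklearn.impute import SimpleImputer
-
-df = pd.read_csv("flights.csv")   # hypothetical dataset, for illustration only
-
-# option 1: simply drop any rows that contain missing values
-df_dropped = df.dropna()
-
-# option 2: impute a numeric column with its mean instead of dropping rows
-imputer = SimpleImputer(strategy="mean")
-df[["arrival_delay"]] = imputer.fit_transform(df[["arrival_delay"]])
-
-# fix a format issue: a date column that was read in as plain text
-df["flight_date"] = pd.to_datetime(df["flight_date"])
-----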
- -==== Data Types - -During EDA, we should have looked at all the feature data types to ensure they make sense. For instance, a feature for zip codes might be fine using an int type, until a zip code with a dash (-) in it gets added- at which point it should be a string. We have to make sure all our data types make sense both now, and with how they will get used further on down the line. - -=== Normalization - -Normalization is the process of setting all your numerical values on a similar scale, often 0 to 1 (or less commonly, -1 to 1). This improves the performance and stability of training a model. As an example, with image data, often you will see all the numbers in the image array divided by 255.0. This effectively puts all the numerical values in the image array between 0 and 1, and hence is normalization. - -https://developers.google.com/machine-learning/data-prep/transform/normalization[This Google article has a list of normalization techniques]. Most often, scaling to a range is used. - -=== Cross Validation - -Once the data is cleaned and in the right format, we can set up cross validation, such as with xref:data-modeling/resampling-methods/cross-validation/train-valid-test.adoc[training, validation, and testing splits]. - -=== Augmentation - -Data augmentation is a technique used to prevent xref:data-modeling/general-principles/bias-variance-tradeoff.adoc[overfitting]. The general idea with augmentation is to modify our data in simple ways to help the model training generalize. For instance, augmentation is commonly used in computer vision problems, such as classifying images of animals: here, it may make sense to randomly flip all the images, because we want our model to detect the animals, not the landscape orientation. Some common augmentation steps include: - -Common Visual Augmentations: - -- Random Rotation -- Flipping (Horizontal/Vertical) -- Color Channel Conversion (such as RGB to grayscale) -- Random Brightness -- Random Cropping -- Random Stretching -- Random Contrast -- Random Deletion (remove random pixels from the data) -- Filter Applications (such as https://en.wikipedia.org/wiki/Sobel_operator[Sobel Filters]) - -Common Text Augmentations: - -- Random Insertion (randomly insert words) -- Random Deletion (randomly delete words) -- Shuffling (Shuffle words/sentences randomly) -- Synonym Replacement (replace words with synonyms) -- Paraphrasing (state the sentence in different words) - -Common Audio Augmentations: - -- Random Noise Injection (add random noises in) -- Change Speed (faster or slower) -- Random Pitch (randomly change the pitch) - -Many data science packages have data augmentation functions built into them, for instance https://www.tensorflow.org/tutorials/images/data_augmentation[processing images with Tensorflow], https://librosa.org/doc/main/index.html[processing audio data with Librosa], or https://nlpaug.readthedocs.io[nlpaug for both audio and text augmentation]. - -=== Feature Engineering/Selection - -We have all of these features/variables/dimensions/etc, but do we really need them? Feature engineering is all about figuring out which features matter, and removing the ones that don't. 
There are many ways to select features: - -- During EDA you might have noticed many missing values in some columns, and decided its better to remove them -- You use PCA (xref:data-modeling/general-principles/curse-of-dimensionality.adoc[see the power of PCA notebook example for a demonstration of how to use PCA]) to order features based on how much they vary (and/or remove them based off of their order) -- During EDA you discover that some of these variables are irrelevant to your output/model building/analysis -- During EDA you realize you have no way of figuring out what some of the features mean, so you remove them - -How you go about selecting features is up to you and contingent on the problem itself. Two people with the same data and ultimate goal can select wildly different features and still produce valuable insights! - -=== Regularization (L1 and L2) - -There are two different kinds of regularization, so called $L_1$ and $L_2$ regularization; they are intended to penalize complex models. They differ in how they penalize the weights in the model: - -- $ L_1 $ penalizes $ |weight| $ -- $ L_2 $ penalizes $ weight^2 $ - -$L_1$ https://developers.google.com/machine-learning/crash-course/regularization-for-sparsity/l1-regularization[is often used] for sparsity or xref:data-modeling/general-principles/curse-of-dimensionality.adoc[Curse of Dimensionality]. - -$L_2$ https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/l2-regularization[is often used] to prevent xref:data-modeling/general-principles/bias-variance-tradeoff.adoc[overfitting]. - -=== Data Shapes - -Here, we make sure the data is in the right shape for model building. The shape of your data here will depend on what kind of model building you are doing. - -=== Flattening/Vectorization/Reshaping - -Commonly, reshaping means going from a higher dimension to a lower dimension, not just expanding or contracting the same shape. For instance, this can mean going from 2 dimensions to 1. You can see a demonstration of this in the https://the-examples-book.com/starter-guides/data-science/data-analysis/nndl/neural-network-deep-learning[neural network introduction notebook]; the shape required for the neural network training at the start is (784, ), which is a 1 dimensional array of 784 numbers. We reshaped the data from 28*28 pixel images instead to a 1 dimensional sequence of 784 pixels (hence the "flattening": going one dimension lower). Sometimes, you will see this called "vectorization" when you convert the original shape into a vector. - -=== Data Labeling - -If you have data that needs labels applied, this is where you'd do it. Often for machine learning, labeling is set to be a vector that is the same length as the data, with only one dimension: the corresponding label for each data point. - -== Encoding - -Encoding is just converting a non-numeric data type into a numeric type. For instance, if we have a column *Nation* it might make sense to convert it into 0 for Afghanistan, 1 for Albania, 2 for Algeria, etc. Sometimes you will see this called "categorical encoding", "categorical labeling", etc. A specific type of encoding, called one-hot encoding, is used to create new columns that use a boolean True/False to represent the original variable. - -== Code Examples - -NOTE: All of the code examples are written in Python, unless otherwise noted. 
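-
-Before the container examples, here is a short standalone sketch of the encoding and flattening ideas described above; the `Nation` column mirrors the example in the text, and the remaining values are made up.
-
-[source,python]
-----
-import numpy as np
-import pandas as pd
-
-df = pd.DataFrame({"Nation": ["Afghanistan", "Albania", "Algeria", "Albania"]})
-
-# categorical (label) encoding: one integer code per category
-df["Nation_coded"] = df["Nation"].astype("category").cat.codes
-
-# one-hot encoding: one True/False column per category
-one_hot = pd.get_dummies(df["Nation"], prefix="Nation")
-print(df.join(one_hot))
-
-# flattening/reshaping: a 28x28 "image" becomes a 1-dimensional vector of 784 values
-image = np.zeros((28, 28))
-flat = image.reshape(784)
-print(flat.shape)   # (784,)
-----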
- -=== Containers - -TIP: These are code examples in the form of Jupyter notebooks running in a container that come with all the data, libraries, and code you'll need to run it. https://the-examples-book.com/starter-guides/data-engineering/containers/using-data-mine-containers[Click here to learn why you should be using containers, along with how to do so.] - -TIP: Quickstart: https://docs.docker.com/get-docker/[Download Docker], then run the commands below in a terminal. - -[source,bash] ----- -#pull container, only needs to be run once -docker pull ghcr.io/thedatamine/starter-guides:preprocessing - -#run container -docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:preprocessing ----- - -Need help implementing any of this code? Feel free to reach out to mailto:datamine-help@purdue.edu[datamine-help@purdue.edu] and we can help! - -== Our Sources - -- https://www.techtarget.com/searchdatamanagement/definition/data-preprocessing[Data Preprocessing (TechTarget)] -- https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/[Data Preprocessing in Data Mining (Geeks for Geeks)] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/project01.ipynb b/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/project01.ipynb deleted file mode 100644 index 83b4425ac..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/project01.ipynb +++ /dev/null @@ -1,509 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "432fbc61-b43a-4d3a-b1a8-e7cf3a3f50a4", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-04647ed3078849a3", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "source": [ - "# Introduction to Artificial Intelligence and Machine Learning (AI/ML)" - ] - }, - { - "cell_type": "markdown", - "id": "1a37d58f-522a-4ed5-b5ca-bf7daec5a816", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-55a4349255e68406", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "source": [ - "## Project Objectives\n", - "- Define AI/ML and the difference between them\n", - "- Compare ML vs. expert based systems\n", - "- Explore ways in which AI/ML gets used\n", - "- Do the \"Hello World\" of neural networks as an introduction to AI/ML" - ] - }, - { - "cell_type": "markdown", - "id": "dad622a6-44bd-4100-b0c1-cdfce690e587", - "metadata": {}, - "source": [ - "## Our Sources" - ] - }, - { - "cell_type": "markdown", - "id": "9a4a4b92-9b5b-4285-bf3e-d1f88cf1869f", - "metadata": {}, - "source": [ - "- What is Artificial Intelligence? (IBM)\n", - "- What is Machine Learning? (IBM)\n", - "- Russell & Norvig (1995)\n", - "- https://www.statlearning.com/" - ] - }, - { - "cell_type": "markdown", - "id": "df30f340-3f62-4698-9a2e-bb60f813588f", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-f608626911043cd8", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "source": [ - "## AI and ML: Definitions and Differences (1 point)" - ] - }, - { - "cell_type": "markdown", - "id": "41f0e227-c303-4e9b-8ee4-588beffacd2f", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-c528559e2f8a47c4", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "source": [ - "Read from both of these links on the difference (or lack thereof) between AI and ML:\n", - "\n", - "- What is Artificial Intelligence? 
(IBM)\n", - "- What is Machine Learning? (IBM)\n", - "\n", - "**In 1-2 sentences, describe the difference (or lack thereof) between machine learning and artificial intelligence in the cell below.** Citations are not required." - ] - }, - { - "cell_type": "markdown", - "id": "215d421e-4d39-4e8f-a914-5aa1a24dcc1c", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "cell-163511a9d3dcbe0f", - "locked": false, - "points": 1, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "source": [] - }, - { - "cell_type": "markdown", - "id": "8cc40505-fe0a-40d4-8161-59f6ede7b5c3", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-3a54407438cde880", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "source": [ - "## Expert Based Systems (1 point)" - ] - }, - { - "cell_type": "markdown", - "id": "e7a721bc-92bb-4b60-b7db-9c3dcba712d3", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-af281b4297b9e8d2", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "source": [ - "Read from this AI textbook from Russell & Norvig (1995) on what an **expert system** is. Then describe the difference between an expert system and machine learning. The relevant section is titled *Knowledge-based systems: The key to power? (1969-1979)* and starts at page 22 (you should only need to read this one section).\n", - "\n", - "**In a few sentences, describe the difference between machine learning and expert systems in the cell below**. Citations are not required." - ] - }, - { - "cell_type": "markdown", - "id": "8d3bf203-393b-4122-aedb-1236864ed657", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "cell-db20e67f4d5c8f9d", - "locked": false, - "points": 1, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "source": [] - }, - { - "cell_type": "markdown", - "id": "6f74378d-3198-45cc-82ad-263cfb6d53b9", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-c4da309651fb519b", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "source": [ - "## Use Cases For AI/ML (1 point)" - ] - }, - { - "cell_type": "markdown", - "id": "048a1ce3-04c5-4728-8cb2-51e2d1632103", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-ad70b7d703774a53", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "source": [ - "Go back to the link What is Machine Learning? (IBM) and scroll down to the \"Real-world machine learning use cases\" section.\n", - "\n", - "**In 1-3 sentences, explain which one of the given use cases interest you the most and why**. Citations are not required." 
- ] - }, - { - "cell_type": "markdown", - "id": "5c7ac2e5-9f04-4886-a16a-570331163cd6", - "metadata": { - "nbgrader": { - "grade": true, - "grade_id": "cell-78f3f8e29551d5fb", - "locked": false, - "points": 1, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "source": [] - }, - { - "cell_type": "markdown", - "id": "1a5c268a-62c2-495b-bcea-33cad95f478c", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-7f8c7301f61021a7", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "source": [ - "## A Machine Learning Algorithm: $k$-nearest neighbors (knn)" - ] - }, - { - "cell_type": "markdown", - "id": "003fa6d4-cf30-43ed-a9b0-a05c9900ceb7", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-5722638f350edea4", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - }, - "tags": [] - }, - "source": [ - "Below we implement the $k$-nearest neighbors algorithm using Scikit-Learn, a machine learning package. \n", - "\n", - "*You might not understand what most of this code is doing, and we don't expect you to! The entire algorithm is mostly implemented for you; all you need to do is edit a few lines of code to finish it.* **There will be clear instructions at the two points where you need to edit the code to get it to work.**\n", - "\n", - "If you are interested in learning more about $k$-nearest neighbors, check out chapter 2.2.3 of ISL: https://www.statlearning.com/ or visit the Starter Guide page (this code comes directly from the `knn` code example)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e02dbf27-865c-4b66-ba20-544ebe3aac04", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-eb2dd79430b490af", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import openpyxl\n", - "import numpy as np\n", - "from sklearn.neighbors import KNeighborsClassifier\n", - "from sklearn import metrics\n", - "import seaborn as sns\n", - "import matplotlib.pyplot as plt\n", - "import warnings\n", - "import math\n", - "from sklearn.model_selection import train_test_split\n", - "\n", - "warnings.filterwarnings('ignore') #ignore warnings that occur" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9f4dc9a5-ec3b-412a-83c8-9798fc4000fb", - "metadata": {}, - "outputs": [], - "source": [ - "df = pd.read_excel(\"data.xlsx\")\n", - "df = df.dropna()\n", - "columns_to_convert = ['satisfaction_v2','Gender','Customer Type','Type of Travel','Class']\n", - "\n", - "for column in columns_to_convert:\n", - " df[column] = df[column].astype('category')\n", - " df[column+\"_coded\"] = df[column].cat.codes\n", - "\n", - "old_df = df\n", - "\n", - "df = df.drop(columns=['id'])\n", - "df = df.drop(columns=columns_to_convert)\n", - "\n", - "columns_to_norm = ['Age','Flight Distance','Departure Delay in Minutes','Arrival Delay in Minutes']\n", - "\n", - "for column in columns_to_norm:\n", - " df[column] = df[column]/np.max(df[column])" - ] - }, - { - "cell_type": "markdown", - "id": "5cdbfdb9-774e-4552-8bf7-fff0c4b3e56a", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-2bde08245920e4f3", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "source": [ - "#### Train Test Splits (1 point)" - ] - }, - { - "cell_type": "markdown", - "id": "368edeb0-d4e1-4071-8fca-36f5580c4104", - "metadata": { - "nbgrader": 
{ - "grade": false, - "grade_id": "cell-670e850c6791b99a", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "source": [ - "Below we create the cross validation train test splits from our data. You can learn about train/test splits here: https://the-examples-book.com/starter-guides/data-science/data-modeling/resampling-methods/cross-validation/train-valid-test\n", - "\n", - "**Set a float called `test_size` to be some value between 0.05 and 0.30 to create a test split that is 5-30% of our total dataset**. `test_size` gets used in the scikit-learn function `train_test_split` to automatically shuffle our data and create train test splits." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a5c2d5db-a38f-4b2a-93d9-93f3f41bd6ff", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "test_size_answer", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "test_size = ??" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "92ae52dd-d20b-4fbb-b2b5-621bfe696fd1", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-223229102718b2e5", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "labels = df['satisfaction_v2_coded'] #create the labels \n", - "data = df.drop(columns=['satisfaction_v2_coded']) #recreate the data\n", - "train_x, test_x, train_y, test_y = train_test_split(data, labels, test_size=test_size, random_state=42)" - ] - }, - { - "cell_type": "markdown", - "id": "694dac77-11eb-42a8-95ad-3d56e9041893", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-18f0f6838c99c4f9", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "source": [ - "#### Setting the Max $k$ Value (1 point)" - ] - }, - { - "cell_type": "markdown", - "id": "0b982a2a-8fc2-43ee-8318-487e9d91a79f", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-ddf4030801b53dbc", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "source": [ - "Below we create a for loop to try out multiple different $k$ values. Here we set the maximum value of $k$. You will want to set your `max_k` value to not be more than 20; it might take a while if you go higher than that, and besides, you will see that this data (like most datasets) doesn't benefit from a $k$ value higher than 10. **Set variable `max_k` to be equal to an int between 1 and 21 of your choice.**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b5f4556d-ba3e-4dd7-8ca3-03b66ac63bfd", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "max_k_answer", - "locked": false, - "schema_version": 3, - "solution": true, - "task": false - } - }, - "outputs": [], - "source": [ - "max_k = ??" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f10e3595-70f3-4d24-b856-6de17fd6ab5e", - "metadata": { - "nbgrader": { - "grade": false, - "grade_id": "cell-4216605e48894dd8", - "locked": true, - "schema_version": 3, - "solution": false, - "task": false - } - }, - "outputs": [], - "source": [ - "k_values = []\n", - "train_acc = []\n", - "test_acc = []\n", - "\n", - "#for each possible k we can test from 2 to the max possible k value (including max_k)\n", - "for k in range(2,max_k+1):\n", - "#Train Model and Predict \n", - " print(\"Now testing value of k:\",k)\n", - " neigh = KNeighborsClassifier(n_neighbors = k).fit(train_x,train_y)\n", - " yhat = neigh.predict(test_x)\n", - " k_values.append(k)\n", - " train_acc.append(metrics.accuracy_score(train_y, neigh.predict(train_x)))\n", - " test_acc.append(metrics.accuracy_score(test_y, yhat))\n", - "\n", - "#convert results to df\n", - "results_data = {'k':k_values, 'Training Accuracy':train_acc, 'Test Accuracy':test_acc}\n", - "results_df = pd.DataFrame(data=results_data)\n", - "\n", - "print(\"The k value with the highest accuracy betwen 2 and\", max_k,\"is\",np.argmax(test_acc)+2)\n", - "\n", - "# setting the dimensions\n", - "fig, ax = plt.subplots(figsize=(30, 18))\n", - " \n", - "# drawing the plot\n", - "sns.lineplot(results_df, x='k',y='Test Accuracy', ax=ax).set_title(\"Test Accuracy For Each k Value\")\n", - "plt.show()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "primary_python_env", - "language": "python", - "name": "primary_python_env" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/rnn-teachingprogramming.adoc b/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/rnn-teachingprogramming.adoc deleted file mode 100644 index d6f673f30..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/rnn-teachingprogramming.adoc +++ /dev/null @@ -1,95 +0,0 @@ -= Recurrent Neural Networks (RNN) - -Recurrent Neural Networks are a type of neural network that uses a bidirectional model architecture, where outputs from nodes affect the incoming inputs to some extent (in contrast to the well known feedforward architecture that neural nets have traditionally used). There are many different styles of implementation, yet they often incorporate sequential or language related data to produce models that can remember and learn from the data and dynamically to decide what information matters during learning and what doesn't. - -== Common Applications - -=== Common Problem Types - -- NLP -- Time Series -- Music/Sound/Audio -- Biological/Genetic - -== A Brief History - -With roots in the 1920's, Amari (1972) is generally credited with being the first to make RNN's *adaptive*, that is, learn to change its outputs given its inputs by changing connection weights. Slow and steady developments in the 70's and 80's gave rise to one of the most popular RNN derivatives, the LSTM ("Long Short-Term Memory"). By 2016, LSTM models accounted for utilizing over a quarter of the total computational resources allotted for neural net inference at Google. 
- -https://arxiv.org/pdf/2212.11279.pdf[Learn more from our source about the history of RNN's (and NN in general!)] - -== Code Examples - -NOTE: All of the code examples are written in Python, unless otherwise noted. - -=== Containers - -TIP: These are code examples in the form of Jupyter notebooks running in a container that come with all the data, libraries, and code you'll need to run it. https://the-examples-book.com/starter-guides/data-engineering/containers/using-data-mine-containers[Click here to learn why you should be using containers, along with how to do so.] - -TIP: Quickstart: https://docs.docker.com/get-docker/[Download Docker], then run the commands below in a terminal. - -==== Time Series RNN - -A great example from the Tensorflow authors building an RNN using time series data. - -[source,bash] ----- -#pull container, only needs to be run once -docker pull ghcr.io/thedatamine/starter-guides:time-series-rnn - -#run container -docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:time-series-rnn ----- - -==== LSTM (Long Short-Term Memory) - -An implementation of an LSTM model trained on stock data to predict what the value will be in the near future. - -[source,bash] ----- -#pull container, only needs to be run once -docker pull ghcr.io/thedatamine/starter-guides:lstm - -#run container -docker run -p 8888:8888 -it ghcr.io/thedatamine/starter-guides:lstm ----- - -Need help implementing any of this code? Feel free to reach out to mailto:datamine-help@purdue.edu[datamine-help@purdue.edu] and we can help! - -== Resources - -All resources are chosen by Data Mine staff to be of decent quality, and most if not all content is free. - -=== Websites - -- https://www.ibm.com/topics/recurrent-neural-networks[Recurrent Neural Networks (IBM)] -- https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks[Recurrent Neural Networks Cheatsheet (Stanford)] -- https://karpathy.github.io/2015/05/21/rnn-effectiveness/[The Unreasonable Effectiveness of Recurrent Neural Networks (Andrej Karpathy)] -- https://colah.github.io/posts/2015-08-Understanding-LSTMs/[Understanding LSTM's (Christopher Olah)] -- https://towardsdatascience.com/recurrent-neural-networks-rnns-3f06d7653a85[Recurrent Neural Networks (Towards Data Science)] -- https://www.mathworks.com/discovery/rnn.html[What is a Recurrent Neural Network? (MathWorks)] - -=== Videos - -- https://www.youtube.com/watch?v=AsNTP8Kwu80[Recurrent Neural Networks (RNNs), Clearly Explained!!! (StatQuest With Josh Starmer, ~16 minutes)] -- https://www.youtube.com/watch?v=YCzL96nL7j0[Long Short-Term Memory (LSTM), Clearly Explained (StatQuest With Josh Starmer, ~21 minutes)] -- https://www.youtube.com/watch?v=b61DPVFX03I[What is LSTM (Long Short Term Memory)? (IBM, ~8 minutes)] -- https://www.youtube.com/watch?v=LHXXI4-IEns[Illustrated Guide to Recurrent Neural Networks: Understanding the Intuition (The A.I. Hacker - Michael Phi, ~10 minutes)] -- https://www.youtube.com/watch?v=WCUNPb-5EYI[Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) (~25 minutes)] -- https://www.youtube.com/watch?v=Y2wfIKQyd1I[What is Recurrent Neural Network (RNN)? 
(~16 minutes)] -- https://www.youtube.com/watch?v=DFZ1UA7-fxY[Recurrent Neural Networks : Data Science Concepts (~27 minutes)] - -=== Books - -- https://www.statlearning.com[Introduction to Statistical Learning (ISL)] -- https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170200340801081[Recurrent neural networks: from simple to gated architectures (2022)] -- https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170398531201081[Recurrent neural networks: concepts and applications (2023)] - -=== Articles - -- https://arxiv.org/pdf/2212.11279.pdf[The Road To Modern AI (2022)] -- https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/5imsd2/cdi_crossref_primary_10_1162_neco_1997_9_8_1735[Long Short-Term Memory (1997)] -- https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/5imsd2/cdi_crossref_primary_10_1016_j_petrol_2019_106682[Time-series well performance prediction based on Long Short-Term Memory (LSTM) neural network model (2020)] -- https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/5imsd2/cdi_doaj_primary_oai_doaj_org_article_e6777fc0a9164c74997b527270e53e33[Long Short-Term Memory Neural Networks for Online Disturbance Detection in Satellite Image Time Series (2018)] -- https://web.stanford.edu/~jurafsky/slp3/9.pdf[RNN's and LSTM's (2023)] -- https://arxiv.org/pdf/1909.09586.pdf[Understanding LSTM a tutorial into Long Short-Term Memory Recurrent Neural Networks (2019)] -- https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/5imsd2/cdi_crossref_primary_10_1109_T_C_1972_223477[Learning Patterns and Pattern Sequences by Self-Organizing Nets of Threshold Elements (1972)] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/trainingtrainingprogramming.adoc b/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/trainingtrainingprogramming.adoc deleted file mode 100644 index 6bd9eda7f..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/30200_40200/trainingtrainingprogramming.adoc +++ /dev/null @@ -1,43 +0,0 @@ -= Training Your Model -:page-mathjax: true - -== Introduction - -Training (sometimes called fitting) your model is when you build out the implementation of your modeling technique, connect data to it, and press start. Here we explore what this process entails. - -== Planning the Architecture - -Here we plan the structure of our model. Often, this is where we determine the approximate shapes that $\hat{f}$ is likely to take on and plan our architecture in accordance- you will recall that shaping is heavily influenced by xref:data-modeling/choosing-model/parameterization.adoc[how we parameterize], if we choose to do so. For instance, in a regression model, this often takes the form of determining if a simple or curvilinear line is needed, detecting possible interaction effects, etc. - -If you have hyperparameters (also called tuning parameters), this is the place to list them out and consider possible ranges for them. Neural networks are a great example here because they often have so many hyperparameters, including the number of hidden layers, number of neurons in each layer, the activation functions used, what type of gradient descent algorithm and its associated optimization, and much more. We need to come up with a way to decide on how to the different tuning paramaters; sometimes this can amount to testing out numerous different models and seeing what works. 
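-
-One common way to "test out numerous different models" is a plain grid search over candidate hyperparameter values. The sketch below uses scikit-learn's `MLPClassifier` on synthetic data purely as an illustration; the model, ranges, and search strategy for your project will differ.
-
-[source,python]
-----
-from sklearn.datasets import make_classification
-from sklearn.model_selection import GridSearchCV
-from sklearn.neural_network import MLPClassifier
-
-X, y = make_classification(n_samples=500, n_features=20, random_state=42)
-
-# candidate ranges for a few hyperparameters of a small neural network
-param_grid = {
-    "hidden_layer_sizes": [(16,), (32,), (32, 16)],
-    "activation": ["relu", "tanh"],
-    "alpha": [1e-4, 1e-3],
-}
-
-search = GridSearchCV(MLPClassifier(max_iter=500, random_state=42), param_grid, cv=3)
-search.fit(X, y)
-print(search.best_params_, search.best_score_)
-----
-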
- -This step is where the planning in xref:data-modeling/process/think-output.adoc[thinking about the output] comes into play. If we are trying to imagine what the approximate structure for our model should be, how can we do that if we don't know what we want to see as output? - -With relatively simple models, going right ahead to code and seeing what works may be a great idea. Other times, with more complex models and lots of data, writing out our model architecture first is not a bad idea. Here's a rough template to give you an idea: - -- What Do We Want To See As Output: -- Model(s) To Build: -- Parameters in the model(s): -- Hyperparameters (and optimization of those hyperparameters, if applicable): - -== Code Implementation - -Once you've got your model architecture sketched out, we implement it in code. Packages/libraries differ in their implementations, but there are robust packages that make building the architecture fairly straightforward once you learn how they work. A great example of code implementation with a few hyperparameters for a neural network using TensorFlow can be seen in the https://the-examples-book.com/starter-guides/data-science/data-analysis/nndl/neural-network-deep-learning[neural network introduction notebook]. - -== Training/Fitting - -Sometimes you will see the training/fitting process called "learning", and this is where the notion of machine "learning" comes from: our machine is learning/training/fitting its model based on the training data. - -Again, the implementation details differ wildly depending on which package/library you are using. The model will use the validation set to verify and/or optimize itself along the way, depending on your architecture and model choice. - -Most of the time, it's a single line of code that starts the training process. Again, you can see a simple example of this in the https://the-examples-book.com/starter-guides/data-science/data-analysis/nndl/neural-network-deep-learning[neural network introduction notebook]. - -By the end of this process, your model will have its parameters discovered/chosen, or if you are using an unsupervised method, it will have taken at least one complete iteration of its algorithm. Unsupervised methods, such as clustering, can return differing results when trained on the same data across many iterations. You can see an example of this process below, where no new data is added, but the algorithm determines that the clusters are slightly different at each iteration until stopping. - -image::K-means_convergence.gif[] - -== Testing - -Once we are finished training, we've xref:data-modeling/process/measure-fit.adoc[assessed the model's accuracy], and we are satisfied with its training, we can test the model to see how well it performs on unbiased data. You may recall that xref:data-modeling/resampling-methods/cross-validation/train-valid-test.adoc[although it's optional, the highly recommended test split] can be used to test a model after it's been trained. The reason we set aside this test split is so that we have data that was completely unused during the model training process. You will recall that the training data was used to train the model; the validation data was used to validate the training results (and/or optimize the tuning parameters) during training. But the test split was not used at all for training, which is why it is considered "unbiased". - -You can think of this step sort of like a new car leaving the factory. 
It ought to be test driven for at least a few miles before being sold to a customer, just to verify that the basic things are right. Often, testing is done on samples of data, and metrics are taken again; for instance, what percentage of images are correctly classified as cats in a test set of dog and cat images? Here we get to see the real results of our labor. \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure01.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure01.webp deleted file mode 100644 index 943ccb589..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure01.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure02.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure02.webp deleted file mode 100644 index 781a10761..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure02.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure03.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure03.webp deleted file mode 100644 index 7b10b6538..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure03.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure04.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure04.webp deleted file mode 100644 index e836fd479..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure04.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure05.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure05.webp deleted file mode 100644 index a6298c950..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure05.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure06.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure06.webp deleted file mode 100644 index 4c543c1ed..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure06.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure07.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure07.webp deleted file mode 100644 index 206ad2fb9..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure07.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure08.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure08.webp deleted file mode 100644 index df664269e..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure08.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure09.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure09.webp deleted file mode 100644 index 3928998ac..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure09.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure10.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure10.webp deleted file mode 100644 index 1e9910f81..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure10.webp and /dev/null differ diff --git 
a/projects-appendix/modules/ROOT/pages/spring2024/images/figure11.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure11.webp deleted file mode 100644 index 9ea314a0e..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure11.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure12.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure12.webp deleted file mode 100644 index 905bc1de7..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure12.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure13.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure13.webp deleted file mode 100644 index c9690ef1d..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure13.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure14.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure14.webp deleted file mode 100644 index 7773bc4ba..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure14.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure15.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure15.webp deleted file mode 100644 index 7a1fc82cb..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure15.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure16.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure16.webp deleted file mode 100644 index 7eef43f50..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure16.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure17.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure17.webp deleted file mode 100644 index 0a899198f..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure17.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure18.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure18.webp deleted file mode 100644 index c0f15eb3e..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure18.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure19.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure19.webp deleted file mode 100644 index 4e8335939..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure19.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure20.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure20.webp deleted file mode 100644 index 5625a90a2..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure20.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure21.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure21.webp deleted file mode 100644 index 08b955b56..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure21.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure22.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure22.webp 
deleted file mode 100644 index ec1850e8e..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure22.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure23.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure23.webp deleted file mode 100644 index 516ce478a..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure23.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure24.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure24.webp deleted file mode 100644 index 69b38477d..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure24.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure25.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure25.webp deleted file mode 100644 index 3b0daa1b4..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure25.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure26.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure26.webp deleted file mode 100644 index a8c6c507f..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure26.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure27.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure27.webp deleted file mode 100644 index fe0db74b3..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure27.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure28.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure28.webp deleted file mode 100644 index 79de2ddf5..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure28.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure29.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure29.webp deleted file mode 100644 index cf915d268..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure29.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure30.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure30.webp deleted file mode 100644 index 120209141..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure30.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure31.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure31.webp deleted file mode 100644 index 923057bdb..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure31.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure32.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure32.webp deleted file mode 100644 index 4d482bd62..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure32.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure33.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure33.webp deleted file mode 100644 index 3a67633f3..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure33.webp and 
/dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure34.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure34.webp deleted file mode 100644 index 0f6ee4836..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure34.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure35.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure35.webp deleted file mode 100644 index 787f79c06..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure35.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure36.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure36.webp deleted file mode 100644 index 41b8ec38a..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure36.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure37.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure37.webp deleted file mode 100644 index 71e04e972..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure37.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure38.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure38.webp deleted file mode 100644 index 1fdb14c42..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure38.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure39.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure39.webp deleted file mode 100644 index 29479d146..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure39.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure40.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure40.webp deleted file mode 100644 index cbf06a28b..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure40.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure41.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure41.webp deleted file mode 100644 index 3e2b0cccb..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure41.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure42.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure42.webp deleted file mode 100644 index 14cfb3641..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure42.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure43.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure43.webp deleted file mode 100644 index 62ba1f9cc..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure43.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure44.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure44.webp deleted file mode 100644 index db2ac1e34..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure44.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure45.webp 
b/projects-appendix/modules/ROOT/pages/spring2024/images/figure45.webp deleted file mode 100644 index 1f5c7fc60..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure45.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure46.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure46.webp deleted file mode 100644 index 9aca55343..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure46.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure47.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure47.webp deleted file mode 100644 index c41e60b32..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure47.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure48.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure48.webp deleted file mode 100644 index 3eb504bdd..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure48.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure49.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure49.webp deleted file mode 100644 index 81e35f20b..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure49.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure50.gif b/projects-appendix/modules/ROOT/pages/spring2024/images/figure50.gif deleted file mode 100644 index 0cb4324bd..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure50.gif and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure51.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure51.webp deleted file mode 100644 index eda139648..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure51.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/figure52.webp b/projects-appendix/modules/ROOT/pages/spring2024/images/figure52.webp deleted file mode 100644 index 2c0e95cda..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/figure52.webp and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/office_hours_102.png b/projects-appendix/modules/ROOT/pages/spring2024/images/office_hours_102.png deleted file mode 100644 index 0b9a4a1e9..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/office_hours_102.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/office_hours_202.png b/projects-appendix/modules/ROOT/pages/spring2024/images/office_hours_202.png deleted file mode 100644 index 60722ea33..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/office_hours_202.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/office_hours_302.png b/projects-appendix/modules/ROOT/pages/spring2024/images/office_hours_302.png deleted file mode 100644 index b2389fb36..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/office_hours_302.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/office_hours_402.png b/projects-appendix/modules/ROOT/pages/spring2024/images/office_hours_402.png 
deleted file mode 100644 index 48b6be129..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/office_hours_402.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/images/temp_102.png b/projects-appendix/modules/ROOT/pages/spring2024/images/temp_102.png deleted file mode 100644 index 3da1768da..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/images/temp_102.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/102_TAs.adoc b/projects-appendix/modules/ROOT/pages/spring2024/logistics/102_TAs.adoc deleted file mode 100644 index 4f0880fd9..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/logistics/102_TAs.adoc +++ /dev/null @@ -1,59 +0,0 @@ -= TDM 102 T.A.s - Spring 2024 - -*Head TA*: Pramey Kabra - kabrap@purdue.edu - -== Student-facing T.A.s: - -[IMPORTANT] -==== -*Zoom Link for Office Hours*: https://purdue-edu.zoom.us/s/97774213087 - -- When joining office hours, please include your Data Mine level in front of your name. For example, if you are in TDM 102, your name should be entered as “102 - [Your First Name] [Your Last Name]”. - -- After joining the Zoom call, please stay in the main room until a TA invites you to a specific breakout room. -==== - -[NOTE] -==== -You can find the office hours schedule on the xref:spring2024/office_hours_102.adoc[*Office Hours*] page. -==== - -- Adarsh Rao -- Bharath Sadagopan -- Brennan Frank -- Crystal Mathew -- Chaewon Oh -- Daniel Lee -- Hpung San Aung -- Minsoo Oh -- Nihar Atri -- Rhea Pahuja -- Sabharinath Saravanan -- Samhitha Mupharaphu -- Sanjhee Gupta -- Sharan Sivakumar -- Shree Krishna Tulasi Bavana -- Shreya Ippili -- Shrinivas Venkatesan -- Ta-Yuan Sun (Derek) -- Vivek Chudasama -- Yifei Jin - -== Graders: - -- Connor Barnsley -- David Martin Calalang -- Dheeraj Namargomala -- Gaurav Singh -- Mridhula Srinivasan -- Shriya Gupta -- Tushar Singh - ---- - -[NOTE] -==== -Use the link below to give your favorite seminar TAs a shout-out, and tell us how they helped you learn at The Data Mine! - -https://forms.office.com/r/mzM3ACwWqP -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/202_TAs.adoc b/projects-appendix/modules/ROOT/pages/spring2024/logistics/202_TAs.adoc deleted file mode 100644 index fa3189f38..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/logistics/202_TAs.adoc +++ /dev/null @@ -1,42 +0,0 @@ -= TDM 202 T.A.s - Spring 2024 - -*Head TA*: Pramey Kabra - kabrap@purdue.edu - -== Student-facing T.A.s: - -[IMPORTANT] -==== -*Zoom Link for Office Hours*: https://purdue-edu.zoom.us/s/97774213087 - -- When joining office hours, please include your Data Mine level in front of your name. For example, if you are in TDM 102, your name should be entered as “102 - [Your First Name] [Your Last Name]”. - -- After joining the Zoom call, please stay in the main room until a TA invites you to a specific breakout room. -==== - -[NOTE] -==== -You can find the office hours schedule on the xref:spring2024/office_hours_202.adoc[*Office Hours*] page. -==== - -- Ananya Goel -- Dhruv Shah -- Joseph Lee -- Nikhil Saxena - -== Graders: - -- Aayushi Akhouri -- Haemi Lee -- Jack Secor -- Theeraptra Thongdee -- Tong En Sim (Nicole) -- Tori Donoho - ---- - -[NOTE] -==== -Use the link below to give your favorite seminar TAs a shout-out, and tell us how they helped you learn at The Data Mine! 
- -https://forms.office.com/r/mzM3ACwWqP -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/302_TAs.adoc b/projects-appendix/modules/ROOT/pages/spring2024/logistics/302_TAs.adoc deleted file mode 100644 index 927c29141..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/logistics/302_TAs.adoc +++ /dev/null @@ -1,32 +0,0 @@ -= TDM 302 T.A.s - Spring 2024 - -*Head TA*: Pramey Kabra - kabrap@purdue.edu - -== Graders + Student-facing T.A.s: - -[IMPORTANT] -==== -*Zoom Link for Office Hours*: https://purdue-edu.zoom.us/s/97774213087 - -- When joining office hours, please include your Data Mine level in front of your name. For example, if you are in TDM 102, your name should be entered as “102 - [Your First Name] [Your Last Name]”. - -- After joining the Zoom call, please stay in the main room until a TA invites you to a specific breakout room. -==== - -[NOTE] -==== -You can find the office hours schedule on the xref:spring2024/office_hours_302.adoc[*Office Hours*] page. -==== - -- Aditya Bhoota -- Ankush Maheshwari -- Brian Fernando - ---- - -[NOTE] -==== -Use the link below to give your favorite seminar TAs a shout-out, and tell us how they helped you learn at The Data Mine! - -https://forms.office.com/r/mzM3ACwWqP -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/402_TAs.adoc b/projects-appendix/modules/ROOT/pages/spring2024/logistics/402_TAs.adoc deleted file mode 100644 index 217b8b8bb..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/logistics/402_TAs.adoc +++ /dev/null @@ -1,30 +0,0 @@ -= TDM 402 T.A.s - Spring 2024 - -*Head TA*: Pramey Kabra - kabrap@purdue.edu - -== Grader + Student-facing T.A.: - -[IMPORTANT] -==== -*Zoom Link for Office Hours*: https://purdue-edu.zoom.us/s/97774213087 - -- When joining office hours, please include your Data Mine level in front of your name. For example, if you are in TDM 102, your name should be entered as “102 - [Your First Name] [Your Last Name]”. - -- After joining the Zoom call, please stay in the main room until a TA invites you to a specific breakout room. -==== - -[NOTE] -==== -You can find the office hours schedule on the xref:spring2024/office_hours_402.adoc[*Office Hours*] page. -==== - -- Jackson Fair - ---- - -[NOTE] -==== -Use the link below to give your favorite seminar TAs a shout-out, and tell us how they helped you learn at The Data Mine! - -https://forms.office.com/r/mzM3ACwWqP -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/bookshelf-teachingprogrmamming.doc b/projects-appendix/modules/ROOT/pages/spring2024/logistics/bookshelf-teachingprogrmamming.doc deleted file mode 100644 index b27dee352..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/logistics/bookshelf-teachingprogrmamming.doc +++ /dev/null @@ -1,295 +0,0 @@ -= The Data Mine's Bookshelf - -WARNING: This page is still under construction. For now, you might find short hand names of the books. We are working on listing all the Purdue library links here. If you look up any of these books on Purdue's library (which anyone can do, even non Purdue students) you will almost certainly find the book. - -While most of these books are scattered throughout the Starter Guides on their respective topics, they are also listed here under their approximate content domain. All of these books come highly recommended. For Purdue students, most if not all of these books are free at the Purdue library link; for non-Purdue students, a good chunk of them should be free. 
- -.Data Science - -== General - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99169850275601081[Introduction to Data Technologies] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/ufs51j/alma99170206375101081[Thinking with Data] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170206001401081[Bad Data Handbook] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170206728901081[Doing Data Science] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170208361701081[Becoming a Data Head] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207906501081[Introducing Data Science] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/ufs51j/alma99170207834101081[Data Science From Scratch] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207211501081[Learning to Love Data Science] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99343484626401082[Think Like a Data Scientist] - - [ ] https://www.npr.org/2012/10/10/162594751/signal-and-noise-prediction-as-art-and-science[The Signal And The Noise: Why So Many Predictions Fail - But Some Don't] - - [ ] https://cs.nyu.edu/~davise/papers/Ellenberg.pdf[How Not to be Wrong: The Power of Mathematical Thinking] - - [ ] https://ischoolonline.berkeley.edu/data-science/what-is-data-science/[What is Data Science?] - - [ ] https://www.kdnuggets.com/news/top-stories.html[KDNuggets: Machine Learning Articles] - - [ ] https://paperswithcode.com[Research Papers With Code] - - [ ] https://machinelearningmastery.com/start-here/[Machine Learning Mastery: Step by Step Guides] - -== Data Analysis - -=== EDA: Exploratory Data Analysis - - - [ ] https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm[What is EDA?] - - [ ] https://www.itl.nist.gov/div898/handbook/eda/section1/eda14.htm[What are the EDA Goals?] 
- - [ ] https://r4ds.had.co.nz/exploratory-data-analysis.html[Exploratory Data Analysis (R for Data Science)] - -=== Visualization - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/5imsd2/cdi_crossref_primary_10_2307_1390947[An Approach to Providing Mathematical Annotation in Plots] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/5imsd2/cdi_proquest_miscellaneous_57612250[Creating More Effective Graphs] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/ufs51j/alma99137093640001081[Elements of Graphing Data] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/ufs51j/alma99169166003201081[Grammar of Graphics] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99169166769101081[Graphics of Large Datasets] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/5imsd2/cdi_proquest_journals_1311448658[How To Display Data Badly] - - [ ] Maps for advocacy - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/ufs51j/alma991002030469704601[Visual Display of Quantitative Info] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma9931804101082[Beautiful Evidence] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma9916797701082[Visual Explanations] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99127928770001081[Envisioning Information] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170454182801081[Visualizing Data] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170204137901081[Visualizing Data] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170206494001081[Interactive Data Visualization for the Web] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207571601081[Fundamentals of Data Visualization] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/ufs51j/alma99170208420301081[Presenting to Win] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/5imsd2/cdi_crossref_primary_10_14714_CP67_172[The Atlas of the Real World: Mapping the Way We Live] - - [ ] Seeing with fresh eyes - - [ ] Visualizing information for advocacy - - [ ] S plus trellis graphics - - [ ] Making Data Visual - - [ ] Tableau Desktop Cookbook by Lorna Brown (O’Reilly, 2021) - - [ ] Innovative Tableau by Ryan Sleeper (O’Reilly, 2020) - - [ ] Practical Tableau by Ryan Sleeper (O’Reilly, 2018) - - [ ] Communicating Data with Tableau by Ben Jones (O’Reilly, 2014) - - [ ] Tableau Strategies by Ann Jackson and Luke Stanke (O’Reilly, 2021) - - [ ] Tableau Prep: Up & Running by Carl Allchin (O’Reilly, 2020) - - [ ] https://m2.material.io/design/communication/data-visualization.html#principles[Data Visualization Principles] - -=== Analysis Techniques - -==== Spatial Data Analysis - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/ufs51j/alma99169166877001081[Applied Spatial Data Analysis with R] - -==== Computer Vision - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170277260601081[Low Power Computer Vision] - -==== Time Series - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99169166711201081[Introductory Time Series with R] - - [ ] 
https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207517701081[Practical Time Series Analysis] - -==== Machine Learning - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170205873301081[Machine Learning for Hackers] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/ufs51j/alma99169166706401081[The Elements of Stat Learning] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170611498701081[Intro to Statistical Learning with Applications Python] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170221019001081[Hands on Machine Learning] - - [ ] Machine Learning - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170208241901081[Machine Learning Design Patterns] - - [ ] AI + ML for coders - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207731001081[Building Machine Learning Powered Applications] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207425001081[Real World Machine Learning] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170208165901081[Building Machine Learning Pipelines] - - [ ] Reinforcement Learning - -==== Trees - - - [ ] https://xgboost.readthedocs.io/en/latest/tutorials/model.html[XGBoost Documentation] - -==== NLP - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170255082801081[Natural Language Processing with Transformers] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170208410301081[Practical Natural Language Processing] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170475945101081[Natural Language Processing with PyTorch] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170449318701081[GPT-3] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170208100101081[Natural Language Processing with Spark NLP] - -==== GAMS: Generalized Additive Models - - - [ ] https://multithreaded.stitchfix.com/blog/2015/07/30/gam/[GAM: The Predictive Modeling Silver Bullet] - -==== Neural Networks - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207647701081[Strengthening Deep Neural Networks] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170253257501081[Fundamentals of Deep Learning] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/ufs51j/alma99170208650601081[Deep Learning] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170491905401081[Generative Deep Learning] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207503001081[Deep Learning From Sratch] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207656001081[Deep Learning Cookbook] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170208550801081[Deep Learning For Coders] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207842401081[Grokking Deep Learning] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207842801081[Deep Learning and the Game of Go] - - [ ] 
https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170208150901081[TensorFlow for Deep Learning] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207199401081[Learning TensorFlow] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207722701081[Practical Deep Learning for Cloud, Mobile and Edge] - -==== Optimization - - - [ ] https://developers.google.com/optimization/[OR Tools Optimization] - -=== Specific Subject Analysis - -==== Sports - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170208100101081[Baseball Hacks] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170228353501081[Applied Sport Business Analytics] - -==== Biology, Bioinformatics, Forestry - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/ufs51j/alma99169166043401081[Statistical Methods in Bioinformatics] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170205126601081[Developing Bioinformatics Computer Skills] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99169000780101081[Bioinformatics Data Skills] - - [ ] Blast - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99169513474601081[Modern Statistics for Modern Biology] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207834501081[Deep Learning for Life Sciences] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99169166830401081[Forest Analytics with R] - -== Gathering Data - -=== Data Mining - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170204621301081[Programming Collective Intelligence] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/ufs51j/alma99170207598501081[Mining the Social Web] - -.Data Engineering - -== General - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207781801081[97 Things Every Cloud Engineer Should Know] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207781801081[97 Things Every Data Engineer Should Know] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207439201081[Foundations for Architecting Data Solutions] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170208687401081[Building Secure and Reliable Systems] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170581389701081[Designing Data Intensive Applications] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207461001081[97 Things Every Engineering Manager Should Know] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207617901081[The Enterprise Big Data Lake] - -== Platforms - -=== Spark - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207345901081[Spark The Definitive Guide] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207166701081[High Performance Spark] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207755301081[Stream Processing with Apache Spark] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207418901081[Advanced Analytics With Spark] - - [ ] 
https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99343455833501082[Learning Spark] - -=== Azure - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207387701081[Mastering Azure Analytics] - -=== Hive - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170206020801081[Programming Hive] - -=== Hadoop - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/ufs51j/alma99170206763101081[Hadoop The Definitive Guide] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170205835001081[Hadoop Application Architectures] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/ufs51j/alma99170206913701081[Hadoop in Practice] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207135401081[Data Analytics With Hadoop] - -=== AWS - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170208041501081[AWS Cookbook] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170208410701081[Migrating to AWS: A Managers Guide] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170208683001081[Data Science on AWS] - -=== MapReduce - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170205804101081[Mapreduce Design Patterns] - -=== Kafka - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207580701081[Mastering Kafka Streams] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207589601081[Architecting Modern Data Platforms] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207596101081[Kafka: The Definitive Guide] - -== Containers - -=== Docker - - - [ ] https://docs.docker.com[Docker Documentation] - -=== Kubernetes - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207599301081[Kubernetes Operators] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207876701081[Production Kubernetes] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170208381501081[Kubernetes Best Practices] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170468680201081[Kubernetes Patterns] - - [ ] https://www.cncf.io/phippy/the-childrens-illustrated-guide-to-kubernetes/[Children's Guide to Kubernetes] - -.Methodology - -== Productivity - - - [ ] https://knowledge.wharton.upenn.edu/article/deep-work-the-secret-to-achieving-peak-productivity[Deep Work: Rules for Focused Success in a Distracted World] - - [ ] https://expeed.com/blog-posts/the-importance-of-defining-a-research-goal-in-a-data-science-project/[The Importance of Defining a Research Goal in a Data Science Project] - - [ ] http://www.datasciencepublicpolicy.org/our-work/tools-guides/data-science-project-scoping-guide/[Data Science Project Scoping Guide] -== Agile - - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207163201081[Agile Data Science 2.0] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207329701081[Agile for Everybody] - - [ ] https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207564001081[97 Things Every Scrum Practicioner Should Know] - - [ ] 
https://purdue.primo.exlibrisgroup.com/permalink/01PURDUE_PUWL/uc5e95/alma99170207168701081[Learning Agile] - - [ ] Agile project management - - [ ] Agile practice guide - -== Data Ethics - - - [ ] 97 Things about ethics everyone should know - - [ ] https://www.npr.org/2016/09/12/493654950/weapons-of-math-destruction-outlines-dangers-of-relying-on-data-analytics[Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy] - - [ ] https://blogs.lse.ac.uk/medialse/2016/02/05/bittersweet-mysteries-of-machine-learning-a-provocation/[The Black Box Society] - -== Devops - - - [ ] Intro to devops with chocolate, lego - -== Incorporating Diverse Backgrounds - - - [ ] Asked and Answered by Pamela E. Harris and Aris Winger (2020) - - [ ] Practices and Policies by Pamela E. Harris and Aris Winger (2021) - - [ ] Read and Rectify by Pamela E. Harris and Aris Winger (2022) - - [ ] Testimonios by Pamela E. Harris, Alicia Prieto-Langarica, Vanessa Rivera Quiñones, Luis Sordo Vieira, Rosaura Uscanga, and Andrés R. Vindas Meléndez - - [ ] Unleash Different by Rich Donovan (2018) - - [ ] https://data.org/news/why-how-and-what-of-data-science-for-social-impact/[Why, How, and What of Data SCience for Social Impact] - -== Psychology - - - [ ] https://medium.com/12minapp/quiet-the-power-of-introverts-book-summary-bb213ddd9b6d[Quiet: The Power of Introverts in a World That Can't Stop Talking] - -== Version Control - -=== SVN/Subversion - - - [ ] Version Control with Subversion - -=== Git/Github - - - [ ] Learn git in a month of lunches - - [ ] Building tools with Github - - [ ] Git for Teams - - [ ] Version Control with Git - -.Miscellaneous Tools - -== Raspberry Pi - - - [ ] Raspberry Pi cookbook - -== Open Source - - - [ ] Data analysis with open source tools - -== Command Line - - - [ ] Data science at the command line - -== Unix - -=== GNU - - - [ ] Learning GNU Emacs - -=== Tools - - - [ ] Flex and Bison diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/linearPYPlan.pdf b/projects-appendix/modules/ROOT/pages/spring2024/logistics/linearPYPlan.pdf deleted file mode 100644 index 09d03a59e..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/logistics/linearPYPlan.pdf and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/linearPythonPlan.pdf b/projects-appendix/modules/ROOT/pages/spring2024/logistics/linearPythonPlan.pdf deleted file mode 100644 index b2fa7cd25..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2024/logistics/linearPythonPlan.pdf and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/office_hours.adoc b/projects-appendix/modules/ROOT/pages/spring2024/logistics/office_hours.adoc deleted file mode 100644 index 85e2d06ab..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/logistics/office_hours.adoc +++ /dev/null @@ -1,11 +0,0 @@ -= TA Office Hours - -Please select the level you are in to see the office hours schedule for Spring 2024. 
- -xref:spring2024/office_hours_102.adoc[[.custom_button]#TDM 102#] - -xref:spring2024/office_hours_202.adoc[[.custom_button]#TDM 202#] - -xref:spring2024/office_hours_302.adoc[[.custom_button]#TDM 302#] - -xref:spring2024/office_hours_402.adoc[[.custom_button]#TDM 402#] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/office_hours_102.adoc b/projects-appendix/modules/ROOT/pages/spring2024/logistics/office_hours_102.adoc deleted file mode 100644 index 37ad5082e..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/logistics/office_hours_102.adoc +++ /dev/null @@ -1,23 +0,0 @@ -= TA Office Hours - TDM 10200 - -Office hours locations: - -- **Office hours _before_ 5:00 PM EST:** Hillenbrand Hall Lobby C100 and Online in Zoom -- **Office hours _after_ 5:00 PM EST:** Online in Zoom -- **Office hours on _Sunday_:** Online in Zoom - -[IMPORTANT] -==== -*Zoom Link for Office Hours*: https://purdue-edu.zoom.us/s/97774213087 - -- When joining office hours, please include your Data Mine level in front of your name. For example, if you are in TDM 102, your name should be entered as “102 - [Your First Name] [Your Last Name]”. - -- After joining the Zoom call, please stay in the main room until a TA invites you to a specific breakout room. -==== - -[NOTE] -==== -You can find a list of TDM 102 T.A.s on the xref:spring2024/102_TAs.adoc[*T.A. Teams*] page. -==== - -image::temp_102.png[TDM 101 Office Hours] diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/office_hours_202.adoc b/projects-appendix/modules/ROOT/pages/spring2024/logistics/office_hours_202.adoc deleted file mode 100644 index 752921400..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/logistics/office_hours_202.adoc +++ /dev/null @@ -1,23 +0,0 @@ -= TA Office Hours - TDM 20200 - -Office hours locations: - -- **Office hours _before_ 5:00 PM EST:** Hillenbrand Hall Lobby C100 and Online in Zoom -- **Office hours _after_ 5:00 PM EST:** Online in Zoom -- **Office hours on _Sunday_:** Online in Zoom - -[IMPORTANT] -==== -*Zoom Link for Office Hours*: https://purdue-edu.zoom.us/s/97774213087 - -- When joining office hours, please include your Data Mine level in front of your name. For example, if you are in TDM 102, your name should be entered as “102 - [Your First Name] [Your Last Name]”. - -- After joining the Zoom call, please stay in the main room until a TA invites you to a specific breakout room. -==== - -[NOTE] -==== -You can find a list of TDM 202 T.A.s on the xref:spring2024/202_TAs.adoc[*T.A. Teams*] page. -==== - -image::office_hours_202.png[TDM 202 Office Hours] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/office_hours_302.adoc b/projects-appendix/modules/ROOT/pages/spring2024/logistics/office_hours_302.adoc deleted file mode 100644 index 2d50e73b2..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/logistics/office_hours_302.adoc +++ /dev/null @@ -1,23 +0,0 @@ -= TA Office Hours - TDM 30200 - -Office hours locations: - -- **Office hours _before_ 5:00 PM EST:** Hillenbrand Hall Lobby C100 and Online in Zoom -- **Office hours _after_ 5:00 PM EST:** Online in Zoom -- **Office hours on _Sunday_:** Online in Zoom - -[IMPORTANT] -==== -*Zoom Link for Office Hours*: https://purdue-edu.zoom.us/s/97774213087 - -- When joining office hours, please include your Data Mine level in front of your name. 
For example, if you are in TDM 102, your name should be entered as “102 - [Your First Name] [Your Last Name]”. - -- After joining the Zoom call, please stay in the main room until a TA invites you to a specific breakout room. -==== - -[NOTE] -==== -You can find a list of TDM 302 T.A.s on the xref:spring2024/302_TAs.adoc[*T.A. Teams*] page. -==== - -image::office_hours_302.png[TDM 302 Office Hours] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/office_hours_402.adoc b/projects-appendix/modules/ROOT/pages/spring2024/logistics/office_hours_402.adoc deleted file mode 100644 index 1c0f27c98..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/logistics/office_hours_402.adoc +++ /dev/null @@ -1,28 +0,0 @@ -= TA Office Hours - TDM 40200 - -Office hours locations: - -- **Office hours _before_ 5:00 PM EST:** Hillenbrand Hall Lobby C100 and Online in Zoom -- **Office hours _after_ 5:00 PM EST:** Online in Zoom -- **Office hours on _Sunday_:** Online in Zoom - -[IMPORTANT] -==== -*Zoom Link for Office Hours*: https://purdue-edu.zoom.us/s/97774213087 - -- When joining office hours, please include your Data Mine level in front of your name. For example, if you are in TDM 102, your name should be entered as “102 - [Your First Name] [Your Last Name]”. - -- After joining the Zoom call, please stay in the main room until a TA invites you to a specific breakout room. -==== - -[NOTE] -==== -You can find a list of TDM 402 T.A.s on the xref:spring2024/402_TAs.adoc[*T.A. Teams*] page. -==== - -[NOTE] -==== -Jackson's office hours will be held online regularly, but can be made in-person by request only. Please send an email to fairj@purdue.edu to request an in-person meeting. -==== - -image::office_hours_402.png[TDM 402 Office Hours] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/schedule.adoc b/projects-appendix/modules/ROOT/pages/spring2024/logistics/schedule.adoc deleted file mode 100644 index dedd9daa5..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/logistics/schedule.adoc +++ /dev/null @@ -1,28 +0,0 @@ -= Course Schedule - Spring 2024 - -Below is the course schedule and includes release and due dates for syllabus and academic integrity quizzes, weekly projects, and outside events. 
- -[%header,format=csv] -|=== -Project,Release date,Due date -Syllabus Quiz,Jan 8,Jan 19 -Academic Integrity Quiz,Jan 8,Jan 19 -Project 1,Jan 8,Jan 19 -Project 2,Jan 11,Jan 26 -Project 3,Jan 25,Feb 2 -Outside Event 1,Jan 8,Feb 2 -Project 4,Feb 1,Feb 9 -Project 5,Feb 8,Feb 16 -Project 6,Feb 15,Feb 23 -Project 7,Feb 22,Mar 1 -Outside Event 2,Jan 8, Mar 1 -Project 8,Feb 29,Mar 8 -Project 9,Mar 7,Mar 22 -Project 10,Mar 21,Mar 29 -Project 11,Mar 28,Apr 5 -Project 12,Apr 4,Apr 12 -Outside Event 3,Jan 8,Apr 12 -Project 13,Apr 11,Apr 19 -Project 14,Apr 18,Apr 26 - -|=== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/syllabus.adoc b/projects-appendix/modules/ROOT/pages/spring2024/logistics/syllabus.adoc deleted file mode 100644 index 1558db32e..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/logistics/syllabus.adoc +++ /dev/null @@ -1,287 +0,0 @@ -= Spring 2024 Syllabus - The Data Mine Seminar - -== Course Information - - -[%header,format=csv,stripes=even] -|=== -Course Number and Title, CRN -TDM 10200 - The Data Mine II, possible CRNs 19799 or 19803 or 19810 or 19841 -TDM 20200 - The Data Mine IV, possible CRNs 19800 or 19805 or 19811 or 19842 -TDM 30200 - The Data Mine VI, possible CRNs 19801 or 19807 or 19817 or 19843 -TDM 40200 - The Data Mine VIII, possible CRNs 19802 or 19808 or 19821 or 19859 -TDM 50100 - The Data Mine Seminar, possible CRNs 19997 or 20006 or 20007 or 20010 -|=== - -*Course credit hours:* 1 credit hour, so you should expect to spend about 3 hours per week doing work for the class - -*Prerequisites:* -TDM 10100 and TDM 10200 can be taken in either order. Both of these courses are introductory. TDM 10100 is an introduction to data analysis in R. TDM 10200 is an introduction to data analysis in Python. - -For all of the remaining TDM seminar courses, students are expected to take the courses in order (with a passing grade), namely, TDM 20100, 20200, 30100, 30200, 40100, 40200. The topics in these courses build on the knowledge from the previous courses. All students, regardless of background are welcome. TDM 50100 is geared toward graduate students and can be taken repeatedly; TDM 50100 meets concurrently with the other courses, at whichever level is appropriate for the graduate students in the course. We can make adjustments on an individual basis if needed. - -=== Course Web Pages - -- link:https://the-examples-book.com/[*The Examples Book*] - All information will be posted within -- link:https://www.gradescope.com/[*Gradescope*] - All projects and outside events will be submitted on Gradescope -- link:https://purdue.brightspace.com/[*Brightspace*] - Grades will be posted in Brightspace. Students will also take the quizzes at the beginning of the semester on Brightspace -- link:https://datamine.purdue.edu[*The Data Mine's website*] - helpful resource -- link:https://ondemand.anvil.rcac.purdue.edu/[*Jupyter Lab via the On Demand Gateway on Anvil*] - -=== Meeting Times -There are officially 4 Monday class times: 8:30 am, 9:30 am, 10:30 am (all in the Hillenbrand Dining Court atrium--no meal swipe required), and 4:30 pm (link:https://purdue-edu.zoom.us/my/mdward[synchronous online], recorded and posted later; This online meeting is also available to students participating in Seminar from other universities outside of Purdue). 
All the information you need to work on the projects each week will be provided online on the Thursday of the previous week, and we encourage you to get a head start on the projects before class time. Dr. Ward does not lecture during the class meetings. Instead, the seminar time is a good time to ask questions and get help from Dr. Ward, the T.A.s, and your classmates. Attendance is not required. The T.A.s will have many daytime and evening office hours throughout the week. - -=== Course Description - -The Data Mine is a supportive environment for students in any major and from any background who want to learn some data science skills. Students will have hands-on experience with computational tools for representing, extracting, manipulating, interpreting, transforming, and visualizing data, especially big data sets, and in effectively communicating insights about data. Topics include: the R environment, Python, visualizing data, UNIX, bash, regular expressions, SQL, XML and scraping data from the internet, as well as selected advanced topics, as time permits. - -=== Learning Outcomes - -By the end of the course, you will be able to: - -1. Discover data science and professional development opportunities in order to prepare for a career. -2. Explain the difference between research computing and basic personal computing data science capabilities in order to know which system is appropriate for a data science project. -3. Design efficient search strategies in order to acquire new data science skills. -4. Devise the most appropriate data science strategy in order to answer a research question. -5. Apply data science techniques in order to answer a research question about a big data set. - -=== Mapping to Foundational Learning Outcome (FLO) = Information Literacy - -Note: The Data Mine has applied for the course seminar to satisfy the information literacy outcome, but this request is still under review by the university. This request has not yet been approved. - -1. *Identify a line of inquiry that requires information, including formulating questions and determining the scope of the investigation.* In each of the 14 weekly projects, the scope is described at a high level at the very top of the project. Students are expected to tie their analysis of the individual weekly questions back to the stated scope. As an example of the stated scope in a project: `Understanding how to use Pandas and be able to develop functions allows for a systematic approach to analyzing data.` In this project, students will already be familiar with Pandas but will not (yet) know at the outset how to "develop functions" and take a "systematic approach" to solving the questions. Students are expected to comment on each question about how their "line of inquiry" and "formulation of the question" tie back to the stated scope of the project. As the seminar progresses past the first few weeks, and the students are being asked to tackle more complex problems, they need to identify which Python, SQL, R, and UNIX tools to use, and which statements and queries to run (this is "formulating the questions"), in order to analyze the data, derive the results, and summarize the results in writing and visualizations ("determining the scope of the investigation"). -2. *Locate information using effective search strategies and relevant information sources.* The Data Mine seminar progresses by increasing the complexity of the problems. The students are being asked to solve complex problems using data science tools. 
Students need to "locate information" within technical documentation, API documentation, online manuals, online discussions such as Stack Overflow, etc. Within these online resources, they need to determine the "relevant information sources" and apply these sources to solve the data analysis problem at hand. They need to understand the context, motivation, technical notation, nomenclature of the tools, etc. We enable students to practice this skill on every weekly project during the semester, and we provide additional resources, such as Piazza (an online discussion platform to interact with peers, teaching assistants, and the instructor), office hours throughout the week, and attending in-person or virtual seminar, for interaction directly with the instructor. -3. *Evaluate the credibility of information.* The students work toward this objective in several ways. They need to evaluate and analyze the "credibility of information" and data from a wide array of resources, e.g., from the federal government, from Kaggle, from online repositories and archives, etc. Each project during the semester focuses attention on a large data repository, and the students need to understand the credible data, the missing data, the inaccurate data, the data that are outliers, etc. Some of the projects for students involve data cleansing efforts, data imputation, data standardization, etc. Students also need to validate, verify, determine any missing data, understand variables, correlation, contextual information, and produce models and data visualizations from the data under consideration. -4. *Synthesize and organize information from different sources in order to communicate.* This is a key aspect of The Data Mine. In many of the student projects, they need to assimilate geospatial data, categorical and numerical data, textual data, and visualizations, in order to have a comprehensive data analysis of a system or a model. The students can use help from Piazza, office hours, the videos from the instructor and seminar live sessions to synthesize and organize the information they are learning about, in each project. The students often need to also understand many different types of tools and aspects of data analysis, sometimes in the same project, e.g., APIs, data dictionaries, functions, concepts from software engineering such as scoping, encapsulation, containerization, and concepts from spatial and temporal analysis. Synthesizing many "different sources" to derive and "communicate" the analysis is a key aspect of the projects. -5. *Attribute original ideas of others through proper citing, referencing, paraphrasing, summarizing, and quoting.* In every project, students need to use "citations to sources" (online and written), "referencing" forums and blogs where their cutting-edge concepts are "documented", proper methods of "quotation" and "citation", documentation of any teamwork, etc. The students have a template for their project submissions in which they are required to provide the proper citation of any sources, collaborations, reference materials, etc., in each and every project that they submit every week. -6. *Recognize relevant cultural and other contextual factors when using information.* Students' weekly projects include data and information about all types of genders, political data, geospatial questions, online forums and rating schema, textual data, information about books, music, online repositories, etc.
Students need to understand not only the data analysis but also the "context" in which the data is provided, the data sources, the potential usage of the analysis and its "cultural" implications, etc. Students also complete professional development, attending several professional development and outside-the-classroom events each semester. They meet with alumni, business professionals, data practitioners, data engineers, managers, scientists from national labs, etc. They attend events about the "culture related to data science", and "multicultural events". Students are required to respond in writing to every such event, and their writing is graded and incorporated into the grades for the course. -7. *Observe ethical and legal guidelines and requirements for the use of published, confidential, and/or proprietary information.* Students complete an academic integrity quiz at the beginning of each semester that sets the stage for these "ethical and legal guidelines and requirements". They have documentation about proper data handling and data management techniques. They learn about the context of data usage, including (for instance) copyrights, the difference between open source and proprietary data, different types of software licenses, the need for confidentiality with Corporate Partners projects, etc. - - -=== Assessment of Foundational Learning Outcome (FLO) = Information Literacy - -Note: The Data Mine has applied for the course seminar to satisfy the information literacy outcome, but this request is still under review by the university. This request has not yet been approved. - -1. *Assessment method for this course.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. -2. *Identify a line of inquiry that requires information, including formulating questions and determining the scope of the investigation.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. Students identify which R and Python statements and queries to run (this is formulating the questions), in order to get to the results they think they are looking for (determining the scope of the investigation). -3. *Locate information using effective search strategies and relevant information sources.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. The students are being asked to solve complex problems using data science tools. They need to determine what they are trying to find out, and to do that they need to figure out what questions to ask. -4. *Evaluate the credibility of information.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided.
Some of the projects that students complete in the course involve data cleansing efforts including validation, verification, missing data, and modeling, and students must evaluate the credibility of the data as they move through the project. -5. *Synthesize and organize information from different sources in order to communicate.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. Information on how to complete the projects is learned through many sources and students utilize an experiential learning model. -6. *Attribute original ideas of others through proper citing, referencing, paraphrasing, summarizing, and quoting.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. At the beginning of each project there is a question regarding citations for the project. -7. *Recognize relevant cultural and other contextual factors when using information.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. For professional development event assessment – students are required to attend three approved events and then write a guided summary of the event. -8. *Observe ethical and legal guidelines and requirements for the use of published, confidential, and/or proprietary information.* Students complete an academic integrity quiz at the beginning of each semester, and they are also graded on their proper documentation and usage of data throughout the semester, on every weekly project. - - - -=== Required Materials - -* A laptop so that you can easily work with others. Having audio/video capabilities is useful. -* Brightspace course page. -* Access to Jupyter Lab at the On Demand Gateway on Anvil: -https://ondemand.anvil.rcac.purdue.edu/ -* "The Examples Book": https://the-examples-book.com -* Good internet connection. - - - -=== Attendance Policy - -When conflicts or absences can be anticipated, such as for many University-sponsored activities and religious observations, the student should inform the instructor of the situation as far in advance as possible. - -For unanticipated or emergency absences when advance notification to the instructor is not possible, the student should contact the instructor as soon as possible by email or phone. When the student is unable to make direct contact with the instructor and is unable to leave word with the instructor’s department because of circumstances beyond the student’s control, and in cases falling under excused absence regulations, the student or the student’s representative should contact or go to the Office of the Dean of Students website to complete appropriate forms for instructor notification. Under academic regulations, excused absences may be granted for cases of grief/bereavement, military service, jury duty, parenting leave, and medical excuse.
For details, see the link:https://catalog.purdue.edu/content.php?catoid=13&navoid=15965#a-attendance[Academic Regulations & Student Conduct section] of the University Catalog website. - -== How to succeed in this course - -If you would like to be a successful Data Mine student: - -* Start on the weekly projects on or before Mondays so that you have plenty of time to get help from your classmates, TAs, and Data Mine staff. Don't wait until the due date to start! -* Be excited to challenge yourself and learn impressive new skills. Don't get discouraged if something is difficult--you're here because you want to learn, not because you already know everything! -* Remember that Data Mine staff and TAs are excited to work with you! Take advantage of us as resources. -* Network! Get to know your classmates, even if you don't see them in an actual classroom. You are all part of The Data Mine because you share interests and goals. You have over 800 potential new friends! -* Use "The Examples Book" with lots of explanations and examples to get you started. Google, Stack Overflow, etc. are all great, but "The Examples Book" has been carefully put together to be the most useful to you. https://the-examples-book.com -* Expect to spend approximately 3 hours per week on the projects. Some might take less time, and occasionally some might take more. -* Don't forget about the syllabus quiz, academic integrity quiz, and outside event reflections. They all contribute to your grade and are part of the course for a reason. -* If you get behind or feel overwhelmed about this course or anything else, please talk to us! -* Stay on top of deadlines. Announcements will also be sent out every Monday morning, but you -should keep a copy of the course schedule where you see it easily. -* Read your emails! - -== Information about the Instructors - -=== The Data Mine Staff - -[%header,format=csv] -|=== -Name, Title -Shared email we all read, datamine-help@purdue.edu -Kevin Amstutz, Senior Data Scientist -Donald Barnes, Guest Relations Administrator -Maggie Betz, Managing Director of Corporate Partnerships -Kimmie Casale, ASL Tutor -Cai Chen, Corporate Partners Technical Specialist -Doug Crabill, Senior Data Scientist -Lauren Dalder, Corporate Partners Advisor -Stacey Dunderman, Program Administration Specialist -David Glass, Managing Director of Data Science -Betsy Hillery, Business Development Administrator -Emily Hoeing, Corporate Partners Advisor -Jessica Jud, Senior Manager of Expansion Operations -Kali Lacy, Associate Research Engineer -Gloria Lenfestey, Research Development Administrator -Nicholas Lenfestey, Corporate Partners Technical Specialist -Naomi Mersinger, ASL Interpreter / Strategic Initiatives Coordinator -Kim Rechkemmer, Senior Program Administration Specialist -Nick Rosenorn, Corporate Partners Technical Specialist -Katie Sanders, Operations Manager -Betsy Satchell, Senior Administrative Assistant -Dr. Rebecca Sharples, Managing Director of Academic Programs and Outreach -Dr. Mark Daniel Ward, Director -Josh Winchester, Data Science Technical Specialist -Cindy Zhou, Senior Data Science Instructional Specialist - -|=== - -The Data Mine Team uses a shared email which functions as a ticketing system. Using a shared email helps the team manage the influx of questions, better distribute questions across the team, and send out faster responses. - -You can use the link:https://piazza.com/[Piazza forum] to get in touch. In particular, Dr. Ward responds to questions on Piazza faster than by email. 
- -=== Communication Guidance - -* *For questions about how to do the homework, use Piazza or visit office hours*. You will receive the fastest email by using Piazza versus emailing us. -* For general Data Mine questions, email datamine-help@purdue.edu -* For regrade requests, use Gradescope's regrade feature within Brightspace. Regrades should be -requested within 1 week of the grade being posted. - - -=== Office Hours - -The xref:spring2024/logistics/office_hours.adoc[office hours schedule is posted here.] - -Office hours are held in person in Hillenbrand lobby and on Zoom. Check the schedule to see the available schedule. - -=== Piazza - -Piazza is an online discussion board where students can post questions at any time, and Data Mine staff or T.A.s will respond. Piazza is available through Brightspace. There are private and public postings. Last year we had over 11,000 interactions on Piazza, and the typical response time was around 5-10 minutes. - - -== Assignments and Grades - - -=== Course Schedule & Due Dates - -xref:spring2024/logistics/schedule.adoc[Click here to view the Spring 2024 Course Schedule] - -See the schedule and later parts of the syllabus for more details, but here is an overview of how the course works: - -In the first week of the beginning of the semester, you will have some "housekeeping" tasks to do, which include taking the Syllabus quiz and Academic Integrity quiz. - -Generally, every week from the very beginning of the semester, you will have your new projects released on a Thursday, and they are due 8 days later on the following Friday at 11:59 pm Purdue West Lafayette (Eastern) time. This semester, there are 14 weekly projects, but we only count your best 10. This means you could miss up to 4 projects due to illness or other reasons, and it won't hurt your grade. - -We suggest trying to do as many projects as possible so that you can keep up with the material. The projects are much less stressful if they aren't done at the last minute, and it is possible that our systems will be stressed if you wait until Friday night causing unexpected behavior and long wait times. *Try to start your projects on or before Monday each week to leave yourself time to ask questions.* - -Outside of projects, you will also complete 3 Outside Event reflections. More information about these is in the "Outside Event Reflections" section below. - -The Data Mine does not conduct or collect an assessment during the final exam period. Therefore, TDM Courses are not required to follow the Quiet Period in the link:https://catalog.purdue.edu/content.php?catoid=16&navoid=20089[Academic Calendar]. - -=== Projects - -* The projects will help you achieve Learning Outcomes #2-5. -* Each weekly programming project is worth 10 points. -* There will be 14 projects available over the semester, and your best 10 will count. -* The 4 project grades that are dropped could be from illnesses, absences, travel, family -emergencies, or simply low scores. No excuses necessary. -* No late work will be accepted, even if you are having technical difficulties, so do not work at the -last minute. -* There are many opportunities to get help throughout the week, either through Piazza or office -hours. We're waiting for you! Ask questions! -* Follow the instructions for how to submit your projects properly through Gradescope in -Brightspace. -* It is ok to get help from others or online, although it is important to document this help in the -comment sections of your project submission. 
You need to say who helped you and how they -helped you. -* Each week, the project will be posted on the Thursday before the seminar, the project will be -the topic of the seminar and any office hours that week, and then the project will be due by -11:55 pm Eastern time on the following Friday. See the schedule for specific dates. -* If you need to request a regrade on any part of your project, use the regrade request feature -inside Gradescope. The regrade request needs to be submitted within one week of the grade being posted (we send an announcement about this). - - -=== Outside Event Reflections - -* The Outside Event reflections will help you achieve Learning Outcome #1. They are an opportunity for you to learn more about data science applications, career development, and diversity. -* Throughout the semester, The Data Mine will have many special events and speakers, typically happening in person so you can interact with the presenter, but some may be online and possibly recorded. -* These eligible opportunities will be posted on The Data Mine's website (https://datamine.purdue.edu/events/) and updated frequently. Feel free to suggest good events that you hear about, too. -* You are required to attend 3 of these over the semester, with 1 due each month. See the schedule for specific due dates. -* You are welcome to do all 3 reflections early. For example, you could submit all 3 reflections in September. -* You must submit your outside event reflection within 1 week of attending the event or watching the recording. -* Follow the instructions on Brightspace for writing and submitting these reflections. -* At least one of these events should be on the topic of Professional Development. These -events will be designated by "PD" next to the event on the schedule. -* This semester you will answer questions directly in Gradescope including the name of the event and speaker, the time and date of the event, what was discussed at the event, what you learned from it, what new ideas you would like to explore as a result of what you learned at the event, and what question(s) you would like to ask the presenter if you met them at an after-presentation reception. This should not be just a list of notes you took from the event--it is a reflection. -* We read every single reflection! We care about what you write! We have used these connections to provide new opportunities for you, to thank our speakers, and to learn more about what interests you. - -=== Late Policy - -We generally do NOT accept late work. For the projects, we count only your best 10 out of 14, so that gives you a lot of flexibility. We need to be able to post answer keys for the rest of the class in a timely manner, and we can't do this if we are waiting for other students to turn their work in. - -=== Grade Distribution - -[cols="4,1"] -|=== - -|Projects (best 10 out of Projects #1-14) |86% -|Outside event reflections (3 total) |12% -|Academic Integrity Quiz |1% -|Syllabus Quiz |1% -|*Total* |*100%* - -|=== - -=== Grading Scale -In this class grades reflect your achievement throughout the semester in the various course components listed above. Your grades will be maintained in Brightspace. This course will follow the 90-80-70-60 grading scale for A, B, C, D cut-offs. If you earn a 90.000 in the class, for example, that is a solid A. +/- grades will be given at the instructor's discretion below these cut-offs. If you earn an 89.11 in the class, for example, this may be an A- or a B+. 
- -* A: 100.000% - 90.000% -* B: 89.999% - 80.000% -* C: 79.999% - 70.000% -* D: 69.999% - 60.000% -* F: 59.999% - 0.000% - -=== Academic Integrity - -Academic integrity is one of the highest values that Purdue University holds. Individuals are encouraged to alert university officials to potential breaches of this value by either link:mailto:integrity@purdue.edu[emailing] or by calling 765-494-8778. While information may be submitted anonymously, the more information that is submitted provides the greatest opportunity for the university to investigate the concern. - -In TDM 10200/20200/30200/40200/50100, we encourage students to work together. However, there is a difference between good collaboration and academic misconduct. We expect you to read over this list, and you will be held responsible for violating these rules. We are serious about protecting the hard-working students in this course. We want a grade for The Data Mine seminar to have value for everyone and to represent what you truly know. We may punish both the student who cheats and the student who allows or enables another student to cheat. Punishment could include receiving a 0 on a project, receiving an F for the course, and incidents of academic misconduct reported to the Office of The Dean of Students. - -*Good Collaboration:* - -* First try the project yourself, on your own. -* After trying the project yourself, then get together with a small group of other students who -have also tried the project themselves to discuss ideas for how to do the more difficult problems. Document in the comments section any suggestions you took from your classmates or your TA. -* Finish the project on your own so that what you turn in truly represents your own understanding of the material. -* Look up potential solutions for how to do part of the project online, but document in the comments section where you found the information. -* If the assignment involves writing a long, worded explanation, you may proofread somebody's completed written work and allow them to proofread your work. Do this only after you have both completed your own assignments, though. - -*Academic Misconduct:* - -* Divide up the problems among a group. (You do #1, I'll do #2, and he'll do #3: then we'll share our work to get the assignment done more quickly.) -* Attend a group work session without having first worked all of the problems yourself. -* Allowing your partners to do all of the work while you copy answers down, or allowing an -unprepared partner to copy your answers. -* Letting another student copy your work or doing the work for them. -* Sharing files or typing on somebody else's computer or in their computing account. -* Getting help from a classmate or a TA without documenting that help in the comments section. -* Looking up a potential solution online without documenting that help in the comments section. -* Reading someone else's answers before you have completed your work. -* Have a tutor or TA work though all (or some) of your problems for you. -* Uploading, downloading, or using old course materials from Course Hero, Chegg, or similar sites. -* Using the same outside event reflection (or parts of it) more than once. Using an outside event reflection from a previous semester. -* Using somebody else's outside event reflection rather than attending the event yourself. - -The link:https://www.purdue.edu/odos/osrr/honor-pledge/about.html[Purdue Honor Pledge] "As a boilermaker pursuing academic excellence, I pledge to be honest and true in all that I do. 
Accountable together - we are Purdue" - -Please refer to the link:https://www.purdue.edu/odos/osrr/academic-integrity/index.html[student guide for academic integrity] for more details. - -=== xref:spring2024/logistics/syllabus_purdue_policies.adoc[Purdue Policies & Resources] - -=== Disclaimer -This syllabus is subject to small changes. All questions and feedback are always welcome! diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/syllabus_purdue_policies.adoc b/projects-appendix/modules/ROOT/pages/spring2024/logistics/syllabus_purdue_policies.adoc deleted file mode 100644 index 2ed603182..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/logistics/syllabus_purdue_policies.adoc +++ /dev/null @@ -1,88 +0,0 @@ -== Purdue Policies & Resources - -=== Class Behavior - -You are expected to behave in a way that promotes a welcoming, inclusive, productive learning environment. You need to be prepared for your individual and group work each week, and you need to include everybody in your group in any discussions. Respond promptly to all communications and show up for any appointments that are scheduled. If your group is having trouble working well together, try hard to talk through the difficulties--this is an important skill to have for future professional experiences. If you are still having difficulties, ask The Data Mine staff to meet with your group. - - -*Purdue's Copyrighted Materials Policy:* - -Among the materials that may be protected by copyright law are the lectures, notes, and other material presented in class or as part of the course. Always assume the materials presented by an instructor are protected by copyright unless the instructor has stated otherwise. Students enrolled in, and authorized visitors to, Purdue University courses are permitted to take notes, which they may use for individual/group study or for other non-commercial purposes reasonably arising from enrollment in the course or the University generally. -Notes taken in class are, however, generally considered to be "derivative works" of the instructor's presentations and materials, and they are thus subject to the instructor's copyright in such presentations and materials. No individual is permitted to sell or otherwise barter notes, either to other students or to any commercial concern, for a course without the express written permission of the course instructor. To obtain permission to sell or barter notes, the individual wishing to sell or barter the notes must be registered in the course or must be an approved visitor to the class. Course instructors may choose to grant or not grant such permission at their own discretion, and may require a review of the notes prior to their being sold or bartered. If they do grant such permission, they may revoke it at any time, if they so choose. - -=== Nondiscrimination Statement -Purdue University is committed to maintaining a community which recognizes and values the inherent worth and dignity of every person; fosters tolerance, sensitivity, understanding, and mutual respect among its members; and encourages each individual to strive to reach his or her own potential. In pursuit of its goal of academic excellence, the University seeks to develop and nurture diversity. The University believes that diversity among its many members strengthens the institution, stimulates creativity, promotes the exchange of ideas, and enriches campus life. 
link:https://www.purdue.edu/purdue/ea_eou_statement.php[Link to Purdue's nondiscrimination policy statement.] - -=== Students with Disabilities -Purdue University strives to make learning experiences as accessible as possible. If you anticipate or experience physical or academic barriers based on disability, you are welcome to let me know so that we can discuss options. You are also encouraged to contact the Disability Resource Center at: link:mailto:drc@purdue.edu[drc@purdue.edu] or by phone: 765-494-1247. - -If you have been certified by the Office of the Dean of Students as someone needing a course adaptation or accommodation because of a disability OR if you need special arrangements in case the building must be evacuated, please contact The Data Mine staff during the first week of classes. We are happy to help you. - -=== Mental Health Resources - -* *If you find yourself beginning to feel some stress, anxiety and/or feeling slightly overwhelmed,* try link:https://purdue.welltrack.com/[WellTrack]. Sign in and find information and tools at your fingertips, available to you at any time. -* *If you need support and information about options and resources*, please contact or see the link:https://www.purdue.edu/odos/[Office of the Dean of Students]. Call 765-494-1747. Hours of operation are M-F, 8 am- 5 pm. -* *If you find yourself struggling to find a healthy balance between academics, social life, stress*, etc. sign up for free one-on-one virtual or in-person sessions with a link:https://www.purdue.edu/recwell/fitness-wellness/wellness/one-on-one-coaching/wellness-coaching.php[Purdue Wellness Coach at RecWell]. Student coaches can help you navigate through barriers and challenges toward your goals throughout the semester. Sign up is completely free and can be done on BoilerConnect. If you have any questions, please contact Purdue Wellness at evans240@purdue.edu. -* *If you're struggling and need mental health services:* Purdue University is committed to advancing the mental health and well-being of its students. If you or someone you know is feeling overwhelmed, depressed, and/or in need of mental health support, services are available. For help, such individuals should contact link:https://www.purdue.edu/caps/[Counseling and Psychological Services (CAPS)] at 765-494-6995 during and after hours, on weekends and holidays, or by going to the CAPS office of the second floor of the Purdue University Student Health Center (PUSH) during business hours. - -=== Violent Behavior Policy - -Purdue University is committed to providing a safe and secure campus environment for members of the university community. Purdue strives to create an educational environment for students and a work environment for employees that promote educational and career goals. Violent Behavior impedes such goals. Therefore, Violent Behavior is prohibited in or on any University Facility or while participating in any university activity. See the link:https://www.purdue.edu/policies/facilities-safety/iva3.html[University's full violent behavior policy] for more detail. - -=== Diversity and Inclusion Statement - -In our discussions, structured and unstructured, we will explore a variety of challenging issues, which can help us enhance our understanding of different experiences and perspectives. This can be challenging, but in overcoming these challenges we find the greatest rewards. 
While we will design guidelines as a group, everyone should remember the following points: - -* We are all in the process of learning about others and their experiences. Please speak with me, anonymously if needed, if something has made you uncomfortable. -* Intention and impact are not always aligned, and we should respect the impact something may have on someone even if it was not the speaker's intention. -* We all come to the class with a variety of experiences and a range of expertise, we should respect these in others while critically examining them in ourselves. - -=== Basic Needs Security Resources - -Any student who faces challenges securing their food or housing and believes this may affect their performance in the course is urged to contact the Dean of Students for support. There is no appointment needed and Student Support Services is available to serve students from 8:00 - 5:00, Monday through Friday. The link:https://www.purdue.edu/vpsl/leadership/About/ACE_Campus_Pantry.html[ACE Campus Food Pantry] is open to the entire Purdue community). - -Considering the significant disruptions caused by the current global crisis as it related to COVID-19, students may submit requests for emergency assistance from the link:https://www.purdue.edu/odos/resources/critical-need-fund.html[Critical Needs Fund]. - -=== Course Evaluation - -During the last two weeks of the semester, you will be provided with an opportunity to give anonymous feedback on this course and your instructor. Purdue uses an online course evaluation system. You will receive an official email from evaluation administrators with a link to the online evaluation site. You will have up to 10 days to complete this evaluation. Your participation is an integral part of this course, and your feedback is vital to improving education at Purdue University. I strongly urge you to participate in the evaluation system. - -You may email feedback to us anytime at link:mailto:datamine-help@purdue.edu[datamine-help@purdue.edu]. We take feedback from our students seriously, as we want to create the best learning experience for you! - -=== General Classroom Guidance Regarding Protect Purdue - -Any student who has substantial reason to believe that another person is threatening the safety of others by not complying with Protect Purdue protocols is encouraged to report the behavior to and discuss the next steps with their instructor. Students also have the option of reporting the behavior to the link:https://purdue.edu/odos/osrr/[Office of the Student Rights and Responsibilities]. See also link:https://catalog.purdue.edu/content.php?catoid=7&navoid=2852#purdue-university-bill-of-student-rights[Purdue University Bill of Student Rights] and the Violent Behavior Policy under University Resources in Brightspace. - -=== Campus Emergencies - -In the event of a major campus emergency, course requirements, deadlines and grading percentages are subject to changes that may be necessitated by a revised semester calendar or other circumstances. Here are ways to get information about changes in this course: - -* Brightspace or by e-mail from Data Mine staff. -* General information about a campus emergency can be found on the Purdue website: xref:www.purdue.edu[]. - - -=== Illness and other student emergencies - -Students with *extended* illnesses should contact their instructor as soon as possible so that arrangements can be made for keeping up with the course. Extended absences/illnesses/emergencies should also go through the Office of the Dean of Students. 
- -*Official Purdue University links to Resources and Guidelines:* - -=== University Policies and Statements - -- link:https://www.purdue.edu/odos/osrr/academic-integrity/index.html[Academic Integrity] -- link:https://www.purdue.edu/purdue/ea_eou_statement.php[Nondiscrimination Policy Statement] -- link:https://www.purdue.edu/advocacy/students/absences.html[Class Absences] -- link:https://catalog.purdue.edu/content.php?catoid=15&navoid=18634#classes[Attendance] -- link:https://www.purdue.edu/policies/ethics/iiia1.html[Amourous Relationships] -- link:https://www.purdue.edu/ehps/emergency-preparedness/[Emergency Preparedness] -- link:https://www.purdue.edu/policies/facilities-safety/iva3.html[Violent Behavior] -- link:https://www.purdue.edu/policies/academic-research-affairs/ia3.html[Use of Copyrighted Materials] - -=== Student Support and Resources - -- link:https://www.purdue.edu/asc/resources/get-engaged.html[Engage In Your Learning] -- link:https://www.purdue.edu/policies/information-technology/s5.html[Purdue's Web Accessibility Policy] -- link:https://www.d2l.com/accessibility/standards/[Accessibility Standard in Brightspace] - -=== Disclaimer -This syllabus is subject to change. Changes will be made by an announcement in Brightspace and the corresponding course content will be updated. \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/logistics/ta_teams.adoc b/projects-appendix/modules/ROOT/pages/spring2024/logistics/ta_teams.adoc deleted file mode 100644 index c024e9238..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/logistics/ta_teams.adoc +++ /dev/null @@ -1,20 +0,0 @@ -= T.A. Teams - -Please select the level you are in to see your T.A.s for Spring 2024. - -*Head TA*: Pramey Kabra - kabrap@purdue.edu - -[NOTE] -==== -Use the link below to give your favorite seminar TAs a shout-out, and tell us how they helped you learn at The Data Mine! 
- -https://forms.office.com/r/mzM3ACwWqP -==== - -link:https://the-examples-book.com/projects/current-projects/spring2024/102_TAs[[.custom_button]#TDM 102#] - -link:https://the-examples-book.com/projects/current-projects/spring2024/202_TAs[[.custom_button]#TDM 202#] - -link:https://the-examples-book.com/projects/current-projects/spring2024/302_TAs[[.custom_button]#TDM 302#] - -link:https://the-examples-book.com/projects/current-projects/spring2024/402_TAs[[.custom_button]#TDM 402#] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2024/nav.adoc b/projects-appendix/modules/ROOT/pages/spring2024/nav.adoc deleted file mode 100644 index f2327e2f9..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2024/nav.adoc +++ /dev/null @@ -1,59 +0,0 @@ -* Languages & Tools -** xref:programming-languages:python:index.adoc[Python] -** xref:programming-languages:R:index.adoc[R] -** xref:programming-languages:SQL:index.adoc[SQL] - -* xref:spring2024/syllabus.adoc[Syllabus] - -* Logistics -** xref:spring2024/schedule.adoc[Course Schedule] -** TA Teams -*** xref:spring2024/102_TAs.adoc[TDM 10200] -*** xref:spring2024/202_TAs.adoc[TDM 20200] -*** xref:spring2024/302_TAs.adoc[TDM 30200] -*** xref:spring2024/402_TAs.adoc[TDM 40200] -** Office Hours -*** xref:spring2024/office_hours_102.adoc[TDM 10200] -*** xref:spring2024/office_hours_202.adoc[TDM 20200] -*** xref:spring2024/office_hours_302.adoc[TDM 30200] -*** xref:spring2024/office_hours_402.adoc[TDM 40200] -** xref:submissions.adoc[Submissions] -** xref:templates.adoc[Templates] - -* Current Projects -** xref:tdm-course-overview.adoc[TDM Course Overview] - -** Spring 2024 -*** xref:10200-2024-projects.adoc[TDM 10200] -**** xref:10200-2024-project01.adoc[Project 1] -**** xref:10200-2024-project02.adoc[Project 2] -**** xref:10200-2024-project03.adoc[Project 3] -**** xref:10200-2024-project04.adoc[Project 4] -**** xref:10200-2024-project05.adoc[Project 5] -**** xref:10200-2024-project06.adoc[Project 6] -**** xref:10200-2024-project07.adoc[Project 7] -**** xref:10200-2024-project08.adoc[Project 8] -**** xref:10200-2024-project09.adoc[Project 9] -**** xref:10200-2024-project10.adoc[Project 10] -**** xref:10200-2024-project11.adoc[Project 11] -**** xref:10200-2024-project12.adoc[Project 12] -**** xref:10200-2024-project13.adoc[Project 13] -**** xref:10200-2024-project14.adoc[Project 14] -*** xref:20200-2024-projects.adoc[TDM 20200] -**** xref:20200-2024-project01.adoc[Project 1] -**** xref:20200-2024-project02.adoc[Project 2] -**** xref:20200-2024-project03.adoc[Project 3] -**** xref:20200-2024-project04.adoc[Project 4] -**** xref:20200-2024-project05.adoc[Project 5] -**** xref:20200-2024-project06.adoc[Project 6] -**** xref:20200-2024-project07.adoc[Project 7] -**** xref:20200-2024-project08.adoc[Project 8] -**** xref:20200-2024-project09.adoc[Project 9] -**** xref:20200-2024-project10.adoc[Project 10] -**** xref:20200-2024-project11.adoc[Project 11] -**** xref:20200-2024-project12.adoc[Project 12] -**** xref:20200-2024-project13.adoc[Project 13] -**** xref:20200-2024-project14.adoc[Project 14] -*** xref:30200-2024-projects.adoc[TDM 30200] -*** xref:40200-2024-projects.adoc[TDM 40200] - diff --git a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project1.adoc b/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project1.adoc deleted file mode 100644 index 208209709..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project1.adoc +++ /dev/null @@ -1,248 +0,0 @@ 
-= TDM 10100: Python Project 1 -- 2024 - -:imagesdir: ./images - -**Motivation:** The goal of this project is to get you comfortable with the basics of operating in Jupyter notebooks as hosted on Anvil, our computing cluster. If you don't understand the code you're running/writing at this point in the course, that's okay! We are going to go into detail about how everything works in future projects. - -**Context:** There's no important prior context needed for this project! However, if you are interested in learning more there are plenty of online resources available that go into greater detail about the inner workings of Jupyter notebooks. - -**Scope:** Anvil, Jupyter Lab, Jupyter Notebooks, Python - -.Learning Objectives: -**** -- Learn to create Jupyter notebooks -- Gain proficiency manipulating Jupyter notebook contents -- Learn how to upload/download files to/from Anvil -- Write basic Python code to read in data -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- /anvil/projects/tdm/data/icecream/breyers/reviews.csv -- /anvil/projects/tdm/data/icecream/bj/products.csv - -== Questions - -=== Question 1 (2 pts) - -First and foremost, welcome to The Data Mine! We hope that throughout your journey with us, you learn a lot, make new friends, and develop skills that will help you with your future career. Throughout your time with The Data Mine, you will have plenty of resources available should you need help. By coming to weekly seminar, posting on the class Piazza, and joining Dr. Ward and the TA team's office hours, you can ensure that you always have the support you need to succeed in this course. - -[IMPORTANT] -==== -If you did not (yet) set up your 2-factor authentication credentials with Duo, you can set up the credentials here: https://the-examples-book.com/starter-guides/anvil/access-setup. If you're still having issues with your ACCESS ID, please send an email containing as much information as possible about your issue to datamine-help@purdue.edu -==== - -Let's start off by starting up our first Jupyter session on https://www.rcac.purdue.edu/compute/anvil[Anvil]! First, visit https://ondemand.anvil.rcac.purdue.edu/[this link] and sign in using the username and password you picked when you set up your credentials. - -In the upper-middle part of your screen, you should see a dropdown button labeled `The Data Mine`. Click on it, then select `Jupyter Notebook` from the options that appear. If you followed all the previous steps correctly, you should now be on a screen that looks like this: - -image::1-1.png[OnDemand Jupyter Lab, width=792, height=500, loading=lazy, title="OnDemand Jupyter Lab"] - -If your screen doesn't look like this, please try and select the correct dropdown option again or visit seminar for more assistance. - -There are a few key parts of this screen to note: - -- Allocation: this should always be cis220051 for The Data Mine -- Queue: again, this should stay on the default option `shared` unless otherwise noted. -- Time in Hours: The amount of time your Jupyter session will last. When this runs out, you'll need to start a new session. It may be tempting to set it to the maximum, but our computing cluster is a shared resource. This means every hour you use is an hour someone else can't use, so please only reserve it for 1-2 hours at a time.
-- CPU cores: The number of processor cores that your Jupyter session will have access to. Roughly speaking, more cores means more computing power, and the memory available to your session typically scales with the number of cores you request. This is also a shared resource, and you should almost never need more than 3 cores for any project in TDM 101. For most projects, we will tell you how many cores you should use. - -[IMPORTANT] -==== -When your session ends, you will no longer be able to save/edit your work. Be sure to save on a regular basis so that even when your session ends, your work is safe. To put it more simply: a session ending does not delete your work, and anything you saved prior to the session ending will still be there when you start a new session. -==== - -With the key parts of this screen explained, go ahead and select 1 hour of time with 2 CPU cores, ensure that the `Use Jupyter Lab instead of Jupyter Notebook` box is checked, and click Launch! After a bit of waiting, you should see something like below. Click connect to Jupyter and proceed to the next question! - -image::1-2.png[Launch Jupyter Lab, width=792, height=500, loading=lazy, title="Launch Jupyter Lab"] - - -[IMPORTANT] -==== -You likely noticed a short wait before your Jupyter session launched. This happens while Anvil finds and allocates space for you to work. The more students are working on Anvil, the longer this will take, so we suggest starting your projects early during the week to avoid any last-minute hiccups causing a missed deadline. -==== - -To cement the idea of Anvil being a large (but still limited) resource, please visit https://www.rcac.purdue.edu/compute/anvil[this website]. Read through the information about Anvil (it's short!) and pay special attention to the table at the bottom about Anvil's sub-clusters. For this question, we want you to calculate how many nodes, cores, and total memory (in GB) Anvil has between sub-clusters A, B, and G. (Hint: 1TB = 1000GB). In the next question, we'll walk you through where to write your answer, so for now just keep these numbers noted. - -.Deliverables -==== -- The total number of nodes, cores, and memory in Anvil sub-clusters A, B, and G combined. -==== - -=== Question 2 (2 pts) - -Once you connect to Jupyter, you should be on a screen that looks similar to this: - -image::1-3.png[Jupyter Lab Homescreen, width=792, height=500, loading=lazy, title="Jupyter Lab Homescreen"] - -Before you jump into Jupyter, take a minute to read through https://the-examples-book.com/starter-guides/tools-and-standards/jupyter[this page] that runs through most of the basics about Jupyter Lab. Additionally, take note of the 'Launcher' tab that is taking up most of the screen. The different options that you see (like Python 3, R 4.0.5, and Testing) are called https://the-examples-book.com/starter-guides/tools-and-standards/unix/jupyter-lab-kernels[kernels], and each kernel reads and runs code slightly differently. For Python, we'll be using the `seminar` kernel, but you should just keep that in your back pocket for now. - -Take a second to download our project template https://the-examples-book.com/projects/current-projects/_attachments/project_template.ipynb[here] (which can also be found on Anvil at `/anvil/projects/tdm/etc/project_template.ipynb`). Then upload the template to Jupyter and open it. - -When you first open the template, you may get a pop-up asking you to select what kernel you'll be using. Select `seminar`. You may have to scroll down to find it.
If you do not get this pop-up, you can also select a kernel by clicking on the upper right part of your screen that likely says something similar to `No Kernel`, and then selecting the kernel you want to use. - -A Jupyter notebook is made up of `cells`, which you can edit and then `run`. There are two types of cells we'll work in for this class: - -- Markdown cells. These are where your writing, titles, sections, and paragraphs will go. Double clicking a markdown cell puts it in `edit` mode, and then clicking the play button near the top of the screen runs the cell, which puts it in its formatted form. More on this in a second. -- Code cells. These are where you will write and run all your code! Clicking the play button will run the code in that cell, and the programming language will be inferred based on the kernel that you chose. - -For this question, you're responsible for three main tasks: - -. Fill in Question 1 with the information you found previously, in a markdown cell. -. In Question 2, copy and paste `print("Hello and Welcome to The Data Mine!")` into the code cell, and then run it. You should see it output "Hello and welcome to The Data Mine!", which is the result of running your code. -. In the markdown cell for Question 2, please show three different examples of markdown elements. https://www.markdownguide.org/cheat-sheet/[This cheatsheet] is a good resource for some common markdown elements that you can see. An example you could do is a header, an ordered list, and some bold text. Be sure to run the cell after filling it in to see the results of your markdown! - -[NOTE] -==== -Some common Jupyter notebooks shortcuts: - -- Instead of clicking the `play button`, you can press ctrl+enter (or cmd+enter on Mac) to run a cell. -- If you want to run the current cell and then immediately create a new code cell below it, you can press alt+enter (or option+enter on Mac) to do so. -- When a cell is selected (this means you clicked next to it, and it should show a blue bar to its left to signify this), pressing the `d` key twice will delete that cell. -- When a cell is selected, pressing the `a` key will create a new code cell `a`bove the currently selected cell. -- When a cell is selected, pressing the `b` key will create a new code cell `b`elow the selected cell -==== - -As this is our first real task of the semester, you'll find a photo below of what your completed Question 2 may look like. Note that yours may differ slightly. - -image::1-4.png[Question 2 Example Answer, width=792, height=500, loading=lazy, title="Question 2 Example Answer"] - -.Deliverables -==== -- Your answers from Question 1, filled in. -- The result of running the provided `print()` code. -- Three examples of markdown elements in your markdown cell. -==== - -=== Question 3 (2 pts) - -Let's get more comfortable with code cells in Jupyter by learning how to run code in different languages! While most of the code you'll run in this course will be in either Python or R, sometimes different languages like Bash, Perl, and more will provide more straightforward answers to a problem. - -In Question 3, copy the following Python code into a code cell and run it. This will read in some data, and then tell you how much space (in bytes) your dataframe is taking up!: - -[source, python] ----- -import pandas as pd -my_df = pd.read_csv("/anvil/projects/tdm/data/icecream/breyers/reviews.csv") -print(my_df.memory_usage(index=True, deep=True).sum(), "bytes") ----- - -Now let's do the same thing but in Bash! 
Create a new code cell below the one you just ran (refer to the hint in the last question for a shortcut on how to do this), and copy in the below code: - -[source, bash] ----- -%%bash - -echo $(du /anvil/projects/tdm/data/icecream/breyers/reviews.csv --bytes | cut -f1) bytes ----- - -Running this should give you a smaller output than the Python output. This is because in bash, we are checking the size of the stored data, while in Python we are reading the data into a `dataframe` that has a bit more memory associated with it to make it easier to work with. - -[NOTE] -==== -As a side note, bash is an **extremely** important foundational tool for working with data and computers more generally. As a 'command line tool', `bash` is essentially a foundational programming language that is very close to the computer's basic hardware, and has a lot of fast, efficient tools that are useful no matter what project you're working on. From navigating through file directories, to writing basic scripts, to locating and running programs, `bash` is hiding in the background of most everything your computer does. -==== - -Take note of the `%%bash` line in the cell you just ran. This is called a `cell magic` (magics that start with a single `%` are line magics, and magics that start with `%%`, like this one, are cell magics), and it tells our kernel that we want it to run our code as a different language than the default. As an added example, writing `%%R` will allow us to run code in the R programming language. - -[NOTE] -==== -For more information on line and cell magics and how they work, please refer to https://ipython.readthedocs.io/en/stable/interactive/magics.html#[this page]. -==== - -To further cement your understanding of these magics, we are going to translate one more bit of code from Python to Bash. Let's take the `print()` code from the last problem and convert it to its Bash equivalent! As a reminder, here is the Python code to translate to Bash: - -[source, python] ----- -print("Hello and Welcome to The Data Mine!") ----- - -[NOTE] -==== -Printing in Bash can be done using the `echo` command. For example, if I wanted to print "Dr. Ward is a robot" I could write `echo Dr. Ward is a robot` -==== - -.Deliverables -==== -- The code and results of checking the size of the reviews file in both Python and Bash. -- The Bash equivalent of the `print()` statement from the last problem, and the results of running it. -==== - -=== Question 4 (2 pts) - -In the next 2 questions we are going to introduce some new code that will allow us to read in large datasets and begin to work with them! If you don't understand the specifics, that's okay for now. For now, let's just learn by doing. To start, run the following Python: - -[source, python] ----- -import pandas as pd - -my_df = pd.read_csv("/anvil/projects/tdm/data/icecream/breyers/reviews.csv") -print(my_df.shape) -print(my_df.head()) ----- - -The breakdown of this code is as follows: - -. We import the `pandas` library, and we add `as pd` so we don't have to type out the full name every time we want to use it. -. We use the `read_csv` function from the `pandas` library to read the data from the given file into a dataframe we call my_df. -. We print the shape of the dataframe, my_df. You should see an output of (5007, 8). -. We print the `head()` of the dataframe, which is just the first 5 rows of our dataframe and the column headers (if they exist). - -For the last part of this question, we want you to create a new code cell and write some Python to print the names of the columns of our dataframe.
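If it helps to see this pattern on something tiny first, here is a short, self-contained sketch. The little dataframe below is made up purely for illustration (it is not the ice cream data), but the same shape, head, and column-name pattern applies to `my_df` or any other dataframe.

[source, python]
----
import pandas as pd

# a tiny, made-up dataframe used only to illustrate the pattern
example_df = pd.DataFrame({"flavor": ["vanilla", "chocolate", "mint"],
                           "stars": [5, 4, 3]})

print(example_df.shape)            # (3, 2) -- number of rows and columns
print(example_df.head())           # the first few rows of the dataframe
print(list(example_df.columns))    # ['flavor', 'stars'] -- the column names
----

Running the same kind of calls on `my_df` gives you its shape, a quick preview, and the column names you need for this question.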
If you do everything correctly, you should see the columns are named key, author, date, stars, title, helpful_yes, helpful_no, and text. If you're struggling, take a look at the hint below: - -[NOTE] -==== -Pandas dataframes have a built-in attribute called `columns` that holds the names of the columns of a dataframe. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html[Here] is a link to a documentation page on the attribute and examples of it being used. An important part of data science and writing code is being able to read and learn from documentation, so we will try and provide relevant pages throughout the course. If you have any questions or are having trouble interpreting some documentation (often just called 'docs'), please reach out. -==== - -.Deliverables -==== -- The result of running the provided code that reads in a dataframe and prints its shape. -- A code cell that prints the names of the columns in `my_df` -==== - -=== Question 5 (2 pts) - -Let's take a second to reflect on everything you did and learned during this project. First, you learned how to launch a Jupyter Notebook session on the Anvil supercomputing cluster. Next you learned about uploading files to Anvil, the general structure of Jupyter notebooks, and how to manipulate the contents of a notebook to fit your working style. Finally, you learned how to write and run some basic code in Jupyter notebooks, including how to read in data! - -In this last question, we are going to try and put everything you learned today together. In the previous question, you read a file on Breyer's ice cream reviews into a Pandas dataframe called `my_df` and printed the number of columns and rows in the dataframe. Finally, we had you write some code to print the names of the columns of `my_df`. - -In this question, we want you to read a file on Breyers' ice cream products into a Pandas dataframe called `BreyProd_df`. The path to the file is "/anvil/projects/tdm/data/icecream/bj/products.csv". Next, print the number of rows and columns in `BreyProd_df`, and then print the names of the columns in `BreyProd_df`. - -[NOTE] -==== -The code needed to solve this problem is almost identical to that of the last problem. If you're struggling, consider revisiting Question 4 and trying to better understand what is going on in that code, and feel free to copy the code from Question 4 into Question 5 and modify it directly. -==== - -One way you can validate that your code is working correctly is to compare the results of your code that outputs the number of rows/columns in the dataframe with the code that outputs the names of the columns in the dataframe. The number of columns in the dataframe should match the number of names printed. - -Finally, make sure that your name is at the top of the project template. If you used outside resources (like Stack Overflow) or got help from TAs, make sure to note where you got assistance from, and on what part of the project they assisted you, in the appropriate sections at the top of the template. - -.Deliverables -==== -- Code that reads the `products.csv` file into a dataframe -- Code that prints the shape of the resulting dataframe -- Code that prints the names of the columns in the resulting dataframe -==== - -== Submitting your Work - -Congratulations! Assuming you've completed all the above questions, you've just finished your first project for TDM 10100!
If you have any questions or issues regarding this project, please feel free to ask in seminar, over Piazza, during office hours, or by emailing Dr. Ward. Prior to submitting, make sure you've run all of the code in your Jupyter notebook and the results of running that code is visible. More detailed instructions on how to ensure that your submission is formatted correctly can be found https://the-examples-book.com/projects/current-projects/submissions[here]. To download your completed project, you can right-click on the file in the file explorer and click 'download'. - -Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don't lose any points. At the bottom of each 101 project, you will find a comprehensive list of all the files that need to be submitted for that project. We hope your first project with us went well, and we look forward to continuing to learn with you on future projects!! - -.Items to submit -==== -- firstname_lastname_project1.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project2.adoc b/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project2.adoc deleted file mode 100644 index 8ca36e632..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project2.adoc +++ /dev/null @@ -1,281 +0,0 @@ -= TDM 10100: Python Project 2 -- 2024 - -**Motivation:** Python is one of if not the most used programming language in the world. It is versatile, readable, and a great language for beginners. In the next few projects we will be doing a deep dive into Python, learning about operators, variables, functions, looping and logic, and more! - -**Context:** Project 1's introduction to Jupyter Notebooks will be vital here, and it will be important to understand the basics that we covered last week. Feel free to revisit the project and your work during this project if you need reminders! - -**Scope:** Python, Operators, Conditionals - -.Learning Objectives: -**** -- Learn how to perform basic arithmetic in Python -- Get familiar with conditional structures in Python -- Solve a famous programming problem using math! -- Apply your problem solution to real-world data -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- /anvil/projects/tdm/data/olympics/athlete_events.csv - -== Questions - -=== Question 1 (2 pts) - -The first step to understanding and learning a new programming language is learning its `operators`, symbols that represent processes or math. 
There are a few different general types of operators, detailed below: - -- arithmetic operators: perform common mathematical operations (i.e. `+` is addition, `*` is multiplication, `%` is modulo, AKA the remainder from division, i.e. 5%4=1) -- assignment operators: assign values to variables. More on variables in next week's project, but variables are basically names that we assign values (i.e. `x = 6` makes the value of x, 6. `x += 6` makes the value of x, the current value + 6. So if x was 5, and then `x += 6` is run, x now has a value of 11) -- comparison operators: compare two values and return either `True` or `False`. (For example, `x == y` checks if x and y are equal. `x != y` checks if x and y are not equal. `x <= y` checks if x is less than or equal to y.) -- logical operators: these are used to combine comparisons to check multiple conditions. (i.e. if we wanted to make sure x was less than 5, but greater than 2, we could write `x < 5 and x > 2`, or more succinctly, `2 < x < 5`) -- membership operators: these are quite unique to Python, and allow us to check, for example, if a list contains a specific value. If we had a list of numbers named `ourlist`, we could write `5 in ourlist` which would return `True` if 5 is in ourlist and `False` if it is not. - -These are the basic types of operators we'll work with during this class. For a more exhaustive list, and direct examples of how to use each operator, https://www.w3schools.com/python/python_operators.asp[this website] can help give detailed descriptions of all the different operators in Python. - -In these next few questions, you'll be asked to write your own code to perform basic tasks using these operators. Please refer to the above linked website and descriptions for reminders on which operators are which, and how they can be used. - -[NOTE] -==== -In Python, everything after a `#` on a line of code is considered a 'comment' by the computer. https://www.w3schools.com/python/python_comments.asp[Comments] serve as notes in your code, and don't do anything when run. It should be a priority to always comment your code well to make it understandable to an outside reader (or you, in the future!). -==== - -[IMPORTANT] -==== -**Precedence** is an important concept with operators, and determines which of our operators "acts first". You can think of it as being similar to the concept of PEMDAS in math. https://www.geeksforgeeks.org/precedence-and-associativity-of-operators-in-python/[This table] details operator precedence in Python and is worth taking a look at before attempting the next two questions. -==== - -For this question, please complete the following tasks in a single code cell. At the very end of the code cell, please add `print(myVariable)` to print the results of your work. Some starter code has been provided below for your reference: - -[source, Python] ----- -# create a new variable named myVariable and assign it a value of 7 -myVariable = 7 - -myVariable # multiply here -myVariable = # subtract, then multiply here -myVariable = # add the two values, then multiply here -myVariable = # complete the rest of the math here - -# print the final value of myVariable (a special date!) -print(myVariable) ----- - -. Create a new variable named `myVariable` and assign it a value of 7. -. Using an assignment operator, multiply the value of `myVariable` by the number representing your birth month. -. In one line of code, using two arithmetic operators, subtract 1 from the value of `myVariable` and then multiply it by 13
In one line of code, using three arithmetic operators, add 3, add the day of your birth and then multiply by 11 the value of `myVariable` -. All in one line, using any operators you choose, subtract the month of your birth and the day of your birth from `myVariable`, divide it by 10, add 11, and then divide by 100. (Hint: You may need to use parentheses!) -. In a https://www.markdownguide.org/cheat-sheet/[markdown cell], write a sentence describing the number you got as the value of `myVariable` at the end of all these operations. Is there anything special about it? (It may or may not be an important date from your life!) - -.Deliverables -==== -- A code cell containing the 5 lines of code requested above, and a print statement showing the final value of `myVariable`. -- A markdown cell identifying what is special about the resulting number. -==== - -=== Question 2 (2 pts) - -While we'll cover control structures in greater detail in the next few weeks, let's introduce the basic concept so we can see the **power** of logical operators when used in conditionals! - -Conditionals are exactly what they sound like: blocks of code that perform actions _if_ we satisfy certain conditions. Creatively, we call these _if statements_. In Python, _if statements_ are structured like so: - -[source, python] ----- -# general structure -if (condition): - do this action - -# specific example -if (x > 0): - print("X is a positive number!") ----- - -For this question, we want you to use the operators we just learned to perform the following: -- define a variable `myYear` -- write an `if statement` that prints "Divisible by 4!" if `myYear` is divisible by 4 -- write an `if` statement that prints "Not divisible by 100!" if `myYear` is not divisible by 100 -- write an `if` statement that prints "Leap Year!" if `myYear` is divisible by 4 **AND** myYear is not divisible by 100 - -Here is some skeleton code to get you started (the first if statement is already completed): - -[source, python] ----- -myYear = 2000 - -if (myYear % 4 == 0): - print("Divisible by 4!") -if # continue your code here... ----- - -To check your work, here are the following test cases: - -- Year 2000 is divisible by 4, but not 100 -- Year 2020 is a leap year -- Year 1010 is not divisible by 100 or 4 - -.Deliverables -==== -- Three _if_ statements as described above. -==== - -=== Question 3 (2 pts) - -Let's continue to build on the foundational concept of _if_ statements. Sometimes, when our first condition is not true, we want to do something else. Sometimes we only want to do something else if _another_ condition is true. In an astounding feat of creativity, these are called _if/else/else-if_ statements, and here is their general structure: - -[NOTE] -==== -In Python, `elif` stands for "else if". -==== - -[source, python] ----- -# general structure (we can have as many elifs as we want!) -if (condition): - do this -elif (other condition): - do this instead -elif (third condition): - do this if we meet third condition -else: - this is our last option - -# we can also have no elif statements if we want! -if (condition): - do this -else: - do this instead - -# and finally, a concrete example -x = #some value -if (x > 100): - print("x is a really big number!") -elif (x > 0): - print("x is a positive number!") -elif (x < -100): - print("x is a really negative number!") -else: - print("x is a negative number") ----- - -Feel free to experiment with these examples, plugging in different values of `x` and seeing what happens. 
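For instance, here is one possible run of that same example (purely illustrative, not something you need to submit), with a concrete value plugged in for `x`:

[source, python]
----
x = -7   # try swapping in 150, 42, or -250 and re-running the cell

if (x > 100):
    print("x is a really big number!")
elif (x > 0):
    print("x is a positive number!")
elif (x < -100):
    print("x is a really negative number!")
else:
    print("x is a negative number")

# with x = -7 this prints: x is a negative number
# with x = 150 it would print: x is a really big number!
----

Notice that only the first condition that evaluates to `True` gets its branch run; once a branch is taken, the rest of the chain is skipped.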
Learning to code is done with lots of experimentation, and exploring/making mistakes is a valuable part of that learning experience. - -Let's build on your code from the last problem to create an _if/else/else-if_ statement that is able to identify any and all leap years! Below is the definition of a leap year. Your task for this question is to take the below definition, and, defining a variable `myYear`, write an _if/else/else-if_ block that prints "Is a leap year!" if `myYear` is a leap year, and prints "Is not a leap year!" if `myYear` is not a leap year. - -[IMPORTANT] -==== -A year is a leap year if it is divisible by 4, but not 100, _or_ if it is divisible by 100 and 400. To put it in language that may make more sense in a conditional structure: - -If a year is divisible by 4, but not divisible by 100, it is a leap year. Else if a year is divisible by 100 and is divisible by 400, it is a leap year. Else, it is not a leap year. -==== - -[source, python] ----- -myYear = 2000 - -if # condition 1: - print("Is a leap year!") -elif # condition 2: - print("Is a leap year!") -else: - print("Is not a leap year!") ----- - -[NOTE] -==== -Here are some test cases for you to use to double-check that your code is working as expected. -- 1896, 2000, 2004, 2008, and 2024 are all leap years -- 1700, 1900, and 2010 are all not leap years -==== - -.Deliverables -==== -- A conditional structure to identify leap years, and the results of running it with at least one year. -==== - -=== Question 4 (2 pts) - -Okay, we've learned a lot in this project already. Let's try and master the concepts we've been working on by making a more concise version of the conditional structure from the last problem. Here are the rules: you must create a conditional structure with only one _if_ and only one _else_. No _elifs_ are allowed. It has to accomplish fundamentally the same task as in the previous question, and you may use the test cases provided in the previous question as a way to validate your work. Some basic skeleton code is provided below for you to build on: - -[source, python] ----- -myYear = 2000 - -if # condition - print("Is a leap year!") -else: - print("Is not a leap year!") ----- - -.Deliverables -==== -- A shortened version of the conditional structure from the last problem, and the results of running it with at least one year. -==== - -=== Question 5 (2 pts) - -Great work so far. Let's summarize what we've learned. In this project, we learned about the different types of operators in Python and how they are used, what conditional statements are and how they are structured, and how we can use logical and comparison operators in conditional statements to make decisions in our code! - -For this last question, we are going to use what we've been building up this entire project on some real-world data and make observations based on our work! The below code has been provided to you, and contains a few new concepts we are going to cover in next week's project (namely, `for` loops and lists). For now, you don't have to understand fully what is going on. Just insert the conditions you wrote in the last problem where specified to complete the code (you only have to change lines with `===` in comments), run it, and write at least 2 sentences about the results of running your code and any observations you may have regarding that output. Include in those two sentences what percentage of the Olympics were held on leap years.
(If you are interested in understanding the provided code, feel free to take some time to read the comments explaining what each line is doing.) - -[IMPORTANT] -==== -The Olympics data can be found at "/anvil/projects/tdm/data/olympics/athlete_events.csv" -==== - -[NOTE] -==== -In the below code, you may have noticed the addition of `.unique()` when we're getting a list of years from our data. We'll refrain from covering this in detail until a future project, but what you can know is that here it takes our list of all years and removes all the duplicate years so we have only one of each year in our resulting `year_list` -==== - -[source, Python] ----- -import pandas as pd - -olympics_df = # === read the dataset in here === - -# get a list of each year in our olympics_df, -# and then use .unique() to remove duplicate years -year_list = olympics_df["Year"].unique() - -# create an empty list for our results -leap_list = [] - -# apply our conditional to each year in our list of years -for year in year_list: - if # === add your condition for leap years here === - # add the year to our list of leap years - leap_list.append(year) - else: - # if its not a leap year, do nothing - pass - -# prints our list of leap years and number of leap years -print("The Olympics were held on leap years in:", sorted(leap_list)) -print(len(leap_list), "of the", len(year_list), "Olympics occurrences in our data were held on a leap year.") ----- - -.Deliverables -==== -- The results of running the completed code -- At least two sentences with observations about the results and what percent of Olympics are held on leap years -==== - -== Submitting your Work - -Great job, you've completed Project 2! This project was your first real foray into the world of Python, and it is okay to feel a bit overwhelmed (I know I was at first!). Python is likely a new language to you, and just like any other language, it will get much easier with time and practice. As we keep building on these fundamental concepts in the next few weeks, don't be afraid to come back and revisit your previous work or re-read sections of project instructions. As always, please ask any questions you have during seminar, on Piazza, or in office hours. We hope you have a great rest of your week, and we're excited to keep learning about Python with you in the next project! - -.Items to submit -==== -- firstname_lastname_project2.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project3.adoc b/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project3.adoc deleted file mode 100644 index bca45ea23..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project3.adoc +++ /dev/null @@ -1,253 +0,0 @@ -= TDM 10100: Python Project 3 -- 2024 - -**Motivation:** So far, we've learned how to set up and run code in a Jupyter Notebook IDE (integrated development environment), perform operations, and set up basic decision structures (_conditionals_) in our code. However, all we've really done so far is just define one value at a time to pass into our conditionals and then changed that value by hand. As you probably realized, this is inefficient and completely impractical if we want to handle lots of data, either iteratively (aka one-by-one) or in some other efficient method (i.e in parallel, by grouping, etc. More on this later...). This project will be dedicated to learning about looping structures and vectorization, some common approaches that we use to iterate through and process data instead of doing it by hand. - -**Context:** At this point, you should know how to read data into a Pandas dataframe from a .csv file, understand and be able to write your own basic conditionals, and feel comfortable using operators for logic, math, comparison, and assignment. - -**Scope:** For Loops, While Loops, Vectorized operations, conditionals, Python - -.Learning Objectives: -**** -- Learn to design and write your own `for` loops in Python -- Learn to design and write your own `while` loops in Python -- Learn about "vectorization" and how we can use it to process data efficiently -- Apply looping and vectorization concepts to real-world data -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- /anvil/datasets/tdm_temp/data/whin/observations.csv - -== Questions - -=== Question 1 (2 pts) - -Let's first discuss two key foundational concepts to understand when working with loops: lists and indexing. Lists are relatively intuitive: they are a list of elements. In Python, lists can contain elements of different data types like strings (`x = "Hello World"`) and integers (`y = 7`), however it is very common practice for lists to only have elements of one data type (i.e. either all integers, or all strings). Lists in Python are constructed like so: - -[source, python] ----- -# create a list with three integer elements -list1 = [23, 49, 5] - -# create a list with three string elements -list2 = ["hello", "Mister", "Foobar"] - -# create an empty list -list3 = [] ----- - -Now we should talk about the interesting part of lists: adding, subtracting, and accessing elements in lists! Adding and removing elements from lists in Python can be done a number of ways, often by using built in methods like `.append()`, `.insert()`, `.extend()`, `.remove()`, and `.pop()`. Here are some more exhaustive descriptions of these methods, and examples on how to use them for https://www.w3schools.com/python/python_lists_add.asp[adding] and https://www.w3schools.com/python/python_lists_remove.asp[removing] elements from lists. - -Accessing elements of a list, also called _indexing_, is different from language to language. 
In Python, lists are _0-indexed_, meaning the first element in a list is at **index 0**. Accessing elements of a list can be done using square brackets. We can also _slice_ a list, which is simply indexing in such a way that we grab a chunk of elements from the list, as opposed to just one. Some basic examples are shown below, and you can read more about indexing in Python https://www.w3schools.com/python/python_lists_access.asp[here]. - -[source, python] ----- -# create our list, then append a new element -list1 = ["Jackson", "is terrified of", "spiders"] -list1.append("and cockroaches") -print(list1) # this will print ["Jackson", "is terrified of", "spiders", "and cockroaches"] - -# print a few elements of our list using indexing -print(list1[0]) # prints "Jackson" -print(list1[2]) # prints "spiders" -print(list1[3]) # prints "and cockroaches" - -# slice our list to get the two middle elements -print(list1[1:3]) # prints ['is terrified of', 'spiders'] - -# slice our list to get every other element -print(list1[::2]) ----- - -In a more "big data" sense, you can also index into Pandas dataframes! This can be done numerically, like we did with regular lists, or by the name of the column! Below is an example of how we did this in a previous project: - -[source, python] ----- -import pandas as pd - -# read in our data -olympics_df = pd.read_csv("/anvil/datasets/tdm_temp/data/olympics/athlete_events.csv") - -# index into the dataframe and get the "Year" column -year_list = olympics_df["Year"] ----- - -[IMPORTANT] -==== -Not all data is the same! `.csv` stands for `comma-separated-values`, and as such, the `read_csv` function that we've been using is looking for commas between each bit of data. However, commas are only one valid separator, and many data files will use pipes `|` or even just spaces to separate data instead. Our `pd.read_csv()` function can still read these in, but you'll have to specify the separator if its not commas. For pipe-separated data (like in this project), you can use something that looks like `pd.read_csv("data.csv", sep="|")` -==== - -For this problem, we are going to introduce some new data from https://data.whin.org/[WHIN], a large weather analysis organization that helps integrate weather and agricultural data in the Wabash Heartland region (that's all around Purdue!). Your tasks are as follows: - -- read the data from "/anvil/datasets/tdm_temp/data/whin/observations.csv" into a dataframe called `obs_df`(Hint: Don't forget to specify the separator!) -- index into your `obs_df` dataframe, and store the "temperature_high" column to a new variable called `tempF_list` -- With your newly formed `tempF_list`, print the 101st element - -[NOTE] -==== -If you want to take a look at a small summary of your dataframe, the `head()` method will print the first 5 rows of your data, along with the names of the columns of your data (if they exist). The syntax for this is `obs_df.head()` -==== - -.Deliverables -==== -- a new Pandas dataframe called `obs_df` -- a new list that is the temperature_high column of `obs_df` called `tempF_list` -- the 101st element in the `tempF_list` -==== - -=== Question 2 (2 pts) - -Now that we have some idea about how we can store lists of data, let's talk about repetitive tasks. To put it concisely: repetition is bad. When writing code, there should be a focus on avoiding unnecessary repititions, which will help ensure readability and good formatting along with improving our code's speed. 
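To see the problem concretely, here is a quick sketch (with a short, made-up list of temperatures) of what printing every element looks like when it is done completely by hand:

[source, python]
----
# a small, made-up list of high temperatures
temps = [63, 71, 58, 80, 67]

# printing every element "by hand" -- the same line copied over and over
print(temps[0])
print(temps[1])
print(temps[2])
print(temps[3])
print(temps[4])

# now imagine copy-pasting that line for every one of the thousands of rows in obs_df!
----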
When it comes to avoiding repetitions, looping structures save the day! - -There are two basic kinds of loops in Python: `for` loops, and `while` loops. Their names also encapsulate how they work; `for` loops do some actions _for_ each item in some set/list of items. `while` loops perform some actions _while_ some condition is true. Below are a few basic examples of how these structures can be used with lists. - -[source, python] ----- -ourlist = ["One-eyed", "One-horned", "Flying Purple", "People Eater"] - -# this goes through each number from 0 to 3 and uses it to index into our list -for i in range(4): - print("The value of i:", i) - print("List element", i, ":", ourlist[i]) - -# we can also iterate directly through a list in Python, like this -for j in ourlist: - print(j) - -# if we introduce a counter variable, we can do the same thing with a while loop! -counter = 0 -while (counter < len(ourlist)): # len(ourlist) gives us the length of our list - print("The value of counter:", counter) - print("List element", counter, ":", ourlist[counter]) - counter += 1 # if you don't update counter, the loop runs forever! ----- - -While `for` and `while` loops can often be used to perform the same tasks, one of them will often be a more intuitive fit for a given task, and it is worth thinking about which one to use before diving straight into the problem. - -[NOTE] -==== -`range` can be used to go over every other element, every third element, and more. It can also start at a specified index. For example, `range(1,20,2)` will count from 1 up to (but not including) 20, going up by 2 each time. The result looks like `[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]`. For more information on `range`, click https://www.w3schools.com/python/ref_func_range.asp[here]. -==== - -[NOTE] -==== -`enumerate()` is another useful Python function for working with loops and lists! By using enumerate on a list, we can iterate through it and get both the current element and its index at the same time! `for ind, el in enumerate(list)` will store the current index to a variable called `ind` and the current element to a variable called `el`. https://www.geeksforgeeks.org/enumerate-in-python/[Here's a quick website] with some good examples! -==== - -Here are a few basic tasks to complete for this problem to get you more familiar with looping: - -- Construct a list of length 10. Call it `mylist`. The elements can be anything you want. -- Using a `for` loop, change all of the even-index elements of the list to be the string "foo" (You can consider `0` to be even) -- Using a `while` loop, change all of the odd-index elements of the list to be the string "bar" -- Using a `for` loop, change all of the elements whose index is divisible by 3 to be "buzz" (Hint: Use `enumerate()`! For this task, leave index `0` alone so that it stays "foo") -- print the final list `mylist` after making all of the above changes - -[NOTE] -==== -Your final list should be `['foo', 'bar', 'foo', 'buzz', 'foo', 'bar', 'buzz', 'bar', 'foo', 'buzz']` -==== - -.Deliverables -==== -- a list, `mylist`, of length 10, where each element is either foo, bar, or buzz based on the above instructions -- a print statement that prints `mylist` -==== - -=== Question 3 (2 pts) - -Let's bring the looping we just learned to the real-world data we read into our `obs_df` dataframe from Question 1! In this problem, we're going to use looping to perform two tasks. One of these tasks is better suited for a `while` loop, and the other is better suited for a `for` loop. You can get full credit no matter which loop you use for which task.
Just ensure that you use each loop only once, and that you complete the tasks' deliverables. - -. If you're an in-state student, you likely didn't have any problem with the temperatures we looked at earlier. However, for most of the rest of the world, it certainly would be a concern to see a number like `63` on their thermometer! For this task, we want you to take the list you created in question 1, `tempF_list`, convert each value to Celsius, and store them in a new list called `tempC_list`. (Conversion from Fahrenheit to Celsius is simply `Cels = (Fahr - 32) * 5/9`) - -. With our newly created `tempC_list`, we now have a list of temperatures around the Wabash Heartland that are in a more accessible form. However, we want to do more than just unit conversion with this data! For this task, print a count of how many of the temperatures in `tempC_list` are greater than or equal to 24 degrees Celsius. Also print what percentage of the elements in our list are greater than or equal to 24 degrees Celsius (Hint: % = (count / total) * 100) - -.Deliverables -==== -- The `tempF_list` from Question 1 converted to Celsius -- The number of temperatures in `tempC_list` greater than or equal to 24 degrees Celsius -- The percentage of `tempC_list` greater than or equal to 24 degrees Celsius -==== - -=== Question 4 (2 pts) - -Fantastic! We learned what loops were, used them on a few small lists of our own creation, and then successfully applied them to real-world data in order to complete practical tasks! At this point, you're probably thinking "Wow! Lists are super useful! I'm so glad I learned all there is to know and I never have to learn anything else again!" - -...But what if I told you there was an even better way to work with lists? Introducing: vectorization. When we want to perform common actions on every element in a list, array, dataframe, or similar, Python and the Pandas library present us with easy ways to apply that action, in parallel, to all the items in our list. This is not only a lot easier to read than a loop (it takes about 1 line of vectorized code to do the same task as the 3-4 lines of looping we wrote earlier), it's also a lot more efficient, as there are some neat tricks going on behind the scenes to speed things up. The concept here is pretty straightforward but has a lot of depth to it, so feel free to read more about it https://pythonspeed.com/articles/pandas-vectorization/[here]. - -In the same vein of thinking, we can also slice our lists/arrays/dataframes based on conditions. This also ends up being a lot more readable and efficient than looping, and is only a slight extension to the idea of slicing we covered earlier in this project. - -Below are some examples that are relevant to the tasks you'll be working on during this problem. - -[source, python] ----- -# read in the data -obs_df = pd.read_csv("/anvil/datasets/tdm_temp/data/whin/observations.csv", sep="|") - -# use vectorized operations to create a new column in our -# dataframe with temperatures converted to the Rankine scale -obs_df["temperature_Rankine_high"] = obs_df["temperature_high"] + 459.67 - -# use vectorized operations to create a new column in our dataframe called temperature_under75_high -obs_df["temperature_under75_high"] = obs_df["temperature_high"][obs_df["temperature_high"] < 75] - -# print the first few entries in our new columns -print(obs_df["temperature_Rankine_high"].head(3)) -print(obs_df["temperature_under75_high"].head(3)) ----- - -For this problem, create a new column in your dataframe called `myaverage_temp`.
This column should be the sum of the `temperature_high` and `temperature_low` divided by 2. - -[NOTE] -==== -If you run `print(obs_df["myaverage_temp"].head())`, the first five elements in the column should be 70.5, 69.5, 76.5, 76, and 76. -==== - -.Deliverables -==== -- a new column, `myaverage_temp`, that is the average of the `temperature_high` and `temperature_low` columns -==== - -=== Question 5 (2 pts) - -Let's finish up this project by taking the loops we wrote in Question 3 and rewriting them as one-line vectorized operations. Let's briefly rehash the loops we need to vectorize for this problem. - -. Write a one-line vectorized operation that creates a new column, `temperature_high_celsius`, that is the `temperature_high` column with its values converted from Fahrenheit to Celsius. -. Write a one-line vectorized operation that creates a new column, `my_hightemps`, with all of the values from the `temperature_high_celsius` that are greater than or equal to 24 degrees celsius -. Print the head of each of your new columns (hint: this is demonstrated in the previous question) - -The example code provided in the previous problem is quite similar to what you're being asked to do in this problem, so feel free to use it as a starting point! - -[NOTE] -==== -There are a few different possible correct results for the second task in this problem. For example, it is okay if your column has the temperature when it is greater than 24 degrees Celsius, and `NA` when it is not. -==== - -.Deliverables -==== -- The `temperature_high_celsius` column as described above -- The `my_hightemps` column as described above -- The heads of both columns -==== - -== Submitting your Work - -Whew! That project was tough! Looping, indexing, and vectorization are extremely important and powerful concepts, and its no small feat that you made it through this project! If you still feel that it would be tough for you to write a loop or vectorized operation from scratch, consider going back and slightly modifying questions, coming up with your own problems and solutions as practice. - -Next week we will slow down a bit and talk about _semantic structure_, the art of writing and commenting your code so it is beautiful, readable, and easy to understand. If these last couple projects have been a bit intense, this next one should be a welcome relief. As always, attend seminar, post to Piazza, and otherwise come to some office hours and get any and all the help you need! I hope that you are enjoying the class so far, and I look forward to continuing to learn with you all next week. - -.Items to submit -==== -- firstname_lastname_project3.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project4.adoc b/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project4.adoc deleted file mode 100644 index 52d6fa79a..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project4.adoc +++ /dev/null @@ -1,178 +0,0 @@ -= TDM 10100: Python Project 4 -- 2024 - -**Motivation:** Being able to write Python to perform tasks and analyze data is one thing. Having other people (or you, in the future) read your code and be able to interpret what it does and means is another. Writing clean, organized code and learning about basic syntax and whitespace rules is an important part of data science. This project will be dedicated to exploring some syntactic and whitespace-related rules that we've glossed over in previous projects, along with exploring some industry standards that are good to keep in mind when working on your own projects - both for this class and in the rest of your life. - -**Context:** We'll continue to use conditionals, lists, and looping as we move forward, but we won't be spending as much time on reviewing them individually. Feel free to review past weeks' projects and work for refreshers, as the groundwork we've laid up to this point will be the foundations we build on for the rest of this semester. - -**Scope:** Syntax, whitespace, nesting/code blocks, styleguides - -.Learning Objectives: -**** -- Know what syntax is and why its important -- Understand the role whitespace plays in Python and how to use it -- Develop some basic ideas about how to make your code look cleaner and limit nesting/spaghetti code -- Read up on some basic industry standards regarding style -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- /anvil/projects/tdm/data/olympics/athlete_events.csv - -== Questions - -=== Question 1 (2 pts) - -Firstly, let's explore _syntax_. Syntax is, simply put, a set of rules that are agreed upon to dictate how to structure a language. In the code world, _syntax_ often refers to the specific **keywords** that reside in each programming language, along with what symbols are used for things like operators, looping and conditionals, functions, and more! Additionally, syntax can refer to spacing. For example, while the below code is valid and will not produce errors if run by a _C_ compiler that is following _C syntax_, it would not work at all if run by a Python interpreter following _Python syntax_: - -[source, C] ----- -for (i=0;i<10;i++) { - printf("We're on loop %d",i); - } ----- - -Another good example would be operators. In `R`, for example, the modulus operator is `%%`. In `Python`, however, you (hopefully now) know that the modulus operator is just `%`. Below is some Python code using concepts we covered in previous projects. However, it has some syntax errors in it that make it error out. Your task in this question is to find the syntax errors, correct them, and run the code to figure out the secret sentence that is printed when the code is correct! (Hint: each line has one syntax error to fix, for a total of 7 syntax errors to fix! Running the code should give you hints as to what the errors are.) 
- -[source, python] ----- -secret == ["P", "i", "ur", "s", "du", " a", "e ", "ma", "Dat", "z", "a", "i", " M", "ng", "ine ", "!"] - -for i in range(0, len(secret), 2) - print(secret[i], end="" -for i in range(1, len(secret)) { - if i % 2 = 1: - print(secret(i), end="") -} ----- - -.Deliverables -==== -- The fixed version of the above code, and the secret sentence that results from running it -==== - -=== Question 2 (2 pts) - -As we move from the idea of syntax onto style and code cleanliness, let's first discuss a unique feature of Python: `whitespace`. While many other languages like C, R, JavaScript, Java, C++, and Rust provide explicit characters that denote when code is "inside of" a loop or conditional (think brackets `{}`, parentheses `()`, etc.) Python does not. Instead, Python uses the amount of `whitespace` before the code on that line starts in order to determine whether or not it is within something else. We've been doing this automatically in previous projects, but now let's explore it more intentionally. Take a look at the two examples below: - -[source, python] ----- -# example 1 -for i in range(5): - print("Loop Number", i) - print("Loop complete!") - -# example 2 -for i in range(5): - print("Loop Number", i) -print("Loop complete!") ----- - -As you can see by running this code in a Jupyter notebook, the results of each example are drastically different based only on indentation. - -[IMPORTANT] -==== -The amount of space before an indented piece of code is important. While the author of this project is fond of using `tab`s for his indentation, using 2 spaces or 4 spaces for each level of indentation is also quite common. However, you **CAN NEVER MIX** different styles of indentation. If part of your code is indented using tabs, and part is indented using spaces, it will not run. -==== - -While often the Python interpreter will catch errors in your code in advance and stop the code from running, this is not always the case (as demonstrated in the above examples). Many times, when Python whitespacing is not done as intended, errors that don't stop your code from running will happen. These are often called 'runtime errors' and can be tricky to catch until they start causing unintended results in your code. - -Below is some Python code to count the number of times the number "4" appears in a randomly generated list of 1000 numbers. However, this code contains 2 whitespace errors. Fix it so that it correctly counts the number of times "4" appears in our list. - -[source, python] ----- -import random - -# generate a 1000 number list of random numbers from 1-100 -number_list = random.choices(range(1,100), k=1000) -count = 0 - -for number in number_list: - if number == 4: - print("4 Detected!") - count += 1 - - print("Loop complete! Total number of 4's:", count) ----- - - -.Deliverables -==== -- Results of running the code above after correcting the two whitespace errors present -==== - -=== Question 3 (3 pts) - -Great! We now have a more formal idea behind the indentation we've been doing throughout our projects so far. Now let's explore the concept of `nesting`. `Nesting` is when some code falls 'within' other code. For example, actions within a conditional or a for loop are nested. Generally, we try and keep nesting to a minimum, as tracking 10 levels of indentation in your code to see what falls within where can be quite difficult visually. 
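As a small, contrived sketch (not part of the question), even four levels of nesting already make it hard to tell which `print` depends on which condition:

[source, python]
----
x = 12

# which conditions does each print actually depend on?
if x > 0:
    if x % 2 == 0:
        if x % 3 == 0:
            if x % 4 == 0:
                print("divisible by 12")
            print("divisible by 6")
        print("even")
    print("positive")
----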
Here is an important example to prove that being careful while nesting is necessary, using the Olympics data we used in a previous project: - -[source, python] ----- -import pandas as pd - -# read in our olympics dataframe -olympics_df = pd.read_csv("/anvil/projects/tdm/data/olympics/athlete_events.csv") - -# pick just the olympian from row 3 of our dataframe -my_olympian = olympics_df.iloc[3] - -# what does any of this mean? Very unreadable, bad code -if my_olympian["Sex"] == "M": - if my_olympian["Age"] > 20: - print("Class 1 Athlete!") - if my_olympian["Age"] < 30: - print("Class 2 Athlete!") - if my_olympian["Height"] > 180: - if my_olympian["Weight"] > 60: - print("Class 3 Athlete!") - print("Class 4 Athlete!") ----- - -If you think this code is unreadable and its hard to tell what it means to be a class 1 vs 2 vs 3 vs 4 athlete (classes entirely made up), you're correct. Nesting unnecessarily and in ways that don't make code easy to read can quickly render a decent project into unreadable spaghetti. - -Take a good look at the above code. Are there any unnecessary classes that mean the same thing? How could you rewrite it using all that you've learned so far to make it more readable (for example, using _else-if_ and _else_)? For this question, copy this code into your Jupyter notebook and make changes to render it readable, reducing nesting as much as possible. Your final code should have the following features: - -- 3 classes, with the one unnecessary class removed -- No more than a maximum level of nesting of 2 (aka, 3 indents on the most indented line) -- Should produce the same results as the messy code (minus the unnecessary class) - -[NOTE] -==== -One good way to test your work here would be to run your clean version and the messy version on a couple different olympians (by changing `X` in the `my_olympian = olympics_df.iloc[X]` line) and making sure both versions produce the same results. -==== - -.Deliverables -==== -- A cleaned up version of the messy code provided -- The results of running both clean and messy versions of the code on the same athlete -==== - -=== Question 4 (3 pts) - -For our last question on this project, we want you to explore some different style conventions suggested as standards for writing Python, and write about a few that sound interesting to you. Please visit https://peps.python.org/pep-0008/[this official Python Style Guide] and pick 3 different conventions discussed in the guide. For each convention, write a snippet of code that demonstrates the convention. At the end of the question, in a markdown cell, write at least a sentence or two about each convention describing what it is and why it is important. - -.Deliverables -==== -- 3 Python code snippets demonstrating three different style conventions -- a markdown cell with at least 3-6 sentences describing the conventions picked and their utility -==== - -== Submitting your Work - -If you're at this point, you've successfully capped off our introduction to whitespace, nesting, and styling code in Python. Leaving this project, you should have a better understanding of a lot of the less straightforward elements of writing code and how more abstract concepts like style and indentation can drastically affect the quality of your code, even if it functions as intended. Remember that this was only an introduction to the topics, and throughout your career you'll always be picking up new tricks and style conventions as you gain more experience and meet new people. 
- -Next week, we'll look more deeply at variables, variable types, and scope, and learn how profound the statement `x = 4` in Python really is! - -.Items to submit -==== -- firstname_lastname_project4.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project5.adoc b/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project5.adoc deleted file mode 100644 index 7070c78cb..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project5.adoc +++ /dev/null @@ -1,180 +0,0 @@ -= TDM 10100: Python Project 5 -- 2024 - -**Motivation:** So far in this class, we've been storing values to variables without really discussing any of the specifics of what's going on. In this project, we're going to do a detailed investigation of what _variables_ are, the different _types of data_ they can store, and how the _scope_ of a variable can affect its behavior. - -**Context:** There will be callbacks to previous projects throughout this project. Knowledge of basic operations with reading in and working with dataframes, constructing conditionals and loops, and using vectorized functions will be used in this project. - -**Scope:** Variables, types, and scoping - -.Learning Objectives: -**** -- Understand the concept of variables more widely -- Know the common types in Python and how to use them -- Understand what scoping is and basic best practicess -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- /anvil/projects/tdm/data/bay_area_bike_share/baywheels/202306-baywheels-tripdata.csv - -== Questions - -=== Question 1 (2 pts) - -A variable, at its most foundational level, is simply a _named area in memory_ that we can store values to and access values from by referencing its given name. In Python, variable names are considered valid as long as they start with either an upper/lowercase letter or an underscore in addition to not being any of the _Reserved Keywords_ built into Python (examples of reserved keywords are `True`, `None`, `and`, and more. 
A full list of reserved keywords can be https://realpython.com/lessons/reserved-keywords/[found here]) - -To review some of the concepts we've used in previous projects, your task in this question is to perform the following basic operations and assignments using variables: - -- Create a variable named `myname` and assign it the value of your name -- Create a variable named `myage` and assign it the value of your age -- Create a variable named `my_fav_colors` and assign it your top 3 favorite colors -- Create a variable named `about_me` and assign to it a list containing `myname`, `myage`, and `my_fav_colors` -- print `about_me` - -.Deliverables -==== -- The four lines of code specified above, and the results of running that code -==== - -=== Question 2 (2 pts) - -Alright, let's quickly review your work from the last problem. In your assignment statements, you likely used quotes around your name, nothing around your age and brackets around your lists. But why did you do that? - -The answer: `types`. Data can come in different types, and each type of data has a specific notation used to denote it when writing it. Let's quickly run through some basic types in Python: - -- Strings, or `str`, are used to store text data like `"Hello World!"`` -- Integers, or `int` are used to store whole number data like `5` and `1000` -- Floats, or `float` are used to store decimal numbers like `5.534234` or `0.1` -- Lists, or `list` are used to store lists of values. In Python, lists can contain different types at the same time. As demonstrated in the previous example, lists can also contain other lists! -- Booleans, or `bool` are logical truth values. The two main Python booleans are `True` and `False` -- Sets, dicts, tuples and more data types also exist in Python! In the next project we'll cover sets, dicts, and tuples in greater detail, as they can be very useful for organizing data. For now, just keep in mind that they exist. - -That's a lot, and these are just the basic types in Python! When we import a library like `pandas`, we also get any of the types they define! `Pandas dataframes` are their own type as well, and each column of a Pandas dataframe also typically has a type! - -Let's take a look at some real data and types. Read the Baywheels dataset (located at "/anvil/projects/tdm/data/bay_area_bike_share/baywheels/202306-baywheels-tripdata.csv") into a Pandas dataframe called `bike_data`. - -Once you've read the data in, use the https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html[.dtypes property] to list the data type of each column. Then use `.head()` to print the first few rows of our dataframe. - -Read through this https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes[official Pandas documentation] and, in a markdown cell, give a brief description of the types of data present in our dataframe and what they are used for. You can use `ctrl+f` to search for the type on the documentation page and read the descriptions given. Write at least a sentence or two on each type in our dataframe. - -.Deliverables -==== -- A sentence or two on each of the types in our dataframe and what they are used to store -==== - -=== Question 3 (2 pts) - -Fantastic, we've now got a feel for the different types available in our data. As a bit of an aside, let's spend this question cleaning up our data before we start experimenting on it. When you printed the head of our dataframe, you likely observed a few `NaN` (Not a Number) entries. 
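If you are curious how many missing entries there actually are, a quick optional check looks something like the sketch below (this assumes you already read the Baywheels file into `bike_data` as described above):

[source, python]
----
# count the missing (NaN) values in each column of bike_data
print(bike_data.isna().sum())

# count the total number of missing values across the whole dataframe
print(bike_data.isna().sum().sum())
----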
Often, for some reason or another, our data won't always have every column filled in for each row. This is okay, and we will explore ways to handle missing data in future projects, but for now let's learn how to isolate the data that is complete. - -First off, note the size of the dataframe currently. If you don't remember how to do this, we introduced the function in project 1. Refer to the documentation https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html[here] for additional help. - -Next, read through https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html[this documentation page] to get a feel for how the `dropna()` method works. Be sure to scroll down to the bottom of the page to see a few basic examples. Apply the `dropna()` method to our dataframe. - -Finally, print the size of the dataframe after dropping all the incomplete rows. How much smaller did it get? In a markdown cell, list the beginning and ending sizes of our dataframe, how many rows were dropped for containing bad values, and what percentage of our total dataframe was dropped. - -.Deliverables -==== -- Size of `bike_data` before and after `dropna()` -- Percentage of `bike_data` lost to `dropna()` -- Number of incomplete rows in our original `bike_data` dataframe -==== - -=== Question 4 (2 pts) - -Now that we've read in our data, understand the types available to us, and have cleaned up our nonexisting data, let's begin analyzing it and understanding how variables interact with operators. _Generally_, it is not good practice to try and apply operatons to different variables with different types (i.e. `"hello world!" + 5`) and the Python interpreter will typically stop you from doing this. Between two variables of the same type, however, many operators have defined behaviors that we haven't yet explored. - -For example, in previous projects we've used mathematical operators like `+` on integers and floats. However, many operators also have defined behavior with strings! Run the following code, and observe the output: - -[source, python] ----- -var1 = "My name is " -var2 = "Firstname" -var3 = "Lastname!" - -sentence = var1 + var2 + " " + var3 -print(sentence) ----- - -The above example is one of _concatenation_, the joining of two or more strings together, and has powerful practical applications. - -Let's explore the power of concatenation. Consider our bike data: if we want to figure out how many bikes we should put at each station, we'll likely need to understand which stations are used most often. Furthermore, we may want to know what trips are made most often, so that we can put more e-bicycle charging ports at spots along those trips. In order to find out what trips are made most often, we _could_ just count the number of trips that have both the same `start_station_id` and `end_station_id` _or_ we could construct a new column from those two columns, and then count our new "compound column" instead, which has the potential for making our code run a _lot_ faster. - -Take a look at the below example, where I am adding the `ride_id` and `rideable_type` columns to create a new column called `id_and_type` and then getting a count of the different id-type combos in our dataframe. Using a very similar structure, combine the `start_station_id` and `end_station_id` columns into a new column called `trip_id`, and return the top 5 trip IDs in our data. 
- -[source, python] ----- -# create new column -bike_data["id_and_type"] = bike_data["ride_id"] + "|" + bike_data["rideable_type"] - -# print dataframe to observe new column -print(bike_data.head(2)) - -# get count of top 5 values for each id-type combo in ascending order -# (note there is only one of each combo) -bike_data["id_and_type"].value_counts(ascending=False).head() ----- - -.Deliverables -==== -- A new column in `bike_data` called `trip_id` -- A count of the top 5 trip IDs in the data -==== - -=== Question 5 (2 pts) - -As a way to finish up this project, let's solve a problem and introduce an important concept that will be extremely relevant in the next few weeks: scope. Scope, simply put, is the level at which a variable exists. Variables with larger scope can be referenced in a wider amount of settings, whereas variables with extremely small scope may only be referenceable within the loop, function, or class that they are defined in. In Python, scope really only exists in regards to functions. We'll cover functions in detail soon, but for now, just note that they are similar to loops in that they have a header (similar to `if` or `for`) and body (code indented that is 'inside' the function). When variables are defined in a function, they don't exist outside that function by default. However, rather uniquely to Python, variables defined in loops do exist outside the loop by default. - -As a quick example, run the following code in your Jupyter notebook: - -[source, python] ----- -for i in range(5): - # do nothing - pass - -# shows that i exists even after the for loop ends -print(i) - -# define a function -def foo(): - # inside our function, define a variable then end function - bar = 3 - return - -# run our function, then try and print bar -# notice that bar does not exist outside the function's body -# so we get an error -foo() -print(bar) ----- - -After you run that code in your notebook, give https://www.w3schools.com/python/python_variables_global.asp[this webpage] a read. In a markdown cell, write a sentence or two about what making a variable `global` does. Then, write a sentence or two about how we could use `global` to make `bar` defined, even outside of our function's body. Again, you don't have to understand deeply how functions work at this point. - -.Deliverables -==== -- The results of running the above code -- A sentence or two on the `global` keyword -- A sentence or two on how to make `bar` exist outside of `foo()` -==== - -== Submitting your Work - -Now that you've completed this project, you hopefully have a much more in-depth understanding of variables and data types along with an introduction to data cleaning and variable scope! This project was quite broad, and next week we will be back to laser-focusing with a detailed investigation into dictionaries, sets, and tuples, three data types we mentioned in this project but warrant their own investigation. After that we'll be moving onto arguably the most important concept in all of code: functions. - -We are getting close to halfway through the semester, so please make sure that you are getting comfortable developing a workflow for these projects and learning the concepts incrementally. A lot of these concepts are very hierarchical: they build on top of each other. If you struggled with something in this project or any of the prior ones, I would encourage you to take advantage of one of the many avenues for getting advice or the opportunity to work with one of our TAs or Dr. 
Ward, so that going forward you are on the best possible footing for upcoming projects. Have a great rest of your week, and I look forward to working with you all in the next project. - -.Items to submit -==== -- firstname_lastname_project5.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project6.adoc b/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project6.adoc deleted file mode 100644 index d70f8a541..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project6.adoc +++ /dev/null @@ -1,177 +0,0 @@ -= TDM 10100: Python Project 6 -- 2024 - -**Motivation:** In previous projects we've employed lists as the main way to store lots of data to a variable. However, Python gives us access to plenty of other variable types that have their own benefits and uses and provide unique advantages in data analysis that are extremely important. In this project, we'll be exploring sets, tuples, and dictionaries in Python, focusing both on learning what they are and how to use them in a practical sense. - -**Context:** Understanding the basics of lists, looping, and manipulation of data in Pandas dataframes will be crucial while working through this project. - -**Scope:** Lists, sets, tuples, dicts, looping structures, Pandas - -.Learning Objectives: -**** -- Know the differences between sets, tuples, lists, and dicts in Python -- Know when to use each type of grouping variable, and common operations for each -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- /anvil/projects/tdm/data/youtube/USvideos.csv - -== Questions - -=== Question 1 (2 pts) - -Let's jump right into new topics with _dictionaries_. Dictionaries (and the other structures we'll be covering in this project) are essentially lists with slight modifications to their structure/properties. Conceptually, you can think of dictionaries as lists, where each element in the list is a pair of a key and a value associated with that key. For example, we may have a dictionary of names and ages for people. The keys, in this case, would be people's names, while the values would be their ages. An important thing to note is that keys **MUST** be unique - you cannot have duplicate keys - and thus using ages as the keys in our example would be much worse than using them as values. Take a look at the below code, where we make a dictionary and then print a couple of values. 
- -[source, python] ---- -# create a dictionary of names and ages -names_ages_dict = {"Marie Antoinette": 23, "Charles Darwin": 100, "Jimi Hendrix": 45, "James Cameron": 69} - -# print the age of James Cameron -print(f"James Cameron is {names_ages_dict['James Cameron']} years old") ---- - -For this problem, read the `/anvil/projects/tdm/data/youtube/USvideos.csv` data into a Pandas dataframe called "US_vids". Print the head of that dataframe using `.head()`. You'll notice a "category_id" column in the data. That could be useful! But we don't really have any idea what those numbers mean. In this question and the next, we'll create a new column in our dataframe that has the names of those categories. - -To do this, take a look at https://mixedanalytics.com/blog/list-of-youtube-video-category-ids/[this website]. Create a new dictionary called `id_names` where the keys are the IDs and the values are the names of each category. Print the value for key 23 by indexing into the dictionary (similar to how we did above). - -.Deliverables -==== -- The head of your new "US_vids" dataframe -- A dictionary with the names corresponding to each category ID. The IDs should be the dictionary keys -==== - -=== Question 2 (2 pts) - -Now that we have a dictionary that maps our IDs onto their names, we are ready to construct a new column in our dataframe. Luckily, Pandas provides us with a super useful method to perform this lookup and key-value matching for us: `.map()`. Read through http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html[this Pandas documentation on the method], and use your `id_names` dictionary to create a new column in your dataframe called "category". Once you've done so, print the head of your dataframe to ensure that the new column has been added as you expect. (Hint: https://stackoverflow.com/questions/29794959/pandas-add-new-column-to-dataframe-from-dictionary[This Stack Overflow] post may help guide you as well) - -Then use the https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html[`.value_counts()`] method to print a count of the different categories in our data, and sort it from most frequent to least frequent (Hint: We did this in Project 5, question 4). - -[NOTE] -==== -To validate your work, we will provide you with the top 5 most frequent categories and how often they occurred: - -- Entertainment: 9964 -- Music: 6472 -- Howto & Style: 4146 -- Comedy: 3457 -- People & Blogs: 3210 -==== - -.Deliverables -==== -- A new column, `category`, in the dataframe `US_vids` -- A count of categories in the data, sorted by most to least frequent -==== - -=== Question 3 (2 pts) - -Now that we've got a working understanding of dictionaries, let's talk about _sets_. If you're familiar with "set theory" in mathematics, you likely already know about these; if not, you're about to learn! A set is similar to a list in that it contains a series of elements. However, the main difference is that sets do not contain any duplicate elements, and they have no order. - -Sets are extremely useful for comparing groups of items with each other. For example, let's say I create two sets: a set of all my favorite colors and a set of all your favorite colors. If I wanted to see which colors were both my favorite and your favorite, I could find the "intersection" of those two sets. Python has a handy method that does this (and other common set operations) for us. - -In this problem, we want to figure out two things: - -. 
How often do videos with comments disabled have ratings disabled as well? -. What overlap is there between "comedy" videos and videos that have both comments and ratings disabled? - -As some guidance here, you could, for example, construct a set of videos that have comments disabled like so: - -[source, python] ----- -no_comment_vids = set(US_vids["video_id"][US_vids["comments_disabled"] == True]) ----- - -You could then use, for example, `.intersection()` to compare this set to the set of videos with ratings disabled, and compare the total number of videos with comments disabled to those with comments and ratings disabled. (For a full list of set methods, https://www.w3schools.com/python/python_ref_set.asp[click here]) - -[NOTE] -==== -If you wanted to easily get a set of videos with both comments and ratings on, you could use the intersection of the set of videos with comments on and the set of videos with ratings on. However, you could also get the difference between the set of all videos and the set of videos with either comments, ratings, or neither enabled, but not both. There are almost always multiple ways to solve things with sets. -==== - -.Deliverables -==== -- The proportion or percentage of videos with comments disabled that also have ratings disabled -- The proportion or percentage of "comedy" videos with both comments and ratings enabled -==== - -=== Question 4 (2 pts) - -Interesting. It looks like most comedy videos have most ratings and comments enabled. That makes sense, right? Comedians rely a lot on community feedback to improve their routines, so we would probably expect that they want to encourage things like leaving feedback and voting on whether they liked the video or not. However, we have a _LOT_ of categories in our data. Do you think this will hold for all the others? - -In this question, we want you to create a dictionary named `category_censorship` where the keys are the names of the categories in our data, and the values are the percentage of videos in that category that have both comments and ratings enabled. We've provided some starter code for you below, and if you use your work from the last question the actual amount of new code you'll have to write will be minimal: - -[source, python] ----- -# create empty dictionary -category_censorship = {} - -for category in set(US_vids["category"]): - # figure out how much of the category is censored using sets - # (Hint: This is very similar to the last problem) - - percent_censored = # Fill this in as needed - - category_censorship[category] = percent_censored - -# fancy printing to make results look nicer -for key, val in category_censorship.items(): - print(f"{format(key, '21s')} is {format(val, '.2f')}% uncensored") ----- - -Be sure to print your final results for the category. If you want to make things look better, you can try and sort your dictionary based on percentage of censored videos, and even make pretty formatting for your printed results, but you don't need to in order to get full credit for this problem. - -.Deliverables -==== -- Your printed `category_censorship` dictionary, defined as described above. -==== - -=== Question 5 (2 pts) - -Let's finish up the project by discussing tuples. Tuples are very unique in that they are almost identical to lists. They are a collection of elements, they can contain elements of all the same type or different types, and they are ordered. The differences between lists and tuples, however, are quite meaningful. 
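As a quick illustration of that difference, here is a minimal sketch you can run in a scratch cell (the variable names below are made up purely for demonstration):

[source, python]
----
# a list and a tuple holding the same elements
my_list = [1, 2, 3]
my_tuple = (1, 2, 3)

# lists can be modified in place
my_list[0] = 99
my_list.append(4)
print(my_list)

# tuples cannot be modified; assigning to an element raises a TypeError
try:
    my_tuple[0] = 99
except TypeError as err:
    print(f"Could not modify the tuple: {err}")
----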
Once a tuple has been created, it is **immutable**, meaning it can't be changed. You can't add elements, you can't remove elements, and you can't modify elements. - -You might be wondering what the utility of tuples is at all; after all, so far the dynamic nature of data structures has been a strength, not a weakness. One of their strongest uses is one we've already been taking advantage of without really acknowledging it: data storage. When we store data in a tabular format, we have defined columns and rows, where each column stores only one type of data, and each row stores an "entry". A solid example of this would be our current Youtube data. If you were to add in some new data, you would want it to always be of some fixed length (with entries or empty spaces for each column). You could use a tuple to store each row of data, and then insert the tuple into your existing dataframe/table. - -For this question your task is to create your own table. Choose some subset of the `US_vids` dataframe (for example, comedy videos only) and create a table using tuples for the rows and a list to store all the rows. Be sure that the first row in your table is made up of the column headers. - -To complete the question, run the relevant section of the below code to print out the first 5 entries of your table. - -[NOTE] -==== -If you're struggling at figuring out how to do this, take a look at https://www.amelt.net/en/iwm/programming-iwm/en-python/6156/[this page] for a good starting point. -==== - -[source, py] ----- -# if you use a list to store your rows, run this: -for index, row in enumerate(mytable[0:5]): - print(f"Row {index} ID: {row[0]}") ----- - -.Deliverables -==== -- A table of your own design that uses tuples to store data -- The results of running the provided print statements -==== - -== Submitting your Work - -This project caps our section of the course on basic variable types and group-based variables in Python. In closing out this project, we have learned the basic variable types available to us, common use cases for each, and how we can practically apply them in order to store, access, manipulate, and analyze data in an organized and efficient manner. - -In the next series of projects, we'll be diving into one of the deepest, most important parts of all of data science in Python: functions. These upcoming projects will be an amalgamation of everything you've learned so far, and once you have functions under your belt you'll really have all the basic tools native to Python that you need. Be sure you understand everything so far, as the next projects will continue to challenge and expand on what we've learned. Never hesitate to reach out for assistance as needed. See you next week! - -.Items to submit -==== -- firstname_lastname_project6.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project7.adoc b/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project7.adoc deleted file mode 100644 index 3f0437f3c..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project7.adoc +++ /dev/null @@ -1,225 +0,0 @@ -= TDM 10100: Python Project 7 -- 2024 - -**Motivation:** Functions are the backbone of code. Whether you have a goal of building a complex internet server, designing and creating your own videogame, or analyzing enormous swaths of data instantly, you will need to have a strong working knowledge of functions to do it. Functions enable you to write more readable code, apply custom-made operations in novel ways, and overall are a necessity as a data scientist. In this project, we'll begin to explore the differences between functions and methods in Python, and start to write our own as well! - -**Context:** Again, we'll be building off of all the previous projects here. A strong ability to work with lists and dataframes, analyze documentation and learn from it, and iterate through large amounts of data using a variety of approaches will set you up for success in this project - -**Scope:** Functions, objects, methods, Python - -.Learning Objectives: -**** -- Learn what a function is -- Learn the difference between a function and a method -- Learn about a few common, high-utility functions in Python -- Design and write your first functions -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- /anvil/projects/tdm/data/flights/1987.csv -- /anvil/projects/tdm/data/beer/reviews.csv - -[IMPORTANT] -==== -This project will be working with some larger datasets, so please be sure to reserve 3 cores when starting your Jupyter session. If you are getting a "kernel died" message when you attempt to run your code, feel free to increase it to 4 cores. If your kernel is still dying, please reach out to someone on the TDM team for assistance. You should not reserve more than 4 cores for this project. -==== - -== Questions - -=== Question 1 (2 pts) - -Let's begin by discussing functions. While we've already used numerous functions throughout this semester, we have not yet taken the time to explore what is going on behind the scenes. A function is, at its most basic, some code that takes input data, and returns output data. We often refer to the input data as 'parameters' or 'arguments' and the output data 'return values' or simply 'outputs'. The usefulness of functions comes from their reusability. If we need to do the same action at different points in the code, we can define a function that performs that action and use it repeatedly to make our code cleaner and more readable. - -Functions are first **defined**: their name, the number of arguments and the name of each argument, the code that is performed each time the function is called, and what values are returned are defined in this part of the code. Then, when we want to use a function, we **call** it by writing its name along with any arguments we want to give it. - -Let's look at a brief example below, demonstrating how a function is defined, the inputs it takes, and the value it returns. 
Please copy this into your Jupyter notebook and experiment with it to really get a feel for how things work before attempting to complete this question. Pay special attention to the comments dissecting each part of the code in detail if you are still having trouble understanding the program flow - -[source, python] ----- -# Nothing in between the equals signs gets run until -# the function is called!! -# ===================================================== -# define a function called foo, that takes two -# arguments: bar and buzz -def foo(bar, buzz): - - # use the first argument, bar, in a print statement - print(f"Hello {bar}, how are you?") - - # return an output of buzz times 10 - return buzz * 10 -# ===================================================== - -# call the function with our own arguments, and -# store the output to funcOut1 -funcOut1 = foo("Jackson", 20) -print(funcOut1) - -# we can also pass in defined variables as arguments, -# like so: -var1 = "Jimbob" -var2 = 13 -funcOut2 = foo(var1, var2) -print(funcOut2) ----- - -For this question, we want you to define your own function called `is_leap()` that takes one variable, called `year`, as input, and returns `True` if `year` is a leap year and `False` if year is not a leap year. (Hint: You should already have the code to do this in project 2, you just have to turn it into a function!!) - -[NOTE] -==== -Here are some test cases for you to use to double-check that your code is working as expected. -- 1896, 2000, 2004, 2008, and 2024 are all leap years -- 1700, 1900, and 2010 are all not leap years -==== - -.Deliverables -==== -- A function, `is_leap()`, that returns a boolean dictating whether or not a function is a leap year or not -==== - -=== Question 2 (2 pts) - -Awesome. We now know in a real sense what a function is, how to define it, and how to use it in our code. Let's keep building on this by reading in some data and learning how we can apply functions to dataframes all at once! - -[IMPORTANT] -==== -If you missed the note at the top of the project, I will reiterate here once more: you will very likely need to use at least 3 cores for this project. If you get a "kernel died" message, try using 3-4 cores instead of 2. If your kernel is still dying at 4 cores, please reach out to someone on the TDM team for assistance. -==== - -First off, read the "/anvil/projects/tdm/data/flights/1987.csv" data into a new dataframe called `flights_1987`. Print the head of the dataset. - -You should notice a column called "DayOfWeek". Write a function called `dayNamer()` that, given a number for day of the week, returns the name of that day. Run it on at least 3 different rows of the data to verify that it works. (Hint: DayOfWeek depicts Monday as 1, Tuesday as 2, and so on.) - -[NOTE] -==== -The first 5 days in the data, in order, are Friday, Saturday, Thursday, Friday, Saturday. You can use this to test your function. -==== - -You can use the below code to test your function: - -[source, python] ----- -for index, i in enumerate(flights_1987["DayOfWeek"]): - print(f"Day {index + 1}: {dayNamer(i)}") - if index == 4: - break ----- - -.Deliverables -==== -- a function called `dayNamer` that takes as input a number for the day of the week and returns as output a string that is the name of the day. -==== - -=== Question 3 (2 pts) - -Great, we now have a function that converts a day number into a day name. Let's use this function to create a new column, "day_name", in our Pandas dataframe. However, there is a caveat: this is a **LOT** of data. 
If you try to iterate through it all with a for loop, your kernel will very likely die (or at the least run very slowly). - -Introducing: the https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html[`.apply()`] method. `.apply()` will allow us to apply a function to every row (or column, if you so choose) in a dataframe in a concise and efficient manner, without writing the loop ourselves. Take a look at the below example, where I use this method to create a new column in our data using a nonsensical function I've written (very similar to what you have to do for this question). - -[source, python] ---- -def scaler(year): - return year * 1000 - -flights_1987["nonsense_years"] = flights_1987["Year"].apply(scaler) -flights_1987.head() ---- - -[NOTE] -==== -You can check your work here by printing the head of your dataframe and making sure the first 5 day names match the days listed in the previous question. -==== - -.Deliverables -==== -- A new column, "day_name", in the dataframe, generated using your `dayNamer()` function and `.apply()`, that contains the day name corresponding to the pre-existing 'DayOfWeek' column -==== - -=== Question 4 (2 pts) - -Now that we've got a good grasp on functions, let's discuss a small distinction we haven't yet covered. You likely noticed mention of `methods` several times throughout this course, often in association with things like `.head()` or, more recently, `.apply()`. - -In a simple sense, methods are very similar to functions. They even look similar in how they are called and defined. While we won't be covering Object Oriented Programming in TDM 101, and thus won't be covering methods in detail, it's important to understand some basics. - -A method, similar to a function, is a reusable chunk of code. However, it is tied to an _object_, which is a tough concept to describe. Let's consider an example. If I define an object that is a "Basketball", methods are actions that I can perform with/on that object. For example, I might have methods like `.deflate()` or `.inflate()` or `.bounce()` to use with my basketball. If I have an instance of a "Basketball" object named "my_basketball", for example, I could call my methods by running code like `my_basketball.inflate()` or `my_basketball.bounce()`. - -One good example of this in code we've worked with is `.head()`. `.head()` is a method that works on a "dataframe" object and returns the first 5 rows of the data. Another example is the `.apply()` method that we used in the last question, which applies the function you provide as an argument to each row of the dataframe object that it is called on. - -In this question, we want you to explore some new methods that we can use with Pandas dataframes. First, use `.value_counts()` to get a count of how many times each day occurs in the data (using the 'day_name' column you made in the last question). Then, use `len()` and division to figure out what percentage of the days in our data fall on each day of the week. Your final result should contain printed output with what proportion (or percentage) of our data occurred on each day of the week. Do not use any looping to solve this problem, as it will be both significantly slower and defeat the purpose of using `.value_counts()` and `len()`. - -Finally, in a markdown cell, describe whether `.value_counts()` and `len()` are methods or functions. Justify your answer.
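If it helps to see the general shape of a loop-free proportion calculation before you tackle the flight data, here is a minimal sketch on a tiny, made-up dataframe (the toy data and the `fruit` column are purely for illustration, not part of the project data):

[source, python]
----
import pandas as pd

# a tiny, made-up dataframe just to demonstrate the pattern
toy_df = pd.DataFrame({"fruit": ["apple", "banana", "apple", "apple", "pear"]})

# counts of each value divided by the total number of rows gives proportions
proportions = toy_df["fruit"].value_counts() / len(toy_df)
print(proportions)
----

The same two building blocks, `.value_counts()` and `len()`, are all you need for the question above.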
- -[NOTE] -==== -We've now used https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html[`.value_counts()`] and https://www.w3schools.com/python/ref_func_len.asp[`len()`] in multiple projects, but feel free to refer back to their docs pages if necessary. -==== - -.Deliverables -==== -- The proportion of each day in our dataset, printed out. -- A markdown cell identifying which of `.value_counts()` and `len()` was a function and which was a method. -==== - -=== Question 5 (2 pts) - -For this last question, we'll start getting into the more complex functions that we'll be spending lots of time on in the next few projects. The function you will write for this question is as follows: - -- called `prop_dict_maker()` -- Takes two arguments, a dataframe and a column -- Returns a dictionary of the proportions of each value in that column - -If you're struggling with where to start, try and approach this problem like so: - -. First, write some code to do this on a specific dataframe and column of your choice (Hint: We did this in the last problem!) -. Next, wrap that code in a function definition, and replace the dataframe and column you chose with your function arguments as needed. -. Finally, be sure that you are returning a dictionary as expected, and test your function a few times with known results. - -Finally, run the following code: - -[source, python] ----- -# import our library -import pandas as pd - -# read in some beer review data -beer_reviews = pd.read_csv("/anvil/projects/tdm/data/beer/reviews.csv") - -# get a dictionary of user proportions -top_users = prop_dict_maker(beer_reviews, "username") - -# print the top 5 users in the data -print(sorted(top_users, key=top_users.get, reverse=True)[:5]) ----- - -Which should have an output like this if you did everything correctly: - -`['Sammy', 'kylehay2004', 'acurtis', 'StonedTrippin', 'jaydoc']` - -.Deliverables -==== -- The `prop_dict_maker()` function as described above -- The results of running the provided testing code using your `prop_dict_maker()` function -==== - -== Submitting your Work - -Congratulations, you've finished your first in-depth project on functions in Python! Going forward, you should be getting quite comfortable in writing your own functions to analyze data, perform calculations, and otherwise simplify repetitive tasks in your code. You should also be able to differentiate between methods and functions, and understand what notation you should use when calling something based on whether it is a function or a method. - -In the next project, we'll finish up our exploration of functions in Python, and begin exploring visualizing data and analyzing it to create good summary statistics and graphics. - -.Items to submit -==== -- firstname_lastname_project7.ipynb -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project8.adoc b/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project8.adoc deleted file mode 100644 index 00a2e530d..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-project8.adoc +++ /dev/null @@ -1,87 +0,0 @@ -= TDM 10100: Python Project 8 -- 2024 - -**Motivation:** Ipsum lorem - -**Context:** Ipsum lorem - -**Scope:** Ipsum lorem - -.Learning Objectives: -**** -- Ipsum lorem -- Ipsum lorem -- Ipsum lorem -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- Ipsum lorem -- Ipsum lorem - -== Questions - -=== Question 1 - -Ipsum lorem dolor sit amet, consectetur adipiscing elit - -.Deliverables -==== -- Ipsum lorem -==== - -=== Question 2 - -Ipsum lorem dolor sit amet, consectetur adipiscing elit - -.Deliverables -==== -- Ipsum lorem -==== - -=== Question 3 - -Ipsum lorem dolor sit amet, consectetur adipiscing elit - -.Deliverables -==== -- Ipsum lorem -==== - -=== Question 4 - -Ipsum lorem dolor sit amet, consectetur adipiscing elit - -.Deliverables -==== -- Ipsum lorem -==== - -=== Question 5 - -Ipsum lorem dolor sit amet, consectetur adipiscing elit - -.Deliverables -==== -- Ipsum lorem -==== - -== Submitting your Work - -This is where we're going to say how to submit your work. Probably a bit of copypasta. - -.Items to submit -==== -- Ipsum lorem -- Ipsum lorem -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/current-projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-projects.adoc b/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-projects.adoc deleted file mode 100644 index 3adcb8268..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2025/10200/10200-2025-projects.adoc +++ /dev/null @@ -1,51 +0,0 @@ -= TDM 10200 - -== Important Links - -xref:spring2025/logistics/office_hours.adoc[[.custom_button]#Office Hours#] -xref:spring2025/logistics/syllabus.adoc[[.custom_button]#Syllabus#] -https://piazza.com/purdue/fall2024/tdm1010010200202425[[.custom_button]#Piazza#] - -== Assignment Schedule - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. 
-==== - -|=== -| Assignment | Release Date | Due Date -| Syllabus Quiz | Jan 13, 2025 | Jan 24, 2025 -| Academic Integrity Quiz | Jan 13, 2025 | Jan 24, 2025 -| Project 1 - | Jan 13, 2025 | Jan 29, 2025 -| Project 2 - | Jan 27, 2025 | Feb 5, 2025 -| Project 3 - | Feb 3, 2025 | Feb 12, 2025 -| Outside Event 1 | Jan 13, 2025 | Feb 14, 2025 -| Project 4 - | Feb 10, 2025 | Feb 19, 2025 -| Project 5 - | Feb 17, 2025 | Feb 26, 2025 -| Project 6 - | Feb 24, 2025 | Mar 5, 2025 -| Project 7 - | Mar 3, 2025 | Mar 12, 2025 -| Outside Event 2 | Jan 13, 2025 | Mar 14, 2025 -| Project 8 - | Mar 10, 2025 | Mar 26, 2025 -| Project 9 - | Mar 24, 2025 | Apr 2, 2025 -| Project 10 - | Mar 31, 2025 | Apr 9, 2025 -| Project 11 - | Apr 7, 2025 | Apr 16, 2025 -| Outside Event 3 | Jan 13, 2025 | Apr 18, 2025 -| Project 12 - | Apr 14, 2025 | Apr 23, 2025 -| Project 13 - | Apr 21, 2025 | Apr 30, 2025 -| Project 14 - Class Survey | Apr 24, 2025 | May 2, 2025 -|=== - -[WARNING] -==== -Projects are **released on Mondays**, and are due 1 week and 2 days later on the following **Wednesday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/spring2025/10200/images/1-1.png b/projects-appendix/modules/ROOT/pages/spring2025/10200/images/1-1.png deleted file mode 100644 index 5725b1061..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2025/10200/images/1-1.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2025/10200/images/1-2.png b/projects-appendix/modules/ROOT/pages/spring2025/10200/images/1-2.png deleted file mode 100644 index 20887e076..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2025/10200/images/1-2.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2025/10200/images/1-3.png b/projects-appendix/modules/ROOT/pages/spring2025/10200/images/1-3.png deleted file mode 100644 index 25af69c7c..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2025/10200/images/1-3.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2025/10200/images/1-4.png b/projects-appendix/modules/ROOT/pages/spring2025/10200/images/1-4.png deleted file mode 100644 index e1d449f22..000000000 Binary files a/projects-appendix/modules/ROOT/pages/spring2025/10200/images/1-4.png and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/spring2025/20200/20200-2025-projects.adoc b/projects-appendix/modules/ROOT/pages/spring2025/20200/20200-2025-projects.adoc deleted file mode 100644 index 5096a91a1..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2025/20200/20200-2025-projects.adoc +++ /dev/null @@ -1,51 +0,0 @@ -= TDM 20200 - -== Important Links - -xref:spring2025/logistics/office_hours.adoc[[.custom_button]#Office Hours#] -xref:spring2025/logistics/syllabus.adoc[[.custom_button]#Syllabus#] -https://piazza.com/purdue/fall2024/tdm2010020200202425[[.custom_button]#Piazza#] - -== Assignment Schedule - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -|=== -| Assignment | Release Date | Due Date -| Syllabus Quiz | Jan 13, 2025 | Jan 24, 2025 -| Academic Integrity Quiz | Jan 13, 2025 | Jan 24, 2025 -| Project 1 - | Jan 13, 2025 | Jan 29, 2025 -| Project 2 - | Jan 27, 2025 | Feb 5, 2025 -| Project 3 - | Feb 3, 2025 | Feb 12, 2025 -| Outside Event 1 | Jan 13, 2025 | Feb 14, 2025 -| Project 4 - | Feb 10, 2025 | Feb 19, 2025 -| Project 5 - | Feb 17, 2025 | Feb 26, 2025 -| Project 6 - | Feb 24, 2025 | Mar 5, 2025 -| Project 7 - | Mar 3, 2025 | Mar 12, 2025 -| Outside Event 2 | Jan 13, 2025 | Mar 14, 2025 -| Project 8 - | Mar 10, 2025 | Mar 26, 2025 -| Project 9 - | Mar 24, 2025 | Apr 2, 2025 -| Project 10 - | Mar 31, 2025 | Apr 9, 2025 -| Project 11 - | Apr 7, 2025 | Apr 16, 2025 -| Outside Event 3 | Jan 13, 2025 | Apr 18, 2025 -| Project 12 - | Apr 14, 2025 | Apr 23, 2025 -| Project 13 - | Apr 21, 2025 | Apr 30, 2025 -| Project 14 - Class Survey | Apr 24, 2025 | May 2, 2025 -|=== - -[WARNING] -==== -Projects are **released on Mondays**, and are due 1 week and 2 days later on the following **Wednesday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. 
- -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2025/30200/30200-2025-projects.adoc b/projects-appendix/modules/ROOT/pages/spring2025/30200/30200-2025-projects.adoc deleted file mode 100644 index 30b06e9fb..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2025/30200/30200-2025-projects.adoc +++ /dev/null @@ -1,51 +0,0 @@ -= TDM 30200 - -== Important Links - -xref:spring2025/logistics/office_hours.adoc[[.custom_button]#Office Hours#] -xref:spring2025/logistics/syllabus.adoc[[.custom_button]#Syllabus#] -https://piazza.com/purdue/fall2024/tdm3010030200202425[[.custom_button]#Piazza#] - -== Assignment Schedule - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -|=== -| Assignment | Release Date | Due Date -| Syllabus Quiz | Jan 13, 2025 | Jan 24, 2025 -| Academic Integrity Quiz | Jan 13, 2025 | Jan 24, 2025 -| Project 1 - | Jan 13, 2025 | Jan 29, 2025 -| Project 2 - | Jan 27, 2025 | Feb 5, 2025 -| Project 3 - | Feb 3, 2025 | Feb 12, 2025 -| Outside Event 1 | Jan 13, 2025 | Feb 14, 2025 -| Project 4 - | Feb 10, 2025 | Feb 19, 2025 -| Project 5 - | Feb 17, 2025 | Feb 26, 2025 -| Project 6 - | Feb 24, 2025 | Mar 5, 2025 -| Project 7 - | Mar 3, 2025 | Mar 12, 2025 -| Outside Event 2 | Jan 13, 2025 | Mar 14, 2025 -| Project 8 - | Mar 10, 2025 | Mar 26, 2025 -| Project 9 - | Mar 24, 2025 | Apr 2, 2025 -| Project 10 - | Mar 31, 2025 | Apr 9, 2025 -| Project 11 - | Apr 7, 2025 | Apr 16, 2025 -| Outside Event 3 | Jan 13, 2025 | Apr 18, 2025 -| Project 12 - | Apr 14, 2025 | Apr 23, 2025 -| Project 13 - | Apr 21, 2025 | Apr 30, 2025 -| Project 14 - Class Survey | Apr 24, 2025 | May 2, 2025 -|=== - -[WARNING] -==== -Projects are **released on Mondays**, and are due 1 week and 2 days later on the following **Wednesday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. 
-==== diff --git a/projects-appendix/modules/ROOT/pages/spring2025/40200/40200-2025-projects.adoc b/projects-appendix/modules/ROOT/pages/spring2025/40200/40200-2025-projects.adoc deleted file mode 100644 index 8dcf32cc8..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2025/40200/40200-2025-projects.adoc +++ /dev/null @@ -1,51 +0,0 @@ -= TDM 40200 - -== Important Links - -xref:spring2025/logistics/office_hours.adoc[[.custom_button]#Office Hours#] -xref:spring2025/logistics/syllabus.adoc[[.custom_button]#Syllabus#] -https://piazza.com/purdue/fall2024/tdm4010040200202425[[.custom_button]#Piazza#] - -== Assignment Schedule - -[NOTE] -==== -Only the best 10 of 14 projects will count towards your grade. -==== - -[CAUTION] -==== -Topics are subject to change. While this is a rough sketch of the project topics, we may adjust the topics as the semester progresses. -==== - -|=== -| Assignment | Release Date | Due Date -| Syllabus Quiz | Jan 13, 2025 | Jan 24, 2025 -| Academic Integrity Quiz | Jan 13, 2025 | Jan 24, 2025 -| Project 1 - | Jan 13, 2025 | Jan 29, 2025 -| Project 2 - | Jan 27, 2025 | Feb 5, 2025 -| Project 3 - | Feb 3, 2025 | Feb 12, 2025 -| Outside Event 1 | Jan 13, 2025 | Feb 14, 2025 -| Project 4 - | Feb 10, 2025 | Feb 19, 2025 -| Project 5 - | Feb 17, 2025 | Feb 26, 2025 -| Project 6 - | Feb 24, 2025 | Mar 5, 2025 -| Project 7 - | Mar 3, 2025 | Mar 12, 2025 -| Outside Event 2 | Jan 13, 2025 | Mar 14, 2025 -| Project 8 - | Mar 10, 2025 | Mar 26, 2025 -| Project 9 - | Mar 24, 2025 | Apr 2, 2025 -| Project 10 - | Mar 31, 2025 | Apr 9, 2025 -| Project 11 - | Apr 7, 2025 | Apr 16, 2025 -| Outside Event 3 | Jan 13, 2025 | Apr 18, 2025 -| Project 12 - | Apr 14, 2025 | Apr 23, 2025 -| Project 13 - | Apr 21, 2025 | Apr 30, 2025 -| Project 14 - Class Survey | Apr 24, 2025 | May 2, 2025 -|=== - -[WARNING] -==== -Projects are **released on Mondays**, and are due 1 week and 2 days later on the following **Wednesday, by 11:59pm**. Late work is **not** accepted. We give partial credit for work you have completed -- **always** submit the work you have completed before the due date. If you do _not_ submit the work you were able to get done, we will _not_ be able to give you credit for the work you were able to complete. - -**Always** double check that the work that you submitted was uploaded properly. See xref:submissions.adoc[here] for more information. - -Each week, we will announce in Piazza that a project is officially released. Some projects, or parts of projects may be released in advance of the official release date. **Work on projects ahead of time at your own risk.** These projects are subject to change until the official release announcement in Piazza. -==== diff --git a/projects-appendix/modules/ROOT/pages/spring2025/logistics/office_hours.adoc b/projects-appendix/modules/ROOT/pages/spring2025/logistics/office_hours.adoc deleted file mode 100644 index d44ec4f69..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2025/logistics/office_hours.adoc +++ /dev/null @@ -1,33 +0,0 @@ -= Spring 2025 Office Hours Schedule - -[IMPORTANT] -==== -Office hours after 5 PM will be held exclusively virtually, whereas office hours prior to 5 will be offered both in-person in the lobby of Hillenbrand Hall and remotely. - -Office Hours Zoom Link: https://purdue-edu.zoom.us/s/97774213087 - -Checklist for the Zoom Link: - -* When joining office hours, please include your Data Mine level in front of your name. 
For example, if you are in TDM 102, your name should be entered as “102 - [Your First Name] [Your Last Name]”. - -* After joining the Zoom call, please stay in the main room until a TA invites you to a specific breakout room. - -* We will continue to follow the office hours schedule as posted on the Examples Book. (https://the-examples-book.com/projects/spring2025/logistics/office_hours) -==== - -[NOTE] -==== -The below calendars represent regularly occurring office hours. Please check your class' Piazza page to view the latest information about any upcoming changes or cancellations for office hours prior to attending. -==== - -== TDM 10100 -image::f24-101-OH.png[10100 Office Hours Schedule, width=1267, height=800, loading=lazy, title="10100 Office Hours Schedule"] - -== TDM 20100 -image::f24-201-OH.png[20100 Office Hours Schedule, width=1267, height=800, loading=lazy, title="20100 Office Hours Schedule"] - -== TDM 30100 -image::f24-301-OH.png[30100 Office Hours Schedule, width=1267, height=800, loading=lazy, title="30100 Office Hours Schedule"] - -== TDM 40100 -image::f24-401-OH.png[40100 Office Hours Schedule, width=1267, height=800, loading=lazy, title="40100 Office Hours Schedule"] \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/spring2025/logistics/syllabus.adoc b/projects-appendix/modules/ROOT/pages/spring2025/logistics/syllabus.adoc deleted file mode 100644 index 4c50b89e2..000000000 --- a/projects-appendix/modules/ROOT/pages/spring2025/logistics/syllabus.adoc +++ /dev/null @@ -1,280 +0,0 @@ -= Spring 2025 Syllabus - The Data Mine Seminar - -== Course Information - -[%header,format=csv,stripes=even] -|=== -Course Number and Title, CRN -TDM 10200 - The Data Mine II, possible CRNs 19799 or 19803 or 19810 or 19841 or 28531 or 27497 or 27501 or 27505 -TDM 20200 - The Data Mine IV, possible CRNs 19800 or 19805 or 19811 or 19842 or 28529 or 27498 or 27502 or 27506 -TDM 30200 - The Data Mine VI, possible CRNs 19801 or 19807 or 19817 or 19843 or 28532 or 27495 or 27499 or 27503 -TDM 40200 - The Data Mine VIII, possible CRNs 19802 or 19808 or 19821 or 19859 or 28530 or 27496 or 27500 or 27504 -TDM 50100 - The Data Mine Seminar, possible CRNs 20007 or 19997 or 20010 or 20006 or 18013 or 17984 or 18009 -|=== - -*Course credit hours:* -1 credit hour, so you should expect to spend about 3 hours per week doing work for the class - -*Prerequisites:* -TDM 10100 and TDM 10200 can be taken in either order. Both of these courses are introductory. TDM 10100 is an introduction to data analysis in R. TDM 10200 is an introduction to data analysis in Python. - -For all of the remaining TDM seminar courses, students are expected to take the courses in order (with a passing grade), namely, TDM 20100, 20200, 30100, 30200, 40100, 40200. The topics in these courses build on the knowledge from the previous courses. All students, regardless of background are welcome. TDM 50100 is geared toward graduate students and can be taken repeatedly; TDM 50100 meets concurrently with the other courses, at whichever level is appropriate for the graduate students in the course. We can make adjustments on an individual basis if needed. - - -=== Course Web Pages - -- link:https://the-examples-book.com/[*The Examples Book*] - All information will be posted within these pages! -- link:https://www.gradescope.com/[*Gradescope*] - All projects and outside events will be submitted on Gradescope -- link:https://purdue.brightspace.com/[*Brightspace*] - Grades will be posted in Brightspace. 
Students will also take the quizzes at the beginning of the semester on Brightspace -- link:https://piazza.com[*Piazza*] - Online Q/A Forum -- link:https://datamine.purdue.edu[*The Data Mine's website*] - Helpful resource -- link:https://ondemand.anvil.rcac.purdue.edu/[*Jupyter Lab via the On Demand Gateway on Anvil*] - -=== Meeting Times -There are officially 4 Monday class times: 8:30 am, 9:30 am, 10:30 am (all in the Hillenbrand Dining Court atrium—no meal swipe required), and 4:30 pm (https://purdue-edu.zoom.us/my/mdward[synchronous online], recorded and posted later; This online meeting is also available to students participating in Seminar from other universities outside of Purdue). There is also an asynchronous class section. All the information you need to work on the projects each week will be provided online on the Thursday of the previous week, and we encourage you to get a head start on the projects before class time. Dr. Ward does not lecture during the class meetings. Instead, the seminar time is a good time to ask questions and get help from Dr. Ward, the T.A.s, and your classmates. Attendance is not required. The T.A.s will have many daytime and evening office hours throughout the week. - -=== Course Description - -The Data Mine is a supportive environment for students in any major and from any background who want to learn some data science skills. Students will have hands-on experience with computational tools for representing, extracting, manipulating, interpreting, transforming, and visualizing data, especially big data sets, and in effectively communicating insights about data. Topics include: the R environment, Python, visualizing data, UNIX, bash, regular expressions, SQL, XML and scraping data from the internet, as well as selected advanced topics, as time permits. - -=== Learning Outcomes - -By the end of the course, you will be able to: - -. Discover data science and professional development opportunities in order to prepare for a career. -. Explain the difference between research computing and basic personal computing data science capabilities in order to know which system is appropriate for a data science project. -. Design efficient search strategies in order to acquire new data science skills. -. Devise the most appropriate data science strategy in order to answer a research question. -. Apply data science techniques in order to answer a research question about a big data set. - -=== Mapping to Foundational Learning Outcome (FLO) = Information Literacy - -Note: The Data Mine has applied for the course seminar to satisfy the information literacy outcome, but this request is still under review by the university. This request has not yet been approved. - -. *Identify a line of inquiry that requires information, including formulating questions and determining the scope of the investigation.* In each of the 14 weekly projects, the scope is described at a high level at the very top of the project. Students are expected to tie their analysis on the individual weekly questions back to the stated scope. As an example of the stated scope in a project: Understanding how to use Pandas and be able to develop functions allows for a systematic approach to analyzing data. In this project, students will already be familiar with Pandas but will not (yet) know at the outset how to "develop functions" and take a "systematic approach" to solving the questions. 
Students are expected to comment on each question about how their "line of inquiry" and "formulation of the question" ties back to the stated scope of the project. As the seminar progresses past the first few weeks, and the students are being asked to tackle more complex problems, they need to identify which Python, SQL, R, and UNIX tools to use, and which statements and queries to run (this is "formulating the questions"), in order to get to analyze the data, derive the results, and summary the results in writing and visualizations ("determining the scope of the investigation"). -. *Locate information using effective search strategies and relevant information sources.* The Data Mine seminar progresses by increasing the complexity of the problems. The students are being asked to solve complex problems using data science tools. Students need to "locate information" within technical documentation, API documentation, online manuals, online discussions such as Stack Overflow, etc. Within these online resources, they need to determine the "relevant information sources" and apply these sources to solve the data analysis problem at hand. They need to understand the context, motivation, technical notation, nomenclature of the tools, etc. We enable students to practice this skill on every weekly project during the semester, and we provide additional resources, such as Piazza (an online discussion platform to interact with peers, teaching assistants, and the instructor), office hours throughout the week, and attending in-person or virtual seminar, for interaction directly with the instructor. -. *Evaluate the credibility of information. The students work together this objective in several ways.* They need evaluate and analyze the "credibility of information" and data from a wide array of resources, e.g., from the federal government, from Kaggle, from online repositories and archives, etc. Each project during the semester focuses attention on a large data repository, and the students need to understand the credible data, the missing data, the inaccurate data, the data that are outliers, etc. Some of the projects for students involve data cleansing efforts, data imputation, data standardization, etc. Students also need to validate, verify, determine any missing data, understand variables, correlation, contextual information, and produce models and data visualizations from the data under consideration. -. *Synthesize and organize information from different sources in order to communicate.* This is a key aspect of The Data Mine. In many of the student projects, they need to assimilate geospatial data, categorical and numerical data, textual data, and visualizations, in order to have a comprehensive data analysis of a system or a model. The students can use help from Piazza, office hours, the videos from the instructor and seminar live sessions to synthesize and organize the information they are learning about, in each project. The students often need to also understand many different types of tools and aspects of data analysis, sometimes in the same project, e.g., APIs, data dictionaries, functions, concepts from software engineering such as scoping, encapsulation, containerization, and concepts from spatial and temporal analysis. Synthesizing many "different sources" to derive and "communicate" the analysis is a key aspect of the projects. -. 
*Attribute original ideas of others through proper citing, referencing, paraphrasing, summarizing, and quoting.* In every project, students need to use "citations to sources" (online and written), "referencing" forums and blogs where their cutting-edge concepts are "documented", proper methods of "quotation" and "citation", documentation of any teamwork, etc. The students have a template for their project submissions in which they are required to provide the proper citation of any sources, collaborations, reference materials, etc., in each and every project that they submit every week. -. *Recognize relevant cultural and other contextual factors when using information.* Students weekly project include data and information on data about (all types of genders), political data, geospatial questions, online forums and rating schema, textual data, information about books, music, online repositories, etc. Students need to understand not only the data analysis but also the "context" in which the data is provided, the data sources, the potential usage of the analysis and its "cultural" implications, etc. Students also complete professional development, attending several professional development and outside-the-classroom events each semester. The meet with alumni, business professionals, data practitioners, data engineers, managers, scientists from national labs, etc. They attend events about the "culture related to data science", and "multicultural events". Students are required to respond in writing to every such event, and their writing is graded and incorporated into the grades for the course. -. *Observe ethical and legal guidelines and requirements for the use of published, confidential, and/or proprietary information.* Students complete an academic integrity quiz at the beginning of each semester that sets the stage of these "ethical and legal guidelines and requirements". They have documentation about proper data handling and data management techniques. They learn about the context of data usage, including (for instance) copyrights, the difference between open source and proprietary data, different types of software licenses, the need for confidentiality with Corporate Partners projects, etc. - -=== Assessment of Foundational Learning Outcome (FLO) = Information Literacy - -Note: The Data Mine has applied for the course seminar to satisfy the information literacy outcome, but this request is still under review by the university. This request has not yet been approved. - -. *Assessment method for this course.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. -. *Identify a line of inquiry that requires information, including formulating questions and determining the scope of the investigation.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. Students identify which R and Python statements and queries to run (this is formulating the questions), in order to get to the results they think they are looking for (determining the scope of the investigation). -. 
*Locate information using effective search strategies and relevant information sources.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. The students are being asked to solve complex problems using data science tools. They need to figure out what they are looking to figure out, and to do that they need to figure out what to ask. -. *Evaluate the credibility of information. Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning.* Each week, these projects are graded by teaching assistants based on solutions provided. Some of the projects that students complete in the course involve data cleansing efforts including validation, verification, missing data, and modeling and students must evaluate the credibility as they move through the project. -. *Synthesize and organize information from different sources in order to communicate.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. Information on how to complete the projects is learned through many sources and student utilize an experiential learning model. -. *Attribute original ideas of others through proper citing, referencing, paraphrasing, summarizing, and quoting.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided set and then questions about the data set that engage the student in experiential learning. At the beginning of each project there is a question regarding citations for the project. -. *Recognize relevant cultural and other contextual factors when using information.* Students are assigned a weekly project that usually includes a data set and then questions about the data set that engage the student in experiential learning. Each week, these projects are graded by teaching assistants based on solutions provided. For professional development event assessment – students are required to attend three approved events and then write a guided summary of the event. -. *Observe ethical and legal guidelines and requirements for the use of published, confidential, and/or proprietary information.* Students complete an academic integrity quiz at the beginning of each semester, and they are also graded on their proper documentation and usage of data throughout the semester, on every weekly project. - -=== Required Materials - -* A laptop so that you can easily work with others. Having audio/video capabilities is useful. -* Access to Brightspace, Gradescope, and Piazza course pages. -* Access to Jupyter Lab at the On Demand Gateway on Anvil: -https://ondemand.anvil.rcac.purdue.edu/ -* "The Examples Book": https://the-examples-book.com -* Good internet connection. - -=== Attendance Policy - -When conflicts or absences can be anticipated, such as for many University-sponsored activities and religious observations, the student should inform the instructor of the situation as far in advance as possible. 
- -For unanticipated or emergency absences when advance notification to the instructor is not possible, the student should contact the instructor as soon as possible by email or phone. When the student is unable to make direct contact with the instructor and is unable to leave word with the instructor’s department because of circumstances beyond the student’s control, and in cases falling under excused absence regulations, the student or the student’s representative should contact or go to the Office of the Dean of Students website to complete appropriate forms for instructor notification. Under academic regulations, excused absences may be granted for cases of grief/bereavement, military service, jury duty, parenting leave, and medical excuse. For details, see the link:https://catalog.purdue.edu/content.php?catoid=13&navoid=15965#a-attendance[Academic Regulations & Student Conduct section] of the University Catalog website. - -== How to succeed in this course - -If you would like to be a successful Data Mine student: - -* Start on the weekly projects on or before Mondays so that you have plenty of time to get help from your classmates, TAs, and Data Mine staff. Don’t wait until the due date to start! -* Be excited to challenge yourself and learn impressive new skills. Don’t get discouraged if something is difficult—you’re here because you want to learn, not because you already know everything! -* Remember that Data Mine staff and TAs are excited to work with you! Take advantage of us as resources. -* Network! Get to know your classmates, even if you don’t see them in an actual classroom. You are all part of The Data Mine because you share interests and goals. You have over 800 potential new friends! -* Use "The Examples Book" with lots of explanations and examples to get you started. Google, Stack Overflow, etc. are all great, but "The Examples Book" has been carefully put together to be the most useful to you. https://the-examples-book.com[the-examples-book.com] -* Expect to spend approximately 3 hours per week on the projects. Some might take less time, and occasionally some might take more. -* Don’t forget about the syllabus quiz, academic integrity quiz, and outside event reflections. They all contribute to your grade and are part of the course for a reason. -* If you get behind or feel overwhelmed about this course or anything else, please talk to us! -* Stay on top of deadlines. Announcements will also be sent out every Monday morning, but you should keep a copy of the course schedule where you see it easily. -* Read your emails! 
- - -== Information about the Instructors - -=== The Data Mine Staff - -[%header,format=csv] -|=== -Name, Title -Shared email we all read, datamine-help@purdue.edu -Kevin Amstutz, Senior Data Scientist -Ashley Arroyo, Data Science Technical Specialist -Donald Barnes, Guest Relations Administrator -Maggie Betz, Managing Director of The Data Mine at Indianapolis -Kimmie Casale, ASL Tutor -Bryce Castle, Corporate Partners Technical Specialist -Cai Chen, Corporate Partners Technical Specialist -Doug Crabill, Senior Data Scientist -Peter Dragnev, Corporate Partners Technical Specialist -Stacey Dunderman, Program Administration Specialist -Jessica Gerlach, Corporate Partners Technical Specialist -Dan Hirleman, Regional Director of The Data Mine of the Rockies -Jessica Jud, Interim Director of Partnerships -Kali Lacy, Associate Research Engineer -Gloria Lenfestey, Senior Financial Analyst -Nicholas Lenfestey, Corporate Partners Technical Specialist -Naomi Mersinger, ASL Interpreter / Strategic Initiatives Coordinator -Kim Rechkemmer, Senior Program Administration Specialist -Katie Sanders, Chief Operating Officer -Betsy Satchell, Senior Administrative Assistant -Diva Sharma, Corporate Partners Technical Specialist -Shakir Syed, Managing Director of Corporate Partnerships -Fulya Gökalp Yavuz, Director of Data Science -Dr. Mark Daniel Ward, Executive Director -|=== - -The Data Mine Team uses a shared email, which functions as a ticketing system. Using a shared email helps the team manage the influx of questions, better distribute questions across the team, and send out faster responses. -You can use the https://piazza.com[Piazza forum] to get in touch. In particular, Dr. Ward responds to questions on Piazza faster than by email. - -=== Communication Guidance - -* *For questions about how to do the homework, use Piazza or visit office hours*. You will receive the fastest response by using Piazza versus emailing us. -* For general Data Mine questions, email datamine-help@purdue.edu -* For regrade requests, use Gradescope's regrade feature within Brightspace. Regrades should be -requested within 1 week of the grade being posted. - - -=== Office Hours - -The xref:spring2025/logistics/office_hours.adoc[office hours schedule is posted here.] - -Office hours are held in person in Hillenbrand lobby and on Zoom. Check the schedule to see the available times. - -=== Piazza - -Piazza is an online discussion board where students can post questions at any time, and Data Mine staff or TAs will respond. Piazza is available through Brightspace. There are private and public postings. Last year we had over 11,000 interactions on Piazza, and the typical response time was around 5-10 minutes. - -== Assignments and Grades - -=== Course Schedule & Due Dates - -Click below to view the Spring 2025 Course Schedule: - -xref:spring2025/10200/10200-2025-projects.adoc[TDM 10200] - -xref:spring2025/20200/20200-2025-projects.adoc[TDM 20200] - -xref:spring2025/30200/30200-2025-projects.adoc[TDM 30200] - -xref:spring2025/40200/40200-2025-projects.adoc[TDM 40200] - -See the schedule and later parts of the syllabus for more details, but here is an overview of how the course works: - -In the first week of the semester, you will have some "housekeeping" tasks to do, which include taking the Syllabus quiz and Academic Integrity quiz. 
- -Generally, every week from the very beginning of the semester, you will have your new projects released on a Thursday, and they are due 8 days later on the following Friday at 11:55 pm Purdue West Lafayette (Eastern) time. This semester, there are 14 weekly projects, but we only count your best 10. This means you could miss up to 4 projects due to illness or other reasons, and it won’t hurt your grade. - -We suggest trying to do as many projects as possible so that you can keep up with the material. The projects are much less stressful if they aren’t done at the last minute, and it is possible that our systems will be stressed if you wait until Friday night causing unexpected behavior and long wait times. Try to start your projects on or before Monday each week to leave yourself time to ask questions. - -Outside of projects, you will also complete 3 Outside Event reflections. More information about these is in the "Outside Event Reflections" section below. -The Data Mine does not conduct or collect an assessment during the final exam period. Therefore, TDM Courses are not required to follow the Quiet Period in the https://catalog.purdue.edu/content.php?catoid=16&navoid=20089[Academic Calendar]. - -=== Projects - -* The projects will help you achieve Learning Outcomes #2-5. -* Each weekly programming project is worth 10 points. -* There will be 14 projects available over the semester, and your best 10 will count. -* The 4 project grades that are dropped could be from illnesses, absences, travel, family emergencies, or simply low scores. No excuses necessary. -* No late work will be accepted, even if you are having technical difficulties, so do not work at the last minute. -* There are many opportunities to get help throughout the week, either through Piazza or office hours. We’re waiting for you! Ask questions! -* Follow the instructions for how to submit your projects properly through Gradescope in Brightspace. -* It is ok to get help from others or online, although it is important to document this help in the comment sections of your project submission. You need to say who helped you and how they helped you. -* Each week, the project will be posted on the Thursday before the seminar, the project will be the topic of the seminar and any office hours that week, and then the project will be due by 11:55 pm Eastern time on the following Friday. See the schedule for specific dates. -* If you need to request a regrade on any part of your project, use the regrade request feature inside Gradescope. The regrade request needs to be submitted within one week of the grade being posted (we send an announcement about this). - -=== Outside Event Reflections - -* The Outside Event reflections will help you achieve Learning Outcome #1. They are an opportunity for you to learn more about data science applications, career development, and diversity. -* Throughout the semester, The Data Mine will have many special events and speakers, typically happening in person so you can interact with the presenter, but some may be online and possibly recorded. -* These eligible opportunities will be posted on The Data Mine’s website (https://datamine.purdue.edu/events/[datamine.purdue.edu/events/]) and updated frequently. Feel free to suggest good events that you hear about, too. -* You are required to attend 3 of these over the semester, with 1 due each month. See the schedule for specific due dates. -* You are welcome to do all 3 reflections early. For example, you could submit all 3 reflections in September. 
-* You must submit your outside event reflection within 1 week of attending the event or watching the recording. -* Follow the instructions on Brightspace for writing and submitting these reflections. -* At least one of these events should be on the topic of Professional Development. These events will be designated by "PD" next to the event on the schedule. -* This semester you will answer questions directly in Gradescope, including the name of the event and speaker, the time and date of the event, what was discussed at the event, what you learned from it, what new ideas you would like to explore as a result of what you learned at the event, and what question(s) you would like to ask the presenter if you met them at an after-presentation reception. This should not be just a list of notes you took from the event—it is a reflection. -* We read every single reflection! We care about what you write! We have used these connections to provide new opportunities for you, to thank our speakers, and to learn more about what interests you. - - -=== Late Work Policy - -We generally do NOT accept late work. For the projects, we count only your best 10 out of 14, so that gives you a lot of flexibility. We need to be able to post answer keys for the rest of the class in a timely manner, and we can’t do this if we are waiting for other students to turn their work in. - -=== Grade Distribution - -[cols="4,1"] -|=== - -|Projects (best 10 out of Projects #1-14) |86% -|Outside event reflections (3 total) |12% -|Academic Integrity Quiz |1% -|Syllabus Quiz |1% -|*Total* |*100%* - -|=== - - -=== Grading Scale - -In this class, grades reflect your achievement throughout the semester in the various course components listed above. Your grades will be maintained in Brightspace. This course will follow the 90-80-70-60 grading scale for A, B, C, D cut-offs. If you earn a 90.000 in the class, for example, that is a solid A. +/- grades will be given at the instructor's discretion below these cut-offs. If you earn an 89.11 in the class, for example, this may be an A- or a B. -* A: 100.000% - 90.000% -* B: 89.999% - 80.000% -* C: 79.999% - 70.000% -* D: 69.999% - 60.000% -* F: 59.999% - 0.000% - - - -=== Academic Integrity - -Academic integrity is one of the highest values that Purdue University holds. Individuals are encouraged to alert university officials to potential breaches of this value either by link:mailto:integrity@purdue.edu[emailing] or by calling 765-494-8778. While information may be submitted anonymously, the more information that is submitted provides the greatest opportunity for the university to investigate the concern. - -In TDM 10200/20200/30200/40200/50100, we encourage students to work together. However, there is a difference between good collaboration and academic misconduct. We expect you to read over this list, and you will be held responsible for violating these rules. We are serious about protecting the hard-working students in this course. We want a grade for The Data Mine seminar to have value for everyone and to represent what you truly know. We may punish both the student who cheats and the student who allows or enables another student to cheat. Punishment could include receiving a 0 on a project, receiving an F for the course, and having incidents of academic misconduct reported to the Office of The Dean of Students. - - -*Good Collaboration:* - -* First try the project yourself, on your own. 
-* After trying the project yourself, get together with a small group of other students who have also tried the project themselves to discuss ideas for how to do the more difficult problems. Document in the comments section any suggestions you took from your classmates or your TA. -* Finish the project on your own so that what you turn in truly represents your own understanding of the material. -* Look up potential solutions for how to do part of the project online, but document in the comments section where you found the information. -* If the assignment involves writing a long written explanation, you may proofread somebody’s completed written work and allow them to proofread your work. Do this only after you have both completed your own assignments, though. - -*Academic Misconduct:* - -* Dividing up the problems among a group. (You do #1, I’ll do #2, and he’ll do #3: then we’ll share our work to get the assignment done more quickly.) -* Attending a group work session without having first worked all of the problems yourself. -* Allowing your partners to do all of the work while you copy answers down, or allowing an unprepared partner to copy your answers. -* Letting another student copy your work or doing the work for them. -* Sharing files or typing on somebody else’s computer or in their computing account. -* Getting help from a classmate or a TA without documenting that help in the comments section. -* Looking up a potential solution online without documenting that help in the comments section. -* Reading someone else’s answers before you have completed your work. -* Having a tutor or TA work through all (or some) of your problems for you. -* Uploading, downloading, or using old course materials from Course Hero, Chegg, or similar sites. -* Using the same outside event reflection (or parts of it) more than once. Using an outside event reflection from a previous semester. -* Using somebody else’s outside event reflection rather than attending the event yourself. - - -The link:https://www.purdue.edu/odos/osrr/honor-pledge/about.html[Purdue Honor Pledge] reads: "As a Boilermaker pursuing academic excellence, I pledge to be honest and true in all that I do. Accountable together - we are Purdue." - -Please refer to the link:https://www.purdue.edu/odos/osrr/academic-integrity/index.html[student guide for academic integrity] for more details. - -=== xref:fall2023/logistics/syllabus_purdue_policies.adoc[Purdue Policies & Resources] - -=== Disclaimer -This syllabus is subject to small changes. All questions and feedback are always welcome! diff --git a/projects-appendix/modules/ROOT/pages/submissions.adoc b/projects-appendix/modules/ROOT/pages/submissions.adoc deleted file mode 100644 index 69d7f3c7b..000000000 --- a/projects-appendix/modules/ROOT/pages/submissions.adoc +++ /dev/null @@ -1,61 +0,0 @@ -= Submissions - -++++ - -++++ - -Unless otherwise specified, all projects *will only need 1 submitted file*: - -The `.ipynb` file (based on the provided template). See xref:submissions.adoc#how-to-download-notebook[here] to learn how to download the `.ipynb` file. - -[IMPORTANT] -==== -Output _must_ be displayed in the `.ipynb` file (unless otherwise specified). **Double check** that the output is correct _in Gradescope_. You are responsible for all work submitted in Gradescope. If your submission does not render properly, please contact a TA for help. - -You will be graded on the work and how it is rendered in Gradescope, _not_ how it renders on https://ondemand.anvil.rcac.purdue.edu. 
Please see xref:submissions.adoc#double-checking-submissions[here] to learn how to double check your submission and make sure it renders properly in Gradescope. -==== - -== How to download notebook - -First make sure you are in Jupyter Lab on https://ondemand.anvil.rcac.purdue.edu. It should look something like the following. - -image::figure32.webp[Jupyter Lab interface, width=792, height=500, loading=lazy, title="Jupyter Lab interface"] - -To download your notebook (`.ipynb` file), click on menu:File[Download] and select where you'd like to save the file. Then, when uploading your submission to Gradescope, simply upload the `.ipynb` file you _just_ downloaded. - -image::figure31.webp[How to download the notebook, width=792, height=500, loading=lazy, title="How to download the notebook"] - -== Double checking submissions - -In order to double check that your submission (namely, your `.ipynb` file) renders properly in Gradescope, first submit your project files in Gradescope. - -[TIP] -==== -Don't worry, you can submit as many times as you want. The graders will always see your most recent submission. -==== - -Once submitted, you should be presented with a screen that looks similar to the following. - -image::figure28.webp[Post submit screen, width=792, height=500, loading=lazy, title="Post submit screen"] - -Click on the button in the upper right-hand corner named "Code". - -image::figure29.webp[Click "Code", width=792, height=500, loading=lazy, title="Click Code"] - -You should be presented with the same screen that your grader sees. Look at your notebook carefully to make sure your solutions appear as you intended. - -image::figure30.webp[Double check rendered notebook, width=792, height=500, loading=lazy, title="Double check rendered notebook"] - -== How to make a Python file - -This video demonstrates how to make a Python file to submit along with the Jupyter Lab file for your projects. (This is not needed for most projects. For most projects, starting in fall 2024, only the Jupyter Lab ".ipynb" file is needed.) - -++++ - -++++ - - -[TIP] -==== -When uploading to Gradescope, make sure that you upload your `.ipynb` file for the project (and any other files requested), all at once. Gradescope will only remember the most recent upload, so you need to upload all of your files at one time, i.e., in one batch upload. -==== diff --git a/projects-appendix/modules/ROOT/pages/summer2020/think-summer-example-template-2020.Rmd b/projects-appendix/modules/ROOT/pages/summer2020/think-summer-example-template-2020.Rmd deleted file mode 100644 index db6fc5307..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2020/think-summer-example-template-2020.Rmd +++ /dev/null @@ -1,182 +0,0 @@ ---- -title: "Example 1 Extended" -output: - pdf_document: default - html_document: default ---- - -We first load the RMariaDB package. The installation line is commented out, because we assume that you have already installed this package while watching the analogous video. - -```{r} -# install.packages("RMariaDB") -library(RMariaDB) -``` - -Now we make a connection to the elections database on Scholar. - -```{r} -connection<-dbConnect(RMariaDB::MariaDB(), - host="scholar-db.rcac.purdue.edu", - db="elections", - user="elections_user", - password="Dataelect!98") -``` - -We query the campaign contributions made by people who work at Purdue University. - -```{r} -myDF <- dbGetQuery(connection, "SELECT * FROM elections WHERE employer='PURDUE UNIVERSITY'") -``` - -Here are the first six of these donations. 
- -```{r} -head(myDF) -``` - -Now we display the number of rows and columns in the result. -The number of rows is the number of donations made by people who work at Purdue University. -The number of columns is the number of variables that we have in this dataset. - -```{r} -dim(myDF) -``` - -Finally, we extract the cities where people made the donations. -We tabulate the results, and then sort this table. -Among the donations made by employees of Purdue University, -this shows the cities in which the donations were most frequently made. - -```{r} -sort(table(myDF$city)) -``` - -Here are some more examples from the elections database: - -Most of the Purdue employees (who are making donations) are from the State of Indiana. - -```{r} -sort(table(myDF$state)) -``` - -Here is a visualization of the number of donations made by Purdue University employees, grouped according to State. -We are only showing the 6 most popular States, by this measure. - -```{r} -dotchart(tail(sort(table(myDF$state)))) -``` - -Note: A warning appears, letting us know that R will treat the data as numeric. - -Making one simple change, namely, switching state to name, we can see who has made the largest number of donations. -Note: This is not the largest monetary amount of donations! -Instead, this is the greatest number of times that donations were made. - -```{r} -tail(sort(table(myDF$name))) -``` - -As before, we can plot this data. - -```{r} -dotchart(tail(sort(table(myDF$name)))) -``` - -With this plot in mind, we see why we put the names on the y-axis and the number of donations on the x-axis. -(If the names were on the x-axis, we would not have been able to squeeze all of the names in!) - -In the questions above, note that we made 1 SQL query (namely, to get the data from people who work at Purdue), -and we stored this data in a data frame called myDF. Then we did some analysis based on this data frame. - -Now we make another query. This time, we look up all of the donations across all of the years. -We sum the amounts of the donations, grouping them according to the state where the donor lives. - -```{r} -myDF <- dbGetQuery(connection, "SELECT SUM(transaction_amt), state FROM elections GROUP BY state") -myDF -``` - -In our first SQL query, we retrieved all of the information about all of the variables that met our criteria. -The fact that we wanted all variables in the first query is signified by the "*". In general, when we see a "*" in data science, it means that we want all such items or results. - -In this second SQL query, however, we only extract two variables, namely, the sum of the transaction amounts, and the states. -We also group the results from the SQL query according to the state where the donor lives. - -Notice that the results have 285 rows and 2 columns. - -```{r} -head(myDF) -dim(myDF) -``` - -The first column is the sum of the transactions from the state, and the second column is the state. -There are many erroneous states. - -If we save the sum of transaction amounts as v and give each element of v the name of that state - -```{r} -v <- myDF$`SUM(transaction_amt)` -names(v) <- myDF$state -``` - -then we are ready to sort the data, as we did before. - -```{r} -sort(v) -``` - -and this looks reasonable. The greatest monetary donations, altogether, came from California, New York, Texas, etc. - -We can plot this data. The y-axis shows the states, and the x-axis shows the total amount of donations, given in dollars. 
- -```{r} -dotchart(tail(sort(v))) -``` - -If you want to just see (for instance) the first 10 rows of a database table and all of the variables, you can limit the results. - -```{r} -myDF <- dbGetQuery(connection, "SELECT * FROM elections LIMIT 10") -myDF -``` - -Warning: Please note that you probably do NOT want to try this: If we removed the LIMIT 10, then we would pull all of the data from the entire database table, and that would take a very long time! That is not recommended. - -We can also check the employers, to compute the monetary amount of donations made by the employees of each company. - -Warning: "Here be dragons!" The new query might take (say) 15 or 20 minutes to run. - -```{r} -myDF <- dbGetQuery(connection, "SELECT SUM(transaction_amt), employer FROM elections GROUP BY employer") -``` - -There are 4.4 million employers listed in the database: - -```{r} -dim(myDF) -``` - -Again, we save the sum of transaction amounts as v and give each element of v the name of that employer - -```{r} -v <- myDF$`SUM(transaction_amt)` -names(v) <- myDF$employer -``` - -then we are ready to sort the data, as we did before. There are too many results to see the whole list, so we look at the tail of the sorted results. - -```{r} -tail(sort(v)) -``` - -and this looks reasonable. The greatest monetary donations, altogether, were either listed without an employer, or retired, or self-employed, or not employed, or N/A, or from Bloomberg Inc. - -If we want to see a few more results, we can specify how many results we want to see in the tail command. - -```{r} -tail(sort(v), n=30) -``` - -Notice that there are lots of missing data values, and data for which the employer was not listed. -This is common with real world data, and it is something to get used to, when working on data analysis. - diff --git a/projects-appendix/modules/ROOT/pages/summer2020/think-summer-example-template-2020.pdf b/projects-appendix/modules/ROOT/pages/summer2020/think-summer-example-template-2020.pdf deleted file mode 100644 index 246b1e055..000000000 Binary files a/projects-appendix/modules/ROOT/pages/summer2020/think-summer-example-template-2020.pdf and /dev/null differ diff --git a/projects-appendix/modules/ROOT/pages/summer2020/think_summer_project_template.ipynb b/projects-appendix/modules/ROOT/pages/summer2020/think_summer_project_template.ipynb deleted file mode 100644 index 411122592..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2020/think_summer_project_template.ipynb +++ /dev/null @@ -1,190 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "be02a957-7133-4d02-818e-fedeb3cecb05", - "metadata": {}, - "source": [ - "# Project X -- [First Name] [Last Name]" - ] - }, - { - "cell_type": "markdown", - "id": "a1228853-dd19-4ab2-89e0-0394d7d72de3", - "metadata": {}, - "source": [ - "**TA Help:** John Smith, Alice Jones\n", - "\n", - "- Help with figuring out how to write a function.\n", - " \n", - "**Collaboration:** Friend1, Friend2\n", - " \n", - "- Helped figuring out how to load the dataset.\n", - "- Helped debug error with my plot." 
- ] - }, - { - "cell_type": "markdown", - "id": "6180e742-8e39-4698-98ff-5b00c8cf8ea0", - "metadata": {}, - "source": [ - "## Question 1" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "49445606-d363-41b4-b479-e319a9a84c01", - "metadata": {}, - "outputs": [], - "source": [ - "# code here" - ] - }, - { - "cell_type": "markdown", - "id": "b456e57c-4a12-464b-999a-ef2df5af80c1", - "metadata": {}, - "source": [ - "Markdown notes and sentences and analysis written here." - ] - }, - { - "cell_type": "markdown", - "id": "fc601975-35ed-4680-a4e1-0273ee3cc047", - "metadata": {}, - "source": [ - "## Question 2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a16336a1-1ef0-41e8-bc7c-49387db27497", - "metadata": {}, - "outputs": [], - "source": [ - "# code here" - ] - }, - { - "cell_type": "markdown", - "id": "14dc22d4-ddc3-41cc-a91a-cb0025bc0c80", - "metadata": {}, - "source": [ - "Markdown notes and sentences and analysis written here." - ] - }, - { - "cell_type": "markdown", - "id": "8e586edd-ff26-4ce2-8f6b-2424b26f2929", - "metadata": {}, - "source": [ - "## Question 3" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bbe0f40d-9655-4653-9ca8-886bdb61cb91", - "metadata": {}, - "outputs": [], - "source": [ - "# code here" - ] - }, - { - "cell_type": "markdown", - "id": "47c6229f-35f7-400c-8366-c442baa5cf47", - "metadata": {}, - "source": [ - "Markdown notes and sentences and analysis written here." - ] - }, - { - "cell_type": "markdown", - "id": "da22f29c-d245-4d2b-9fc1-ca14cb6087d9", - "metadata": {}, - "source": [ - "## Question 4" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8cffc767-d1c8-4d64-b7dc-f0d2ee8a80d1", - "metadata": {}, - "outputs": [], - "source": [ - "# code here" - ] - }, - { - "cell_type": "markdown", - "id": "0d552245-b4d6-474a-9cc9-fa7b8e674d55", - "metadata": {}, - "source": [ - "Markdown notes and sentences and analysis written here." - ] - }, - { - "cell_type": "markdown", - "id": "88c9cdac-3e92-498f-83fa-e089bfc44ac8", - "metadata": {}, - "source": [ - "## Question 5" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d370d7c9-06db-42b9-b75f-240481a5c491", - "metadata": {}, - "outputs": [], - "source": [ - "# code here" - ] - }, - { - "cell_type": "markdown", - "id": "9fbf00fb-2418-460f-ae94-2a32b0c28952", - "metadata": {}, - "source": [ - "Markdown notes and sentences and analysis written here." - ] - }, - { - "cell_type": "markdown", - "id": "f76442d6-d02e-4f26-b9d6-c3183e1d6929", - "metadata": {}, - "source": [ - "## Pledge\n", - "\n", - "By submitting this work I hereby pledge that this is my own, personal work. I've acknowledged in the designated place at the top of this file all sources that I used to complete said work, including but not limited to: online resources, books, and electronic communications. I've noted all collaboration with fellow students and/or TA's. I did not copy or plagiarize another's work.\n", - "\n", - "> As a Boilermaker pursuing academic excellence, I pledge to be honest and true in all that I do. Accountable together – We are Purdue." 
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "think-summer", - "language": "python", - "name": "think-summer" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.5" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-Project3Solutions.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-Project3Solutions.adoc deleted file mode 100644 index 36655a05b..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-Project3Solutions.adoc +++ /dev/null @@ -1,125 +0,0 @@ -= Think Summer: Project 3 Solutions -- 2022 - -[source,sql] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -== 1. Examine all six of the tables in the imdb database: akas, crew, episodes, people, ratings, titles. Identify each of the primary keys in each table, and identify each of the foreign keys in each table. - -We just look at each table, and see which keys are primary and which keys are foreign. - -In the `titles` table, the `title_id` is a primary key - -[source,sql] ----- -%%sql -SELECT * FROM titles LIMIT 5; ----- - -In the `ratings` table, the `title_id` is a foreign key - -[source,sql] ----- -%%sql -SELECT * FROM ratings LIMIT 5; ----- - -In the `akas` table, the `title_id` is a foreign key - -[source,sql] ----- -%%sql -SELECT * FROM akas LIMIT 5; ----- - -In the `people` table, the `person_id` is a primary key - -[source,sql] ----- -%%sql -SELECT * FROM people LIMIT 5; ----- - -In the `crew` table, the `title_id` and `person_id` are foreign keys - -[source,sql] ----- -%%sql -SELECT * FROM crew LIMIT 5; ----- - -In the `episodes` table, the `episode_title_id` and `show_title_id` are foreign keys - -[source,sql] ----- -%%sql -SELECT * FROM episodes LIMIT 5; ----- - -== 2. Write a SQL query to confirm that the `title_id` "tt0413573" does indeed belong to Grey's Anatomy. - -We just query the `titles` table and look for that `title_id` - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE title_id = 'tt0413573' LIMIT 5; ----- - -The `title_id` of Dr Ward's favorite show is "tt0108778" - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE title_id = 'tt0108778' LIMIT 5; ----- - -== 3. Write a query that gets a list of all of the `episode_title_id`s (found in the `episodes` table), with the associated `primary_title` (found in the `titles` table) for each episode of Grey's Anatomy. - -We use the same technique that we did yesterday with the Friends episodes. - -[source,sql] ----- -%%sql -SELECT * FROM episodes AS e JOIN titles AS t ON e.show_title_id = t.title_id -JOIN titles AS u ON e.episode_title_id = u.title_id -WHERE show_title_id = 'tt0413573' LIMIT 5; ----- - -== 4. Write a query that prints the `primary_title`, `rating`, and `votes` for all films with a rating of at least 8 and at least 50000 votes. Like in the previous version of this question, limit your output to 15 results. - -We just join the `titles` table on the shared `title_id`, and `SELECT` the 3 variables needed. - -[source,sql] ----- -%%sql -SELECT t.primary_title, r.rating, r.votes FROM ratings AS r JOIN titles AS t ON r.title_id = t.title_id WHERE (r.rating >= 8) AND (r.votes >= 50000) LIMIT 15; ----- - -== 5. 
Write a query that returns a list of just `episode_title_ids` (found in the `episodes` table), with the associated `primary_title` (found in the `titles` table) for each episode. - -We did this already for the Friends show: - -[source,sql] ----- -%%sql -SELECT e.episode_title_id, u.primary_title FROM episodes AS e JOIN titles AS t ON e.show_title_id = t.title_id -JOIN titles AS u ON e.episode_title_id = u.title_id -WHERE show_title_id = 'tt0108778' LIMIT 5; ----- - -== 6. Write a query that adds the rating to the end of each episode. To do so, use the query you wrote in (5) as a subquery. Which episode has the highest rating? Is it also your favorite episode? - -We just join the `ratings` table and sort by the highest ratings. It is not surprising that the last episode of Friends has the highest rating. - -[source,sql] ----- -%%sql -SELECT e.episode_title_id, u.primary_title, r.rating FROM episodes AS e JOIN titles AS t ON e.show_title_id = t.title_id -JOIN titles AS u ON e.episode_title_id = u.title_id -JOIN ratings AS r ON e.episode_title_id = r.title_id -WHERE show_title_id = 'tt0108778' ORDER BY r.rating DESC LIMIT 5; ----- - - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-Project4Solutions.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-Project4Solutions.adoc deleted file mode 100644 index c648b63a8..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-Project4Solutions.adoc +++ /dev/null @@ -1,191 +0,0 @@ -= Think Summer: Project 4 Solutions -- 2022 - -[source,sql] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -== 1. In the titles table, the premiered column specifies the year that a movie was premiered. Use COUNT to find how many movies premiered in each year in the database, in a single query. - -We can group by the year `premiered` and use the `COUNT` function. - -[source,sql] ----- -%%sql -SELECT COUNT(premiered), premiered FROM titles GROUP BY premiered LIMIT 15; ----- - -== 2. Use aliasing to rename the results of the COUNT function, so that rather than being labeled COUNT(*), the column appears as movies premiered. While it can be interesting to see the number of movies premiering long ago, perhaps we don’t need to see all of this information. Edit your query to only include movies from 1970+. - -We put a condition to limit the results to those with `premiered` from 1970 onwards, and we rename the column resulting from the `COUNT`: - -[source,sql] ----- -%%sql -SELECT COUNT(premiered) AS 'movies premiered', premiered FROM titles WHERE premiered >= 1970 GROUP BY premiered LIMIT 15; ----- - -== 3 part 1. First, write a query that gets the `episode_title_id` and `season_number` for every episode of our TV show. - -We use the suggested `show_title_id` and extract the `episode_title_id` and the `season_number` - -[source,sql] ----- -%%sql -SELECT episode_title_id, season_number FROM episodes WHERE show_title_id = 'tt0413573' LIMIT 5; ----- - -== 3 part 2. - -Next, use your query from part (1) as a sub-query, and get the `season_number`s for the seasons with an average `rating` of at least 8.0. The result should be a single column (`season_number`) with 10 values (if you are using `title_id` `tt0413573`). - -We extract the `season_number` for those seasons that have average rating 8 or higher. We `GROUP BY` the `season_number` and we use `HAVING` to ensure that `AVG(r.rating)` is 8 or more. 
- -[source,sql] ----- -%%sql -SELECT s.season_number FROM - -(SELECT episode_title_id, season_number FROM episodes WHERE show_title_id = 'tt0413573') AS s - -JOIN ratings r ON s.episode_title_id = r.title_id - -GROUP BY s.season_number -HAVING AVG(r.rating) >= 8; ----- - -== 3 part 3. Write a query that gets the `episode_number`, `season_number`, `primary_title`, and `title_id` for the TV show with your `title_id` (for example, `tt0413573`). Make sure to order the results first by `season_number` and then by `episode_number` - -We select these 4 variables, joining the `episodes` and `titles` tables, and ordering by the `season_number` and `episode_number`. - -[source,sql] ----- -%%sql -SELECT episode_number, season_number, primary_title, title_id - -FROM episodes AS e JOIN titles AS t -ON e.episode_title_id = t.title_id - -WHERE show_title_id = 'tt0413573' -ORDER BY season_number, episode_number LIMIT 15; ----- - -== 3 part 4. At this stage there are only 2 missing components to our query from part (3). First is the fact that all episodes from all seasons are returned. To address this, use logical `AND` and the `IN` operator to limit the returned episodes from your part (3) query to only the `season_number`s returned in your part (2) query. - - -We add an additional `AND` into the `WHERE` from the part 3 query, using `e.season_number IN` and checking to see whether this season number is in the subquery from part 2. - - -[source,sql] ----- -%%sql -SELECT e.episode_number, e.season_number, t.primary_title, t.title_id - -FROM episodes AS e JOIN titles AS t -ON e.episode_title_id = t.title_id - -WHERE show_title_id = 'tt0413573' - -AND e.season_number IN (SELECT s.season_number FROM - (SELECT episode_title_id, season_number FROM episodes WHERE show_title_id = 'tt0413573') AS s - JOIN ratings r ON s.episode_title_id = r.title_id - GROUP BY s.season_number - HAVING AVG(r.rating) >= 8) - -ORDER BY season_number, episode_number LIMIT 15; ----- - -== 3 part 5. Finally, the last missing component is the individual rating for each episode. Simply start with your query from part (4), and perform a join with the ratings table to get the rating for each episode. - -We join the `ratings` table, matching the `episode_title_id` from the `episodes` table with the `title_id` from the `ratings` table. - -[source,sql] ----- -%%sql -SELECT e.episode_number, e.season_number, t.primary_title, t.title_id, r.rating - -FROM episodes AS e JOIN titles AS t -ON e.episode_title_id = t.title_id - -JOIN ratings as r -ON e.episode_title_id = r.title_id - -WHERE show_title_id = 'tt0413573' -AND e.season_number IN (SELECT s.season_number FROM - (SELECT episode_title_id, season_number FROM episodes WHERE show_title_id = 'tt0413573') AS s - JOIN ratings r ON s.episode_title_id = r.title_id - GROUP BY s.season_number - HAVING AVG(r.rating) >= 8) - -ORDER BY season_number, episode_number LIMIT 15; ----- - -== Switching gears - -Now we switch from SQL to R. - -[source,R] ----- -%%R -library(data.table) -myDF <- fread("/anvil/projects/tdm/data/flights/subset/2005.csv") ----- - - -== 4. Use R to solve this question. (This question does not need a tapply.) What was the most popular day to travel in 2005, in terms of the total number of flights? What was the least popular day to travel? - -We paste together the `Year`, `Month`, and `DayofMonth`, and then tabulate the results using `table`. Then we `sort` the results and look at the most popular and least popular days to travel. 
- -The most popular day to travel is August 5, and the least popular day to travel is November 24. - -[source,R] ----- -%%R -head(sort(table(paste(myDF$Year, myDF$Month, myDF$DayofMonth)), decreasing=T)) ----- - -[source,R] ----- -%%R -head(sort(table(paste(myDF$Year, myDF$Month, myDF$DayofMonth)))) ----- - -== 5. Which airplane (listed by TailNum) flew the most miles altogether in 2005? - -We sum the mileage (i.e., the `Distance`) of the flights according to the `TailNum`, and we see that the airplane with `TailNum` `N550JB` flew the most miles, namely, more than 2 million miles. We also note that a lot of flights without a tail number listed are in the data set. - -[source,R] ----- -%%R -head(sort(tapply(myDF$Distance, myDF$TailNum, sum), decreasing=T)) ----- - -== 6. Among the three big New York City airports `(JFK, LGA, EWR)`, which of these airports had the worst `DepDelay` (on average) in 2005? (Can you solve this with 1 line of R, using a `tapply`, rather than 3 lines of R? Hint: After you run the tapply, you can index your results using `[c("JFK", "LGA", "EWR")]` to look up all 3 airports at once.) - -We take the average of the `DepDelay`, split according to the `Origin`, and we remove the missing values. - -`JFK` has a 10.7 minute delay (on average). - -`LGA` has a 9.5 minute delay (on average). - -`EWR` has a 12.7 minute delay (on average). - -So `EWR` had the worst average departure delay among the three airports. - -[source,R] ----- -%%R -sort(tapply(myDF$DepDelay, myDF$Origin, mean, na.rm=T), decreasing=T)[c("JFK", "LGA", "EWR")] ----- - -== 7. Which flight path (i.e., which Origin-to-Dest pair) has the longest average departure delay? - -We find the average departure delays, split according to the Origin-to-Dest pairs, and we remove the missing values. We see that the flight path from `PIT` to `AVP` has a 345 minute departure delay (on average). - -FYI, there was only 1 flight from `PIT` to `AVP`, so this is something of an anomaly! - -[source,R] ----- -%%R -head(sort(tapply( myDF$DepDelay, paste(myDF$Origin, myDF$Dest), mean, na.rm=T), decreasing=T)) ----- - - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-more-SQL-examples-to-add-to-notes.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-more-SQL-examples-to-add-to-notes.adoc deleted file mode 100644 index 46d462222..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-more-SQL-examples-to-add-to-notes.adoc +++ /dev/null @@ -1,122 +0,0 @@ -Use `LIKE` and/or R to get a count of how many movies (type='movie') start with each letter of the alphabet. Can you think of another way to do this? If so, show us, and explain what you did! - -You can read more about https://www.w3resource.com/sqlite/core-functions-like.php[SQLite LIKE]. - -Here are two ways that you can get the number of titles, according to their first letter: - -We can use the `LIKE` function in `SQL` (for instance, a condition such as `primary_title LIKE 'A%'` matches the titles that start with the letter A, and we would repeat this for each letter), or we can use the `substr` function in SQL, which gets a substring from the string. In this case, I used it to just get the first letter of the `primary_title`. - -[source,sql] ----- -%%sql -SELECT SUBSTR(primary_title,1,1), COUNT(SUBSTR(primary_title,1,1)) FROM titles GROUP BY SUBSTR(primary_title,1,1) ORDER BY COUNT(SUBSTR(primary_title,1,1)) DESC LIMIT 5; ----- - - -=== Question 1 - -A primary key is a field in a table which uniquely identifies a row in the table. Primary keys must be unique values. This is enforced at the database level. - -A foreign key is a field whose value matches a primary key in a different table. 
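-
-To make the distinction concrete, here is a small illustrative sketch. These are hypothetical tables (not part of the `imdb` database, and not meant to be run against it) showing how a foreign key in one table references a primary key in another:
-
-[source,sql]
-----
--- hypothetical example tables, for illustration only
-CREATE TABLE shows (
-    show_id TEXT PRIMARY KEY,   -- primary key: uniquely identifies each row in shows
-    name    TEXT
-);
-
-CREATE TABLE show_episodes (
-    episode_id TEXT PRIMARY KEY,   -- primary key of this table
-    show_id    TEXT,               -- foreign key: its values match shows.show_id
-    FOREIGN KEY (show_id) REFERENCES shows (show_id)
-);
-----
-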
- -Examine all six of the tables in the `imdb` database: -`akas`, `crew`, `episodes`, `people`, `ratings`, `titles`. -Identify each of the primary keys in each table, and identify each of the foreign keys in each table. - -// ==== - - -**Relevant topics:** https://www.geeksforgeeks.org/difference-between-primary-key-and-foreign-key/[primary and foreign keys] - -.Items to submit -==== -- List any primary keys in the tables. _(1 pt)_ -- List any foreign keys in the tables. _(1 pt)_ -- Any code you used to answer this question. -==== - -=== Question 2 - -If you paste a `title_id` to the end of the following url, it will pull up the page for the title. For example, https://www.imdb.com/title/tt0413573 leads to the page for the TV series Grey's Anatomy. Write a SQL query to confirm that the `title_id` "tt0413573" does indeed belong to Grey's Anatomy. Then browse https://imdb.com and find your favorite TV show. Get the `title_id` from the url of your favorite TV show, and run the following query to confirm that the TV show is in our database. - -[source, sql] ----- -SELECT * FROM titles WHERE title_id=''; ----- - -[IMPORTANT] -Make sure to replace "<title id here>" with the `title_id` of your favorite show. If your show does not appear, or has only a single season, pick another show until you find one we have in our database (that has multiple seasons). - -**Relevant topics:** xref:programming-languages:SQL:index.adoc[SQL], xref:programming-languages:SQL:queries.adoc[queries] - -.Items to submit -==== -- SQL query used to confirm that `title_id` "tt0413573" does indeed belong to Grey's Anatomy. _(.5 pts)_ -- The output of the query. _(.5 pt)_ -- The `title_id` of your favorite TV show. _(.5 pts)_ -- SQL query used to confirm the `title_id` for your favorite TV show. _(.5 pts)_ -==== - -=== Question 3 - -The `episode_title_id` column in the `episodes` table references titles of individual episodes of a TV series. The `show_title_id` references the titles of the show itself. With that in mind, write a query that gets a list of all of the `episodes_title_id`'s (found in the `episodes` table), with the associated `primary_title` (found in the `titles` table) for each episode of Grey's Anatomy. - -[TIP] -https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_uhg3atol&flashvars%5BstreamerType%5D=auto&flashvars%5BlocalizationCode%5D=en&flashvars%5BleadWithHTML5%5D=true&flashvars%5BsideBarContainer.plugin%5D=true&flashvars%5BsideBarContainer.position%5D=left&flashvars%5BsideBarContainer.clickToClose%5D=true&flashvars%5Bchapters.plugin%5D=true&flashvars%5Bchapters.layout%5D=vertical&flashvars%5Bchapters.thumbnailRotator%5D=false&flashvars%5BstreamSelector.plugin%5D=true&flashvars%5BEmbedPlayer.SpinnerTarget%5D=videoHolder&flashvars%5BdualScreen.plugin%5D=true&flashvars%5BKaltura.addCrossoriginToIframe%5D=true&&wid=1_wmo98brv[This video] demonstrates how to extract titles of episodes in the `imdb` database. - -**Relevant topics:** xref:programming-languages:SQL:index.adoc[SQL], xref:programming-languages:SQL:queries.adoc[queries], xref:programming-languages:SQL:joins.adoc[joins] - -.Items to submit -==== -- SQL query used to answer the question. _(3 pts)_ -- Output from running the SQL query. _(2 pts)_ -==== - -=== Question 4 - -Joins are a critical concept to understand. They appear everywhere where relational data is found. In R, the `merge` function performs the same operations as joins. 
In python's `pandas` package the `merge` method for the `DataFrame` object performs the same operations. Take some time to read xref:programming-languages:SQL:joins.adoc[this section]. - -In question 2 from the previous project, we asked you to use the `ratings` table to discover how many films have a rating of at least 8 and at least 50000 votes. You may have noticed, while you can easily do that, the end result is not human understandable. We see that there are films with those features but we don't know what film `title_id` "tt0010323" is for. This is a great example where a simple join can answer this question for us. - -Write a query that prints the `primary_title`, `rating`, and `votes` for all films with a rating of at least 8 and at least 50000 votes. Like in the previous version of this question, limit your output to 15 results. - -**Relevant topics:** xref:programming-languages:SQL:index.adoc[SQL], xref:programming-languages:SQL:queries.adoc[queries], xref:programming-languages:SQL:joins.adoc[joins] - -.Items to submit -==== -- SQL query used to answer the question. _(3 pts)_ -- Output from running the SQL query. -==== - -[WARNING] -==== -The following are challenge questions and are worth 0 points. If you get done early give them a try! -==== - -=== Question 5 - -We want to write a query that returns the title and rating of the highest rated episode of your favorite TV show, which you chose in <<question-2, question 2>>. In order to do so, we will break the task into two parts in (5) and (6). First, write a query that returns a list of _just_ `episode_title_ids` (found in the `episodes` table), with the associated `primary_title` (found in the `titles` table) for each episode. - -**Relevant topics:** xref:programming-languages:SQL:index.adoc[SQL], xref:programming-languages:SQL:queries.adoc[queries], xref:programming-languages:SQL:joins.adoc[joins] - -.Items to submit -==== -- SQL query used to answer the question. -- Output from running the SQL query. -==== - -=== Question 6 - -Write a query that adds the rating to the end of each episode. To do so, use the query you wrote in (5) as a subquery. Which episode has the highest rating? Is it also your favorite episode? - -**Relevant topics:** xref:programming-languages:SQL:index.adoc[SQL], xref:programming-languages:SQL:queries.adoc[queries], xref:programming-languages:SQL:joins.adoc[joins] - -.Items to submit -==== -- SQL query used to answer the question. -- Output from running the SQL query. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-project-04.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-project-04.adoc deleted file mode 100644 index b2e0f22b7..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-project-04.adoc +++ /dev/null @@ -1,135 +0,0 @@ -=== Question 1 - -Aggregate functions like `COUNT`, `AVG`, `SUM`, `MIN`, and `MAX` are very useful. In particular, running queries like the following are great. - -[source, sql] ----- -SELECT COUNT(*) FROM titles WHERE premiered = 2008; ----- - -However, in this scenario we want to know how many movies premiered in 2008. How often would we rather just see these numbers for _every_ year, rather than 1 year at a time? This is where aggregate functions really start to have more power. - -In the `titles` table, the `premiered` column specifies the year that a movie was premiered. Use `COUNT` to find how many movies premiered in each year in the database, in a single query. 
- -[IMPORTANT] -Use **only** SQL to answer this question. - -[NOTE] -If you feel like practicing your R skills, try and solve this using R instead of SQL (for 0 points). - -**Relevant topics:** xref:programming-languages:SQL:index.adoc[SQL], xref:programming-languages:SQL:queries.adoc[queries], xref:programming-languages:SQL:aggregate-functions.adoc[aggregate functions] - -.Items to submit -==== -- SQL query used to solve the question. _(1.5 pts)_ -- Output from running the code. _(.5 pts)_ -==== - -=== Question 2 - -In <<question-1, question (1)>>, we have an example that starts to demonstrate how those simple aggregate functions are really quite powerful. The results, however, do have some ways that they could be improved. Improve your solution to <<question-1, question (1)>> in the following ways: - -. Use xref:programming-languages:SQL:aliasing.adoc[aliasing] to rename the results of the `COUNT` function, so that rather than being labeled `COUNT(*)`, the column appears as `movies premiered`. -. While it can be interesting to see the number of movies premiering long ago, perhaps we don't need to see all of this information. Edit your query to only include movies from 1970+. - -[IMPORTANT] -Use **only** SQL to answer this question. - -**Relevant topics:** xref:programming-languages:SQL:index.adoc[SQL], xref:programming-languages:SQL:queries.adoc[queries], xref:programming-languages:SQL:aggregate-functions.adoc[aggregate functions] - -.Items to submit -==== -- SQL query used to solve the question. _(1.5 pts)_ -- Output from running the code. _(.5 pts)_ -==== - -=== Question 3 - -Let's try to combine a little bit of everything we've learned so far. In the previous project, you picked a TV series to perform queries on. Use that same TV series (or, if you don't want to choose a TV series, title_id "tt0413573" is a good one to use) for this question. - -We want to get the `episode_number`, `season_number`, `primary_title`, `title_id`, and `rating` of every episode of your TV series for _only_ seasons where the average `rating` was at least X, in a single query. - -This will be a large query with multiple joins, and sub-queries. For this reason, we will break this down into parts, each worth some points. - -==== Part 1 - -First, write a query that gets the `episode_title_id` and `season_number` for every episode of our TV show. - -==== Part 2 - -Next, use your query from <<part-1, part (1)>> as a sub-query, and get the `season_number`s for the seasons with an average `rating` of at least 8.0. The result should be a single column (`season_number`) with 10 values (if you are using title_id "tt0413573"). - -[IMPORTANT] -==== -Remember that a TV show may have an overall rating _and_ individual episode ratings. For example, for Grey's Anatomy, you can get the overall rating by running this query. - -[source, sql] ----- -SELECT rating FROM ratings WHERE title_id = 'tt0413573'; ----- - -But, we want you to get the average rating, by season. -==== - -==== Part 3 - -Write a query that gets the `episode_number`, `season_number`, `primary_title`, and `title_id` for the TV show with your title_id (for example, "tt0413573"). Make sure to order the results first by `season_number` and then by `episode_number`. - -==== Part 4 - -At this stage there are only 2 missing components to our query from <<part-3, part (3)>>. First is the fact that _all_ episodes from _all_ seasons are returned. 
To address this, use logical `AND` and the `IN` operator to limit the returned episodes from your <<part-3, part (3) query>> to only the `season_number`s returned in your <<part-2, part (2) query>>. - -[TIP] -==== -This may _sound_ difficult, but it isn't! Start with your <<part-3, part (3) query>>, and tack on `AND <column name> IN (<sub query>)`. Of course, you need to fill in `<column name>` with the correct column name, and `<sub query>` with our <<part-2, part (2) query>>. -==== - -==== Part 5 - -Finally, the last missing component is the individual `rating` for each episode. Simply start with your query from <<part-4, part (4)>>, and perform a join with the `ratings` table to get the `rating` for each episode. - -In addition, the `rating` isn't available in our query from <<part-3, part (3)>>. - -**Relevant topics:** - -.Items to submit -==== -- SQL queries for each of parts 1 - 5. _(.5 pts each)_ -- Output from running queries from each of parts 1 - 5. _(.5 pts each)_ -==== - -[TIP] -==== -Use the `tapply` function in R to solve the Questions 5, 6, 7 below. -==== - -=== Question 7 - -Which flight path (i.e., which Origin-to-Dest pair) has the longest average departure delay? - - -From 2021 Project 4: - -=== Question 3 - -Who used `LIMIT` and `ORDER BY` to update your query from <<question-2, question (2)>>? While that is one way to solve that question, the more robust way would be to use the `HAVING` clause. Use `HAVING` to limit the query to only include movies premiering in 1970+. - -**Relevant topics:** - -.Items to submit -==== -- SQL query used to solve the question. _(.5 pts)_ -- Output from running the code. _(.5 pts)_ -==== - - - - - - - - - - - - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-project-05.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-project-05.adoc deleted file mode 100644 index f8a3c7e37..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-project-05.adoc +++ /dev/null @@ -1,64 +0,0 @@ -=== Question 1 - -What is the `primary_title` of a TV Series, movie, short, etc. (any `type` in the `titles` table), that has been most widely distributed. Of course, we don't _really_ have the information we need to answer this question, however, let's consider the most widely distributed piece of film to be the `title_id` that appears most in the `akas` table. - -**Relevant topics:** xref:programming-languages:SQL:index.adoc[SQL], xref:programming-languages:SQL:queries.adoc[queries], xref:programming-languages:SQL:joins.adoc[joins], xref:programming-languages:SQL:aggregate-functions.adoc[aggregate functions] - -.Items to submit -==== -- SQL used to solve the problem. -- Output from running the code. -==== - -=== Question 2 - -What is the average rating of movies (specifically, when `titles` table `type` is `movie`), with at least 10,000 `votes` (from the `ratings` table) by year in which they premiered? Use SQL in combination with R to answer this question and create a graphic that illustrates the ratings. Do you notice any trends? If so, what? - -**Relevant topics:** xref:programming-languages:SQL:index.adoc[SQL], xref:programming-languages:SQL:queries.adoc[queries], xref:programming-languages:SQL:joins.adoc[joins], xref:programming-languages:SQL:aggregate-functions.adoc[aggregate functions] - -.Items to submit -==== -- R code used to solve the problem. -- Output from running the code (including the graphic). -- 1-2 sentences explain what, if any, trends you see. 
-==== - -=== Question 3 - -Get the name and number of appearances (count of `person_id` from the `crew` table) of the top 15 people from the `people` table with the most number of appearances. - -**Relevant topics:** xref:programming-languages:SQL:index.adoc[SQL], xref:programming-languages:SQL:queries.adoc[queries], xref:programming-languages:SQL:joins.adoc[joins], xref:programming-languages:SQL:aggregate-functions.adoc[aggregate functions] - -.Items to submit -==== -- SQL used to solve the problem. -- Output from running the code. -==== - -=== Question 4 - -Wow! Those are some pretty large numbers! What if we asked the same question, but limited appearances to only items with at least 10000 votes? Write an SQL query, and compare the results. How did results shift? Are there any apparent themes in the results? - -**Relevant topics:** xref:programming-languages:SQL:index.adoc[SQL], xref:programming-languages:SQL:queries.adoc[queries], xref:programming-languages:SQL:joins.adoc[joins], xref:programming-languages:SQL:aggregate-functions.adoc[aggregate functions] - -.Items to submit -==== -- SQL used to solve the problem. -- Output from running the code. -==== - -=== Question 5 - -You've had a long time working with this database! Now it is time for you to come up with a question to ask yourself (about the database), answer the question using a combination of SQL and/or R. Using your results, create the most interested and tricked out graphic you can come up with! The base R plotting functions and a library called `ggplot` are the best tools for the job (at least in R). - -Too easy? Create multiple graphics on the same page (maybe for a different TV series or genre), and theme everything to look like a nice, finished product. - -Still too easy? Create a function in R that, given a `title_id`, queries the database and generates a customized graphic based on the movie or tv series provided. You could go as far as scraping an image from imdb.com and using it as a backsplash for your image. Get creative and make your masterpiece. - -**Relevant topics:** xref:programming-languages:SQL:index.adoc[SQL], xref:programming-languages:SQL:queries.adoc[queries], xref:programming-languages:SQL:joins.adoc[joins], xref:programming-languages:SQL:aggregate-functions.adoc[aggregate functions] - -.Items to submit -==== -- SQL used to solve the problem. -- Output from running the code. -==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-subquery-notes.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-subquery-notes.adoc deleted file mode 100644 index c4fc2663d..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2022-subquery-notes.adoc +++ /dev/null @@ -1,41 +0,0 @@ -= Think Summer: Subquery notes about how to use the results of one query within another query -- 2022 - -[source,sql] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -We can find the information from the episodes of Friends: - -[source,sql] ----- -%%sql -SELECT * FROM episodes WHERE show_title_id = 'tt0108778' LIMIT 5; ----- - -and then we can use this query as a subquery, to select the season numbers from the episodes of Friends: - -[source,sql] ----- -%%sql -SELECT s.season_number -FROM (SELECT * FROM episodes WHERE show_title_id = 'tt0108778') AS s LIMIT 5; ----- - -Here is another example. 
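The same pattern supports aggregation on the inner result as well. For instance, here is a hedged sketch that counts how many episodes of Friends are listed in each season, reusing the subquery above:

[source,sql]
----
%%sql
SELECT s.season_number, COUNT(*)
FROM (SELECT * FROM episodes WHERE show_title_id = 'tt0108778') AS s
GROUP BY s.season_number LIMIT 5;
----

The next example applies the same idea to movies instead of episodes.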
We can find the movies from 1989: - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE (premiered = '1989') AND (type = 'movie') LIMIT 5; ----- - -and then we can use this query as a subquery, to find the average length of movies from 1989, given in minutes: - -[source,sql] ----- -%%sql -SELECT AVG(s.runtime_minutes) -FROM (SELECT * FROM titles WHERE (premiered = '1989') AND (type = 'movie')) AS s LIMIT 5; ----- - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day1-notes-REEU.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day1-notes-REEU.adoc deleted file mode 100644 index 145a7bdd6..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day1-notes-REEU.adoc +++ /dev/null @@ -1,237 +0,0 @@ -= REEU: Day 1 Notes -- 2023 - -== Loading the database - -[source,sql] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -== Extracting a few rows from the each of the 6 tables - -[source,sql] ----- -%%sql -SELECT * FROM titles LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM episodes LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM people LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM ratings LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM crew LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM akas LIMIT 5; ----- - -== We can ask for more than 5 rows too. For instance, here we ask for 15 rows instead of 5 rows. - -[source,sql] ----- -%%sql -SELECT * FROM titles LIMIT 15; ----- - -== We can see how many rows were in each table, as follows: - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM titles LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM episodes LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM people LIMIT 5; ----- - -== We can also start to investigate individual people, for instance: - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE name = 'Ryan Reynolds' LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE name = 'Hayden Christensen' LIMIT 5; ----- - -== Say Anything is one of Dr Ward's favorite movies. We can find it here: - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE title_id = 'tt0098258' LIMIT 5; ----- - -== Friends is one of Dr Ward's favorite shows. We can find it here: - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE (primary_title = 'Friends') AND (premiered > 1992) LIMIT 5; ----- - -== We can investigate how many titles premiered in each year, by grouping things together according to the year that the title premiered, and by ordering the results according to the year that the title premiered. The "desc" specifies that we want the results in descending order, i.e., with the largest result first (where "largest" means the "last year", because we are ordering by the years). - -[source,sql] ----- -%%sql -SELECT COUNT(*), premiered FROM titles -GROUP BY premiered ORDER BY premiered DESC LIMIT 20; ----- - -== The Family Guy premiered in 1999 and ended in 2022. - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE title_id = 'tt0182576' LIMIT 5; ----- - -== Tobey Maguire was born in 1975 - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE person_id = 'nm0001497' LIMIT 5; ----- - -== Brent Spiner was born in 1949: -[source,sql] ----- -%%sql -SELECT * FROM people WHERE person_id = 'nm0000653' LIMIT 5; ----- - -== Brent Spiner was on the crew for 75 movies and TV shows (this may include individual episodes). 
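If you do not already know his `person_id`, a name lookup finds it first (a small aside, using the same pattern as the queries above):

[source,sql]
----
%%sql
SELECT * FROM people WHERE name = 'Brent Spiner' LIMIT 5;
----

With the `person_id` in hand, we can count his rows in the `crew` table: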
-[source,sql] ----- -%%sql -SELECT COUNT(*) FROM crew WHERE person_id = 'nm0000653' LIMIT 5; ----- - -== Jennifer Aniston was born in 1969: -[source,sql] ----- -%%sql -SELECT * FROM people WHERE name = 'Jennifer Aniston' LIMIT 5; ----- - -== There are a total of 8064259 titles in the titles table. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM titles LIMIT 5; ----- - -== There were 8107 people from IMDB born in 1976: -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM people WHERE born = 1976 LIMIT 5; ----- - -== We can get the number of people born in every year from 1976 onwards: -[source,sql] ----- -%%sql -SELECT COUNT(*), born FROM people WHERE born >= 1976 GROUP BY born LIMIT 5; ----- - -== Here are some tvSeries that premiered since the year 2000: -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE (premiered >= 2000) AND (type = 'tvSeries') LIMIT 5; ----- - -== These are the first 5 people in the people table. - -[source,sql] ----- -%%sql -SELECT * FROM people LIMIT 5; ----- - -== These are the first 5 episodes in the episodes table. - -[source,sql] ----- -%%sql -SELECT * FROM episodes LIMIT 5; ----- - -== These are the first 5 people in the crew table. - -[source,sql] ----- -%%sql -SELECT * FROM crew LIMIT 5; ----- - -== Only 3 movies have more than 2 million ratings - -[source,sql] ----- -%%sql -SELECT * FROM ratings WHERE votes > 2000000 LIMIT 5; ----- - -== Let's find how many people were born in each year (after 1850). This is part of Question 1. - -[source,sql] ----- -%%sql -SELECT COUNT(*), born FROM people WHERE born > 1850 -GROUP BY born LIMIT 200; ----- - -== The Family Guy has 374 episodes. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM episodes WHERE show_title_id = 'tt0182576' LIMIT 5; ----- - -== These are five of the films where George Lucas was on the crew. - -[source,sql] ----- -%%sql -SELECT * FROM crew WHERE person_id = 'nm0000184' LIMIT 5; ----- - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day1-notes-think-summer.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day1-notes-think-summer.adoc deleted file mode 100644 index 1f06d3202..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day1-notes-think-summer.adoc +++ /dev/null @@ -1,179 +0,0 @@ -= Think Summer: Day 1 Notes -- 2023 - -== Loading the database - -[source,sql] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -== Extracting a few rows from the each of the 6 tables - -[source,sql] ----- -%%sql -SELECT * FROM titles LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM episodes LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM people LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM ratings LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM crew LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM akas LIMIT 5; ----- - -== We can see how many rows were in each table, as follows: - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM titles LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM episodes LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM people LIMIT 5; ----- - -== We can also start to investigate individual people, for instance: - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE name = 'Ryan Reynolds' LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE name = 'Hayden Christensen' LIMIT 5; ----- - -== Friends is one of Dr Ward's favorite shows. 
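Several unrelated titles are also named `Friends`, so a search on `primary_title` alone will likely return more than one row (a quick peek):

[source,sql]
----
%%sql
SELECT * FROM titles WHERE primary_title = 'Friends' LIMIT 5;
----

Adding a condition on the premiere year narrows the search down to the sitcom we want.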
We can find it here: - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE (primary_title = 'Friends') AND (premiered > 1992) LIMIT 5; ----- - -== We can investigate how many titles premiered in each year, by grouping things together according to the year that the title premiered, and by ordering the results according to the year that the title premiered. The "desc" specifies that we want the results in descending order, i.e., with the largest result first (where "largest" means the "last year", because we are ordering by the years). - -[source,sql] ----- -%%sql -SELECT COUNT(*), premiered FROM titles -GROUP BY premiered ORDER BY premiered DESC LIMIT 20; ----- - -== The Family Guy premiered in 1999 and ended in 2022. - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE title_id = 'tt0182576' LIMIT 5; ----- - -== Tobey Maguire was born in 1975 - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE person_id = 'nm0001497' LIMIT 5; ----- - -== There are a total of 8064259 titles in the titles table. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM titles LIMIT 5; ----- - -== These are the first 5 people in the people table. - -[source,sql] ----- -%%sql -SELECT * FROM people LIMIT 5; ----- - -== These are the first 5 episodes in the episodes table. - -[source,sql] ----- -%%sql -SELECT * FROM episodes LIMIT 5; ----- - -== These are the first 5 people in the crew table. - -[source,sql] ----- -%%sql -SELECT * FROM crew LIMIT 5; ----- - -== Only 3 movies have more than 2 million ratings - -[source,sql] ----- -%%sql -SELECT * FROM ratings WHERE votes > 2000000 LIMIT 5; ----- - -== Let's find how many people were born in each year (after 1850). This is part of Question 1. - -[source,sql] ----- -%%sql -SELECT COUNT(*), born FROM people WHERE born > 1850 -GROUP BY born LIMIT 200; ----- - -== The Family Guy has 374 episodes. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM episodes WHERE show_title_id = 'tt0182576' LIMIT 5; ----- - -== These are five of the films where George Lucas was on the crew. - -[source,sql] ----- -%%sql -SELECT * FROM crew WHERE person_id = 'nm0000184' LIMIT 5; ----- - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day2-notes-REEU-and-think-summer.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day2-notes-REEU-and-think-summer.adoc deleted file mode 100644 index 10437eba3..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day2-notes-REEU-and-think-summer.adoc +++ /dev/null @@ -1,442 +0,0 @@ -= Think Summer: Day 2 Notes -- 2023 - -== Loading the database - -[source,sql] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - - -== Find and print the `title_id`, `rating`, and number of votes (`votes`) for all movies that received at least 2 million votes. -In a second query (and new cell), use the information you found in the previous query to identify the `primary_title` of these movies. - -These are the movies with at least 2 million votes: - -[source,sql] ----- -%%sql -SELECT * FROM ratings WHERE votes >= 2000000 LIMIT 5; ----- - -and then we can lookup their titles: - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE title_id = 'tt0111161' OR title_id = 'tt0468569' OR title_id = 'tt1375666' LIMIT 5; ----- - -Later today, we will learn an easier way to find the titles of the movies, by learning how to `JOIN` the information in two or more tables. - - - -== How many actors have lived to be more than 115 years old? 
Find the names, birth years, and death years for all actors and actresses who lived more than 115 years. - -We use the condition that `died-born` is bigger than 115 - -[source,sql] ----- -%%sql -SELECT *, died-born FROM people WHERE died-born > 115 LIMIT 10; ----- - -Now we can use the `COUNT` function to see that there are 7 such actors who lived more than 115 years. - -[source,sql] ----- -%%sql -SELECT COUNT(died-born) FROM people WHERE died-born > 115 LIMIT 5; ----- - - -== Use the `ratings` table to discover how many films have a rating of at least 8 and at least 50000 votes. In a separate cell, show 15 rows with this property. - -We can use conditions to ensure that rating and votes are large enough, -and then we can display 15 such results. - -[source,sql] ----- -%%sql -SELECT * FROM ratings WHERE (rating >= 8) AND (votes >= 50000) LIMIT 15; ----- - -Then we can use the `COUNT` function to see that there are 670 such titles altogether. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM ratings WHERE (rating >= 8) AND (votes >= 50000) LIMIT 15; ----- - - - - -== Find the `primary_title` of every _movie_ that is over 2 hours long or that premiered after 1990. Order the result from newest premiered year to oldest, and limit the output to 15 movies. Make sure `premiered` and `runtime_minutes` are not `NULL`. After displaying these 15 movies, run the query again in a second cell, but this time only display the number of such movies. - -We just add the conditions to the query about the titles table. - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE (type == 'movie') AND (runtime_minutes IS NOT NULL) AND (premiered IS NOT NULL) AND ((runtime_minutes > 120) OR (premiered > 1990)) ORDER BY premiered DESC LIMIT 15; ----- - -Now we can find the total number of such movies, using the `COUNT`: - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM titles WHERE (type == 'movie') AND (runtime_minutes IS NOT NULL) AND (premiered IS NOT NULL) AND ((runtime_minutes > 120) OR (premiered > 1990)) ORDER BY premiered DESC LIMIT 15; ----- - -This can be a helpful time to mention the concept of https://stackoverflow.com/questions/45231487/order-of-operation-for-and-and-or-in-sql-server-queries[order of operations] - -== What movie has the longest primary title? Answer this question using just SQL. - -You can read more about https://www.w3resource.com/sqlite/core-functions-length.php[SQLite length] - -We can use the `length` function, as follows: - -[source,sql] ----- -%%sql -SELECT *, length(primary_title) FROM titles ORDER BY length(primary_title) DESC LIMIT 5; ----- - -== What actor has the longest name? Answer this question using just SQL. - -[source,sql] ----- -%%sql -SELECT *, length(name) FROM people ORDER BY length(name) DESC LIMIT 5; ----- - - - - - -== We already mentioned that there are six tables in the database: `akas`, `crew`, `episodes`, `people`, `ratings`, `titles` - -Normally, when using SQLite, the easiest way to display the tables in the database is by running `.table` or `.tables`. This is SQLite-specific behavior and therefore cannot be used in our Jupyter Lab environment. Instead, to show the tables using an R cell, we can run the following. - -[source, sql] ----- -%%sql -SELECT - name -FROM - sqlite_master -WHERE - TYPE IN('table', 'view') - AND name NOT LIKE 'sqlite_%' -ORDER BY - 1; ----- - -Once we learn to use R to connect to the database, if `conn` is a database connection, we can just use the `dbListTables` command, to do the same thing. 
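A related trick, hedged because it is specific to SQLite: `PRAGMA table_info(...)` lists the columns of a single table, which is handy whenever you cannot remember a column name.

[source,sql]
----
%%sql
PRAGMA table_info(titles);
----

In R, making the connection and listing the tables looks like this: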
- -[source,r] ----- -%%R -library(RSQLite) -conn <- dbConnect(RSQLite::SQLite(), "/anvil/projects/tdm/data/movies_and_tv/imdb.db") -dbListTables(conn) ----- - -We also have xref:programming-languages:SQL:index.adoc[some additional information about SQL] posted in our book pages. - - -== Avoiding `NULL` values, and making calculations within our SQL queries - -We can start by loading the `titles` table. - -[source,sql] ----- -%%sql -SELECT * FROM titles LIMIT 5; ----- - -and then making sure that we avoid rows in which `premiered` is `NULL` and the rows in which `ended` is `NULL`. - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE (premiered IS NOT NULL) - AND (ended IS NOT NULL) LIMIT 5; ----- - -Then we can calculate the difference between the year that the show `ended` and the year that the show `premiered`. - -[source,sql] ----- -%%sql -SELECT *, ended-premiered FROM titles WHERE (premiered IS NOT NULL) - AND (ended IS NOT NULL) LIMIT 5; ----- - -We can given this new variable a name. For instance, we might use `mylength` to refer to the show's run on TV (in years). Then we can order the results by `mylength` in years, given in `DESC` (descending) order. - -[source,sql] ----- -%%sql -SELECT *, ended-premiered AS mylength FROM titles WHERE (premiered IS NOT NULL) - AND (ended IS NOT NULL) ORDER BY mylength DESC LIMIT 5; ----- - -For instance, this allows us to see that the show `Allen and Kendal` was running from 1940 to 2015, for a total of 75 years. - -== How long was Friends on TV? - -We can use the query above as a starting point, just looking up `Friends` as the title, and seeing which shows with that title were on TV after 1993. We see that `Friends` was on TV for 10 years. - -[source,sql] ----- -%%sql -SELECT *, ended-premiered AS mylength FROM titles -WHERE (premiered IS NOT NULL) AND (ended IS NOT NULL) -AND (primary_title = 'Friends') AND (premiered > 1993) LIMIT 5; ----- - -== How many types of titles are there? - -Here are a few of the types of titles - -[source,sql] ----- -%%sql -SELECT type FROM titles LIMIT 5; ----- - -There are lots of repeats, so we ask for `DISTINCT` types, i.e., removing the repetitions. - -[source,sql] ----- -%%sql -SELECT DISTINCT type FROM titles LIMIT 5; ----- - -and now we can ask for a few more, i.e., we can increase the limit. - -[source,sql] ----- -%%sql -SELECT DISTINCT type FROM titles LIMIT 100; ----- - -Looks like there are 12 types altogether: `short`, `movie`, `tvShort`, `tvMovie`, `tvSeries`, `tvEpisode`, `tvMiniSeries`, `tvSpecial`, `video`, `videoGame` `radioSeries`, `radioEpisode` - -[source,sql] ----- -%%sql -SELECT COUNT(DISTINCT type) FROM titles LIMIT 100; ----- - -== How many times did each type occur? - -We can group the types and count each of them. For instance, there are 5897385 tvEpisodes and there are 581731 movies. - -[source,sql] ----- -%%sql -SELECT COUNT(*), type FROM titles GROUP BY type LIMIT 100; ----- - -== How many times did each genre occur? - -At first, we view the genres as tuples, for instance, `Action,Adult` is a genre (separated by commas). We can do this the same as we did above, just changing the variable type to the variable genres. 
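One hedged aside first: some titles have no genre recorded at all. The exact representation (a `NULL` versus an empty string) may vary, so this sketch checks for both:

[source,sql]
----
%%sql
SELECT COUNT(*) FROM titles WHERE (genres IS NULL) OR (genres = '');
----

With that caveat noted, the grouped counts look like this: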
- -[source,sql] ----- -%%sql -SELECT COUNT(*), genres FROM titles GROUP BY genres LIMIT 100; ----- - -Now we see that there are 2283 such genres: - -[source,sql] ----- -%%sql -SELECT COUNT(DISTINCT genres) FROM titles LIMIT 5; ----- - -[TIP] -==== -We will come back to the question above, about the total number of genres, when we learn how to import SQL queries into R dataframes. -==== - - -== How many times has The Awakening been used as a title? - -The Awakening has been used 131 times as a title - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM titles WHERE primary_title = 'The Awakening' LIMIT 5; ----- - - - - - -== Now we can learn about how to `JOIN` the results of queries from two or more tables. Using a `JOIN` is a powerful way to leverage lots of information from a database, but it takes a little time to set things up properly. First, we revisit a question from yesterday, about the movies that received at least 2 million votes. We want to find the titles of those movies. - -We will need the `titles` table and the `ratings` table. - -[source,sql] ----- -%%sql -SELECT * FROM titles LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM ratings LIMIT 5; ----- - -Now we join these two tables, and restrict the results to those movies with at least 2000000 votes. - -[source,sql] ----- -%%sql -SELECT * FROM titles AS t JOIN ratings AS r -ON t.title_id = r.title_id WHERE votes > 2000000 LIMIT 5; ----- - -== What was the most popular movie (highest rating) in the year your Mom or Dad or aunt, etc., was born? - -The most popular movie that premiered in 1940 was The Great Dictator, with a rating of 8.4. It is a Charlie Chaplin movie that criticizes the dictators of the time, who were becoming very powerful in Europe. - -[source,sql] ----- -%%sql -SELECT * FROM titles AS t JOIN ratings AS r ON t.title_id = r.title_id - WHERE (t.premiered = 1940) AND (t.type = 'movie') ORDER BY r.rating DESC LIMIT 5; ----- - - - - -== How many episodes of Friends were there? - -We start by finding the `title_id` for Friends. - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE (primary_title = 'Friends') AND (premiered > 1992) LIMIT 5; ----- - -So now we know that `tt0108778` is the `show_title_id` for Friends. - -Now we find the number of episodes per season. To do this, we first find the episodes for Friends. - -[source,sql] ----- -%%sql -SELECT * FROM episodes WHERE show_title_id = 'tt0108778' LIMIT 5; ----- - -and then we group them by `season_number`, to make sure that our results make sense. - -[source,sql] ----- -%%sql -SELECT COUNT(*), season_number FROM episodes WHERE show_title_id = 'tt0108778' GROUP BY season_number; ----- - -Season 10 differs from what I expected (I was guessing that there would be 18 episodes), so I checked further on this. - -[source,sql] ----- -%%sql -SELECT * FROM episodes AS e JOIN titles AS t ON e.episode_title_id = t.title_id WHERE show_title_id = 'tt0108778' AND season_number = 10 ORDER BY episode_number; ----- - -OK so they combined The Last One, which is two episodes, into just one listing. - -So there are 235 episodes listed, although there were actually 236 episodes in the show altogether! - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM episodes WHERE show_title_id = 'tt0108778'; ----- - - - - -== Who are the actors and actresses in the TV show Friends? - -We will need the `people` table and the `crew` table. 
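The `crew` table tags each person with a `category` (actor, actress, director, and so on). A quick sketch with the `DISTINCT` keyword from earlier shows which category values occur:

[source,sql]
----
%%sql
SELECT DISTINCT category FROM crew LIMIT 20;
----

Here are the first few rows of the two tables we will join: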
- -[source,sql] ----- -%%sql -SELECT * FROM people LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM crew LIMIT 5; ----- - -Now we join these two tables together. - -[source,sql] ----- -%%sql -SELECT * FROM crew AS c JOIN people AS p ON c.person_id = p.person_id LIMIT 5; ----- - -and now we also join with the `titles` table, and we focus on the `title_id` for Friends, which is `tt0108778`. There are 10 people listed, from the Friends TV show. - -[source,sql] ----- -%%sql -SELECT * FROM titles AS t JOIN crew AS c ON t.title_id = c.title_id -JOIN people AS p ON c.person_id = p.person_id -WHERE t.title_id = 'tt0108778' LIMIT 50; ----- - -and 8 of them are actors or actresses - -[source,sql] ----- -%%sql -SELECT * FROM titles AS t JOIN crew AS c ON t.title_id = c.title_id -JOIN people AS p ON c.person_id = p.person_id -WHERE (t.title_id = 'tt0108778') -AND ((c.category = 'actress') OR (c.category = 'actor')) LIMIT 50; ----- - -== How many movies has Emma Watson appeared in? - -She has appeared in a total of 18 movies. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM titles AS t JOIN crew AS c ON t.title_id = c.title_id - JOIN people AS p ON c.person_id = p.person_id - WHERE (p.name = 'Emma Watson') AND (t.type = 'movie'); ----- - - - -== James Caan died in 2022. You can read his https://en.wikipedia.org/wiki/James_Caan[Wikipedia page] or his https://www.imdb.com/name/nm0001001/[IMDB page]. What was his highest rated movie? - -He appeared in The Godfather, which has a rating of 9.2 - -[source,sql] ----- -%%sql -SELECT * FROM titles AS t JOIN crew AS c ON t.title_id = c.title_id - JOIN people AS p ON c.person_id = p.person_id - JOIN ratings AS r ON t.title_id = r.title_id - WHERE (p.name = 'James Caan') AND (t.type = 'movie') ORDER BY r.rating DESC LIMIT 5; ----- - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day3-notes-REEU-and-think-summer.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day3-notes-REEU-and-think-summer.adoc deleted file mode 100644 index 9d82d0562..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day3-notes-REEU-and-think-summer.adoc +++ /dev/null @@ -1,376 +0,0 @@ -= Think Summer: Day 3 Notes -- 2023 - -Loading the R `data.table` library and loading the data from the 2005 airline data set - -[source,R] ----- -%%R -library(data.table) ----- - -[source,R] ----- -%%R -myDF <- fread("/anvil/projects/tdm/data/flights/subset/2005.csv") ----- - -A dataframe in R, by the way, is a lot like an Excel spreadsheet or a SQL table. A dataframe has columns of data, with one type of data in each column. The columns are usually long. In other words, there are usually many rows in the dataframe. - -These are the first few lines of the 2005 airline data set - -[source,R] ----- -%%R -head(myDF) ----- - -There are 7 million rows and 29 columns - -[source,R] ----- -%%R -dim(myDF) ----- - -The first few flights are departing from Boston or O'Hare - -[source,R] ----- -%%R -head(myDF$Origin) ----- - -The first few flights are arriving at Boston or O'Hare - -[source,R] ----- -%%R -head(myDF$Dest) ----- - -The last few flights are departing and arriving as follows: - -[source,R] ----- -%%R -tail(myDF$Origin) ----- - -[source,R] ----- -%%R -tail(myDF$Dest) ----- - -We can use `n=50` to get the destinations of the first 50 flights and the destinations of the last 50 flights. 
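Before that, a quick aside: `names` lists all 29 column names of the data frame, which is useful to keep on hand for the rest of these notes:

[source,R]
----
%%R
names(myDF)
----

Back to `head` and `tail` with the `n` argument: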
- -[source,R] ----- -%%R -head(myDF$Dest, n=50) ----- - -[source,R] ----- -%%R -tail(myDF$Dest, n=50) ----- - -If we find out how many times an airplane departed from each airport, we get these counts: - -[source,R] ----- -%%R -head(table(myDF$Origin), n=10) ----- - -Now we can sort those counts, in descending order (i.e., with the largest ones given first), and display the largest such 10 counts. - -[source,R] ----- -%%R -head(sort(table(myDF$Origin), decreasing=T), n=10) ----- - -Now we can display how many flights departed from each of the 10 most popular airports. - -[source,R] ----- -%%R -dotchart(head(sort(table(myDF$Origin), decreasing=T), n=10)) ----- - -We can extract the number of flights from specific airports, by looking the data up by the airports as indices. Note that we are only selecting from the 10 most popular airports here. - -[source,R] ----- -%%R -head(sort(table(myDF$Origin), decreasing=T), n=10)[c("ATL","CVG","ORD")] ----- - -Here is another example, in which we extract the number of flights from airports which may or may not be among the most popular 10 airports. - -[source,R] ----- -%%R -sort(table(myDF$Origin), decreasing=T)[c("EWR","IND","JFK","ORD")] ----- - -We can paste together the first 300 origin airports and the first 300 destination airports. - -[source,R] ----- -%%R -paste(head(myDF$Origin, n=300), head(myDF$Dest, n=300), sep="-") ----- - -Then we can tabulate how many times each such flight path was flown. - -[source,R] ----- -%%R -table(paste(head(myDF$Origin, n=300), head(myDF$Dest, n=300), sep="-")) ----- - -Now that this works, we can remove the heads on each of those data sets. Then we can tabulate the number of times that every flight path was used, and sort those results, and finally we can display the 100 most popular flight paths overall. - -[source,R] ----- -%%R -head(sort(table(paste(myDF$Origin, myDF$Dest, sep="-")), decreasing=T), n=100) ----- - -When we use the `table` function in R, in the result, we have a row of names followed by a row of data. Then we have another row of names followed by a row of data, etc., etc. R always displays data from a table in this way, namely, by alternating a row of names and a row of data. You can think about how things would look different (and easier) if your screen was really, really wide, and there were only two rows displayed, namely, the names and the data. - -These are the airline carriers for the first 6 flights. - -[source,R] ----- -%%R -head(myDF$UniqueCarrier) ----- - -We can see how many flights were flown with each carrier. - -[source,R] ----- -%%R -sort(table(myDF$UniqueCarrier), decreasing=T) ----- - -The overall average departure delay, across all flights, is 8.67 minutes: - -[source,R] ----- -%%R -mean(myDF$DepDelay, na.rm=T) ----- - -We can just restrict attention to the average departure delay for flights departing from `IND` or from `JFK`. - -[source,R] ----- -%%R -mean(myDF$DepDelay[myDF$Origin=="IND"], na.rm=T) ----- - -[source,R] ----- -%%R -mean(myDF$DepDelay[myDF$Origin=="JFK"], na.rm=T) ----- - -These are the first 100 departure delays for flights from Indianapolis to Chicago. 
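We can also average the delays on that one route by combining the two conditions (a small sketch, using the same columns as above):

[source,R]
----
%%R
mean(myDF$DepDelay[(myDF$Origin=="IND") & (myDF$Dest=="ORD")], na.rm=T)
----

And the individual delays themselves look like this: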
- -[source,R] ----- -%%R -head(myDF$DepDelay[(myDF$Origin=="IND") & (myDF$Dest=="ORD")], n=100) ----- - -The first 6 department delays for flights from Boston or flights from Indianapolis are: - -[source,R] ----- -%%R -head(myDF$DepDelay[myDF$Origin == "BOS"]) ----- - -[source,R] ----- -%%R -head(myDF$DepDelay[myDF$Origin == "IND"]) ----- - -We could make a table of departure delays for flights from Indianapolis: - -[source,R] ----- -%%R -table(myDF$DepDelay[myDF$Origin == "IND"]) ----- - -and we can plot the distribution of departure delays: - -[source,R] ----- -%%R -plot(table(myDF$DepDelay[myDF$Origin == "IND"])) ----- - -and we can add conditions to this. For instance, if we only want to see the distribution of delays that are less than 1 hour: - -[source,R] ----- -%%R -plot(table(myDF$DepDelay[(myDF$Origin == "IND") & (myDF$DepDelay < 60)])) ----- - - - - -Now we switch gears and load the donation data from federal election campaigns in 2000. This data is described here: -https://www.fec.gov/campaign-finance-data/contributions-individuals-file-description/[Contributions by individuals file description] - -[source,R] ----- -%%R -myDF <- fread("/anvil/projects/tdm/data/election/itcont2000.txt") ----- - -The first several rows of election data are: - -[source,R] ----- -%%R -head(myDF) ----- - -There are 1.6 million rows and 21 columns - -[source,R] ----- -%%R -dim(myDF) ----- - -Altogether, there were 1.8 billion dollars in contributions - -[source,R] ----- -%%R -sum(myDF$TRANSACTION_AMT) ----- - -The largest number of contributions (regardless of the size of the contributions) were made by residents of `CA`, `NY`, `TX`, etc. - -[source,R] ----- -%%R -sort(table(myDF$STATE), decreasing=T) ----- - -We can paste the first 6 cities and the first 6 states together, using the `paste` function: - -[source,R] ----- -%%R -head(myDF$CITY) ----- - -[source,R] ----- -%%R -head(myDF$STATE) ----- - -[source,R] ----- -%%R -paste(head(myDF$CITY), head(myDF$STATE)) ----- - -Then we can tabulate how many times those 6 city-state pairs occur, and sort the results, and display the head. - -[source,R] ----- -%%R -head(sort(table(paste(head(myDF$CITY), head(myDF$STATE))), decreasing=T)) ----- - -Now that this works for the first 6 city-state pairs, we can do this again for the entire data set. We see that the most donations were made from some typically large cities. There are also a lot of donations from unknown locations. - -[source,R] ----- -%%R -head(sort(table(paste(myDF$CITY, myDF$STATE)), decreasing=T)) ----- - -Here are the names of the people who made the largest number of contributions (regardless of the size of the contributions themselves) - -[source,R] ----- -%%R -head(sort(table(myDF$NAME), decreasing=T)) ----- - -Now we can learn how to use the `tapply` function. - -The `tapply` function takes three things, namely, some data, some groups to sort the data, and a function to run on the data. - -For instance, we can take the data about the election transaction amounts, and split the data according the state where the donation was made, and sum the dollar amounts of those election donations within each state. 
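As a tiny warm-up (with made-up numbers, just to show the three pieces in order, namely the data, the groups, and the function):

[source,R]
----
%%R
# toy example: sum the values 1,2,3,4 within the groups a,a,b,b
tapply(c(1, 2, 3, 4), c("a", "a", "b", "b"), sum)
----

Applied to the election data, grouped by state: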
- -[source,R] ----- -%%R -head(sort(tapply(myDF$TRANSACTION_AMT, myDF$STATE, sum), decreasing=T)) ----- - -We can do something similar, now summing the amounts of the transactions in dollars, splitting the data according to the name of the donor: - -[source,R] ----- -%%R -head(sort(tapply(myDF$TRANSACTION_AMT, myDF$NAME, sum), decreasing=T), n=20) ----- - -Now we return to the airline data set from 2005: - -[source,R] ----- -%%R -myDF <- fread("/anvil/projects/tdm/data/flights/subset/2005.csv") ----- - -We can take an average of the departure delays, split according to the airline for the flights: - -[source,R] ----- -%%R -tapply( myDF$DepDelay, myDF$UniqueCarrier, mean, na.rm=T ) ----- - -We can sum the distances of the flights according to the airports where the flights departed: - -[source,R] ----- -%%R -head(sort( tapply( myDF$Distance, myDF$Origin, sum ), decreasing=T )) ----- - -We can take an average of the arrival delays according to the destination where the flights landed. - -[source,R] ----- -%%R -head(sort( tapply( myDF$ArrDelay, myDF$Dest, mean, na.rm=T ), decreasing=T )) ----- - - - - - - - - - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day4-notes-REEU-and-think-summer.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day4-notes-REEU-and-think-summer.adoc deleted file mode 100644 index 988822f0e..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day4-notes-REEU-and-think-summer.adoc +++ /dev/null @@ -1,209 +0,0 @@ -= Think Summer: Day 4 Notes -- 2023 - -We always need to re-load the libraries, if our kernel dies or if we start an all-new session. - -Loading the R `data.table` library - -[source,R] ----- -%%R -library(data.table) ----- - -== Loading the R library for SQL, and loading the database - -We need to run this, to make a connection to the database at the start. If something goes wrong with our database queries, we can always come back and run these two lines again. Ideally, we should only need to run these once per session, but sometimes we make mistakes, and our kernel dies, and we need to run these lines again. - -[source,R] ----- -%%R -library(RSQLite) -conn <- dbConnect(RSQLite::SQLite(), "/anvil/projects/tdm/data/movies_and_tv/imdb.db") ----- - -== Importing data from SQL to R - -For example, we can import the number of `titles` per year from SQL into R. (We are doing the work in SQL of finding out how many titles occurred in each year.) - -[source,R] ----- -%%R -myDF <- dbGetQuery(conn, "SELECT COUNT(*), premiered FROM titles GROUP BY premiered;") ----- - -Let's first look at the `head` of the result: - -[source,R] ----- -%%R -head(myDF) ----- - -We can assign names to the columns of the data frame: - -[source,R] ----- -%%R -names(myDF) <- c("mycounts", "myyears") ----- - -and now the `head` of the data frame looks like this: - -[source,R] ----- -%%R -head(myDF) ----- - -Finally, we are prepared to plot the number of titles per year. We plot the years on the x-axis and the counts on the y-axis: - -[source,R] ----- -%%R -plot(myDF$myyears, myDF$mycounts) ----- - -Another way to do this is to import all of the years that titles premiered, and then make a table in `R` and plot the table. (This time, we are doing the work in R of finding out how many titles occurred in each year.) 
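Before moving on, a hedged aside about the first approach: the renaming step can also happen inside the query itself, using `AS` aliases, so the data frame arrives with the names already set.

[source,R]
----
%%R
myDF <- dbGetQuery(conn, "SELECT COUNT(*) AS mycounts, premiered AS myyears FROM titles GROUP BY premiered;")
head(myDF)
----

Now for the all-in-R approach; first we pull just the premiere years: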
- -[source,R] ----- -%%R -myDF <- dbGetQuery(conn, "SELECT premiered FROM titles;") ----- - -[source,R] ----- -%%R -head(myDF) ----- - -[source,R] ----- -%%R -tail(myDF) ----- - -Now we make a table of the results: - -[source,R] ----- -%%R -table(myDF$premiered) ----- - -and we can plot the results: - -[source,R] ----- -%%R -plot(table(myDF$premiered)) ----- - -== How many genres are in the titles table? - -Here are the first few genres from the genres column: - -[source,R] ----- -%%R -myDF <- dbGetQuery(conn, "SELECT genres FROM titles;") -head(myDF$genres) ----- - -We use the `head` so that we can keep the output relatively small and manageable. Now we can remove duplicates: - -[source,R] ----- -%%R -unique(head(myDF$genres)) ----- - -and see how many `unique` values there are, in the `head`: - -[source,R] ----- -%%R -length(unique(head(myDF$genres))) ----- - -Now that this works well, we can remove the `head` restriction, and see that there are 2283 unique genres in the table altogether. Remember that each genre is actually a tuple of genres, for instance, - -[source,R] ----- -%%R -length(unique(myDF$genres)) ----- - -By the way, as a side note, we could verify this directly in SQL this way: - -[source,R] ----- -%%R -dbGetQuery(conn, "SELECT COUNT(DISTINCT genres) FROM titles;") ----- - -Now we can focus on separating the genres into their individual genres. Remember that they are combined, using commas, in the format that we originally have. Here are the first few genres: - -[source,R] ----- -%%R -head(myDF$genres) ----- - -Now we split them according to the commas in each: - -[source,R] ----- -%%R -strsplit(head(myDF$genres), ",") ----- - -This will be new for many/most of you, but we can `unlist` them in R, so that they are not listed separately anymore, but instead, they are in one big vector. - -[source,R] ----- -%%R -unlist(strsplit(head(myDF$genres), ",")) ----- - -and now we can use `unique` to see a list of the genres, removing any duplications: - -[source,R] ----- -%%R -unique(unlist(strsplit(head(myDF$genres), ","))) ----- - -Since this works on the `head`, we can remove the `head` now, and see the 29 such genres. Notice that the 21st such genre is missing, i.e., it is empty, so we do not know the genres for some of the titles. - -[source,R] ----- -%%R -unique(unlist(strsplit(myDF$genres, ","))) ----- - -If we want to know how many times each genre appears, we can use the `table` function instead of the `unique` function. 
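(A related aside, as a hedged sketch: to count how many titles mention one particular genre, `grepl` on the original comma-separated strings also works, for example with Comedy:)

[source,R]
----
%%R
# counts the titles whose genres string contains "Comedy"
sum(grepl("Comedy", myDF$genres))
----

The counts for every genre at once come from `table`: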
- -[source,R] ----- -%%R -table(unlist(strsplit(myDF$genres, ","))) ----- - -We can make a dotchart of those results - -[source,R] ----- -%%R -dotchart(table(unlist(strsplit(myDF$genres, ",")))) ----- - -and it would likely help to put the results in the dotchart into sorted order - -[source,R] ----- -%%R -dotchart(sort(table(unlist(strsplit(myDF$genres, ","))))) ----- - - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day5-notes-REEU.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day5-notes-REEU.adoc deleted file mode 100644 index d1ba07050..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-day5-notes-REEU.adoc +++ /dev/null @@ -1,109 +0,0 @@ -= REEU: Day 5 Notes -- 2023 - -We are using the `seminar-r` kernel today - -Loading the R `leaflet` library and `sf` library - -[source,R] ----- -library(leaflet) -library(sf) ----- - -and setting options for the display in Jupyter Lab: - -[source,R] ----- -options(jupyter.rich_display = T) ----- - -Here are three sample points: - -[source,R] ----- -testDF <- data.frame(c(40.4259, 41.8781, 39.0792), c(-86.9081, -87.6298, -84.17704)) ----- - -Let's name the columns as `lat` and `long` - -[source,R] ----- -names(testDF) <- c("lat", "long") ----- - -Now we can define the points to plot: - -[source,R] ----- -points <- st_as_sf( testDF, coords=c("long", "lat"), crs=4326) ----- - -and render the map: - -[source,R] ----- -addCircleMarkers(addTiles(leaflet( testDF )), radius=1) ----- - -== Craigslist example - -Now we can try this with Craigslist data - -First we load the `data.table` library - -[source,R] ----- -library(data.table) ----- - -Now we read in some Craigslist data. This takes some time: - -[source,R] ----- -myDF <- fread("/anvil/projects/tdm/data/craigslist/vehicles.csv", - stringsAsFactors = TRUE) ----- - -We can look at the head of the data: - -[source,R] ----- -head(myDF) ----- - -and the names of the variables in the data: - -[source,R] ----- -names(myDF) ----- - -Here are the Craiglist listings from Indiana: - -[source,R] ----- -indyDF <- subset(myDF, state=="in") ----- - -and we want to make sure that the `long` and `lat` values are not missing: - -[source,R] ----- -testDF <- indyDF[ (!is.na(indyDF$long)) & - (!is.na(indyDF$lat))] ----- - -Now we set the points to be plotted: - -[source,R] ----- -points <- st_as_sf( testDF, coords=c("long", "lat"), crs=4326) ----- - -and we draw the map: - -[source,R] ----- -addCircleMarkers(addTiles(leaflet( testDF )), radius=1) ----- - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-01-REEU-and-think-summer.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-01-REEU-and-think-summer.adoc deleted file mode 100644 index 1cabd54f9..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-01-REEU-and-think-summer.adoc +++ /dev/null @@ -1,126 +0,0 @@ -= Think Summer: Project 1 -- 2023 - -== How many people (from the people table) were born in each year? In which year were the most people born? 
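A quick check (hedged, since we have not inspected the data yet) shows why the `born` column needs a little care before grouping:

[source,sql]
----
%%sql
SELECT MIN(born), MAX(born) FROM people;
----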
- -There are plenty of erroneous years in the born variable, so we limit our results to years after 1750 (we chose this arbitrarily): - -[source,sql] ----- -%%sql -SELECT COUNT(*), born FROM people WHERE born >= 1750 GROUP BY born; ----- - -The most births in any year is 8882 births in the year 1980: - -[source,sql] ----- -%%sql -SELECT COUNT(*), born FROM people WHERE born >= 1750 GROUP BY born HAVING COUNT(*) > 8800; ----- - - - -== How many episodes did the show Sopranos have? Pick another TV show; how many episodes did this show have? - -The Sopranos has 86 episodes. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM episodes WHERE show_title_id = 'tt0141842'; ----- - -Friends has 235 episodes. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM episodes WHERE show_title_id = 'tt0108778'; ----- - -Downton Abbey has 52 episodes. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM episodes WHERE show_title_id = 'tt1606375'; ----- - - - - -== How many crews include George Lucas? Pick your own favorite director, actor, or actress: How many crews include that person? - -George Lucas is a member of 761 crews. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM crew WHERE person_id = 'nm0000184'; ----- - -Tobey Maguire is a member of 154 crews. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM crew WHERE person_id = 'nm0001497'; ----- - -Jennifer Aniston is a member of 992 crews. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM crew WHERE person_id = 'nm0000098'; ----- - - - -== How many titles have 1 million or more ratings? How many titles have 50,000 or fewer ratings? - -There are 48 titles with 1 million or more ratings. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM ratings WHERE votes >= 1000000; ----- - -There are 1166146 titles with 50000 or fewer ratings. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM ratings WHERE votes <= 50000; ----- - - - - -== In what year was the actress Gal Gadot born? How about Mark Hamill? How about your favorite actor or actress? - -Gal Gadot was born in 1985. - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE person_id = 'nm2933757'; ----- - -Mark Hamill was born in 1951. - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE person_id = 'nm0000434'; ----- - -Jennifer Aniston was born in 1969. - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE person_id = 'nm0000098'; ----- - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-02-REEU.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-02-REEU.adoc deleted file mode 100644 index 164f0607b..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-02-REEU.adoc +++ /dev/null @@ -1,40 +0,0 @@ -= REEU: Project Day 2 -- 2023 - -== How many crews include George Lucas in the role of `director`? - - - -== How many shows have more than 10000 episodes? Hint: Use the `episodes` table, and `GROUP BY` the `show_title_id` and use the condition `HAVING COUNT(*) > 10000` at the end of the query. - - -== What are the 3 most popular episodes of Friends? Please include the title of each episode. Please verify your answer by double-checking with IMDB. - -(By "popularity", you can choose to either analyze the ratings or the number of votes; either way is OK with us!) - -Hint: Friends has `show_title_id = tt0108778`. 
- -Another hint: When you join the `episodes` table and the `ratings` table, you might want to add the condition `e.episode_title_id = r.title_id` - -Another hint: You might want to have `ORDER BY r.rating DESC LIMIT 3` at the end of your query, so that you are ordering the results by the ratings, and putting them in descending order (with the biggest at the top). - - -== Identify the 6 movies that have rating 9 or higher and have 50000 or more votes. - - -== For how many movies has Sean Connery been on the `crew`? - - -== Revisiting the question about George Lucas: What are the titles of the movies in which George Lucas had the role of `director`? - - -== Revisiting the question about the shows that have more than 10000 episodes, please find the `primary_title` of these shows. - - - -== Revisiting the question about how many movies has Sean Connery been on the `crew`: Which movie is his most popular movie? - - -== Again revisiting the previous question about the popularity of movies for which Sean Connery has been on the `crew`: Which movie is his most popular movie, if we limit our query to results with 1000 or more votes? What is the title of the most popular result? - -(For this question, use `rating` for popularity, i.e., please focus on high `rating` values.) - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-02-think-summer.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-02-think-summer.adoc deleted file mode 100644 index e01d35517..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-02-think-summer.adoc +++ /dev/null @@ -1,192 +0,0 @@ -= Think Summer: Project 2 -- 2023 - -== How many crews include George Lucas in the role of `director`? - -George Lucas is in the role of director in 18 crews. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM crew WHERE person_id = 'nm0000184' AND category = 'director'; ----- - -Alternatively, if we do not want to manually lookup George Lucas's `person_id` value, we could solve this one as follows: - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM crew AS c JOIN people AS p ON c.person_id = p.person_id WHERE p.name = 'George Lucas' and c.category = 'director'; ----- - - - - - -== How many shows have more than 10000 episodes? Hint: Use the `episodes` table, and `GROUP BY` the `show_title_id` and use the condition `HAVING COUNT(*) > 10000` at the end of the query. - -There are three shows that each have more than 10000 episodes. - -[source,sql] ----- -%%sql -SELECT COUNT(*), show_title_id FROM episodes GROUP BY show_title_id HAVING COUNT(*) > 10000 LIMIT 5; ----- - - - - - -== What are the 3 most popular episodes of Friends? Please include the title of each episode. Please verify your answer by double-checking with IMDB. - -(By "popularity", you can choose to either analyze the ratings or the number of votes; either way is OK with us!) - -Hint: Friends has `show_title_id = tt0108778`. - -Another hint: When you join the `episodes` table and the `ratings` table, you might want to add the condition `e.episode_title_id = r.title_id` - -Another hint: You might want to have `ORDER BY r.rating DESC LIMIT 3` at the end of your query, so that you are ordering the results by the ratings, and putting them in descending order (with the biggest at the top). 
- - -[source,sql] ----- -%%sql -SELECT * FROM episodes AS e JOIN ratings AS r ON e.episode_title_id = r.title_id WHERE show_title_id = 'tt0108778' ORDER BY r.rating DESC LIMIT 3; ----- - -If we want to know the titles of the episodes, then we can `JOIN` with the `titles` table. - -[source,sql] ----- -%%sql -SELECT primary_title, season_number, episode_number, rating, votes -FROM episodes AS e -JOIN ratings AS r ON e.episode_title_id = r.title_id -JOIN titles AS t ON e.episode_title_id = t.title_id -WHERE show_title_id = 'tt0108778' -ORDER BY r.rating DESC LIMIT 3; ----- - - - - - -== Identify the 6 movies that have rating 9 or higher and have 50000 or more votes. - -We `JOIN` the `titles` and the `ratings` tables, with the conditions on the `type` and `votes` and `rating`, and we get the required 6 movies. - -[source,sql] ----- -%%sql -SELECT * FROM titles AS t JOIN ratings AS r ON t.title_id = r.title_id WHERE t.type = 'movie' AND r.votes > 50000 and r.rating >= 9 ORDER BY r.rating DESC LIMIT 10; ----- - - - - - - -== For how many movies has Sean Connery been on the `crew`? - -We `JOIN` the `crew` table and the `people` table and the `titles` table, and we discover that Sean Connery was in 69 crews. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM crew AS c -JOIN people AS p ON c.person_id = p.person_id -JOIN titles AS t ON c.title_id = t.title_id -WHERE p.name = 'Sean Connery' AND t.type = 'movie'; ----- - - - - - -== Revisiting the question about George Lucas: What are the titles of the movies in which George Lucas had the role of `director`? - -We can include a `JOIN` with the `titles` table, so that we know the titles; here, we are limiting the titles to movies so we only have 6 titles, instead of the original 18 titles: - -[source,sql] ----- -%%sql -SELECT * FROM crew AS c JOIN titles AS t ON c.title_id = t.title_id WHERE c.person_id = 'nm0000184' AND c.category = 'director' AND t.type = 'movie'; ----- - - - - - - -== Revisiting the question about the shows that have more than 10000 episodes, please find the `primary_title` of these shows. - -We can `JOIN` with the titles table, if we want to get the names of the shows. - -[source,sql] ----- -%%sql -SELECT COUNT(*), show_title_id, t.primary_title FROM episodes AS e JOIN titles AS t ON e.show_title_id = t.title_id GROUP BY show_title_id HAVING COUNT(*) > 10000 LIMIT 5; ----- - - - - - - - - - -== Revisiting the question about how many movies has Sean Connery been on the `crew`: Which movie is his most popular movie? - -(By "popularity", you can choose to either analyze the ratings or the number of votes; either way is OK with us!) 
- -We can now also `JOIN` the `rating` table, and we see that Sean Connery's highest ratest movie, if we consider the `rating` (rather than the `votes`) is `Ever to Excel` - -[source,sql] ----- -%%sql -SELECT * FROM crew AS c -JOIN people AS p ON c.person_id = p.person_id -JOIN titles AS t ON c.title_id = t.title_id -JOIN ratings AS r ON c.title_id = r.title_id -WHERE p.name = 'Sean Connery' AND t.type = 'movie' -ORDER BY r.rating DESC LIMIT 1; ----- - -but if we instead consider the number of votes, then his highest ratest movie is `Indiana Jones and the Last Crusade` - -[source,sql] ----- -%%sql -SELECT * FROM crew AS c -JOIN people AS p ON c.person_id = p.person_id -JOIN titles AS t ON c.title_id = t.title_id -JOIN ratings AS r ON c.title_id = r.title_id -WHERE p.name = 'Sean Connery' AND t.type = 'movie' -ORDER BY r.votes DESC LIMIT 1; ----- - - - - - - - - -== Again revisiting the previous question about the popularity of movies for which Sean Connery has been on the `crew`: Which movie is his most popular movie, if we limit our query to results with 1000 or more votes? What is the title of the most popular result? - -(For this question, use `rating` for popularity, i.e., please focus on high `rating` values.) - -Adding the condition `r.votes >= 1000`, we see that `Indiana Jones and the Last Crusade` is the most popular by this measure. - -[source,sql] ----- -%%sql -SELECT * FROM crew AS c -JOIN people AS p ON c.person_id = p.person_id -JOIN titles AS t ON c.title_id = t.title_id -JOIN ratings AS r ON c.title_id = r.title_id -WHERE p.name = 'Sean Connery' AND t.type = 'movie' -AND r.votes >= 1000 -ORDER BY r.rating DESC LIMIT 1; ----- - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-03-REEU.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-03-REEU.adoc deleted file mode 100644 index 81bea00e3..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-03-REEU.adoc +++ /dev/null @@ -1,42 +0,0 @@ -= REEU: Project Day 3 -- 2023 - -=== Question 1 - -How many flights departed from O'Hare (`ORD`) and landed in Denver (`DEN`)? - -=== Question 2 - -What is the Distance (in miles) of any individual flight from O'Hare to Denver? - -=== Question 3 - -Consider only the flights that arrive to Honolulu (airport code `HNL`), i.e., for which Honolulu is the destination. What are the 10 most popular origin airports? - -=== Question 4 - -Each airplane has a unique `TailNum`. Which airplane flew the most flights in 2005? Hint: If you strictly look at the `TailNum` values, you will need to ignore the top two results, because they are missing data. - -=== Question 5 - -Which airplane flew the largest number of times from O'Hare to Denver? Hint: Again, if you strictly look at the `TailNum` values, you will need to ignore the top result, because it has missing data. - -=== Question 6 - -What were the 10 most popular days of the year to fly in 2005 (where "popularity" is evaluated by the number of flights on that day)? Hint: You might paste together the year, month, and day of the flights. - -=== Question 7 - -In which month are the average departure delays the worst? Hint: You might use a tapply function. - -=== Question 8 - -Make a `dotchart` that illustrates the data from the previous question (about the average flight delays in each month). - -=== Question 9 - -The `UniqueCarrier` is the airline carrier for the flight. 
If you add all of the miles flown (in the `Distance`) for each airline carrier, which carrier flew the most miles altogether in 2005? Hint: You might use a tapply function. - -=== Question 10 - -Create your own interesting question about the 2005 flight data. What insights can you find? - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-03-think-summer.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-03-think-summer.adoc deleted file mode 100644 index f32dfa0fc..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-03-think-summer.adoc +++ /dev/null @@ -1,119 +0,0 @@ -= Think Summer: Project 3 -- 2023 - -== Submission - -Students need to submit the following file **by 10:00PM EST** through Gradescope inside Brightspace. - -. A Jupyter notebook (a `.ipynb` file). - -We've provided you with a template notebook for you to use. Please carefully read xref:summer2023summer-2023-project-template.adoc[this section] to get started. - -[CAUTION] -==== -When you are finished with the project, please make sure to run every cell in the notebook prior to submitting. To do this click menu:Run[Run All Cells]. Next, to export your notebook (your `.ipynb` file), click on menu:File[Download], and download your `.ipynb` file. -==== - -== Questions - -=== Question 1 - -How many flights departed from O'Hare (`ORD`) and landed in Denver (`DEN`)? - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 2 - -What is the Distance (in miles) of any individual flight from O'Hare to Denver? - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 3 - -Consider only the flights that arrive to Honolulu (airport code `HNL`), i.e., for which Honolulu is the destination. What are the 10 most popular origin airports? - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 4 - -Each airplane has a unique `TailNum`. Which airplane flew the most flights in 2005? Hint: If you strictly look at the `TailNum` values, you will need to ignore the top two results, because they are missing data. - - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 5 - -Which airplane flew the largest number of times from O'Hare to Denver? Hint: Again, if you strictly look at the `TailNum` values, you will need to ignore the top result, because it has missing data. - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 6 - -What were the 10 most popular days of the year to fly in 2005 (where "popularity" is evaluated by the number of flights on that day)? Hint: You might paste together the year, month, and day of the flights. - - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 7 - -In which month are the average departure delays the worst? Hint: You might use a tapply function. - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 8 - -Make a `dotchart` that illustrates the data from the previous question (about the average flight delays in each month). - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. 
_(1 pt)_ -==== - -=== Question 9 - -The `UniqueCarrier` is the airline carrier for the flight. If you add all of the miles flown (in the `Distance`) for each airline carrier, which carrier flew the most miles altogether in 2005? Hint: You might use a tapply function. - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 10 - -Create your own interesting question about the 2005 flight data. What insights can you find? - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-04-REEU.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-04-REEU.adoc deleted file mode 100644 index df48a5fa0..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-04-REEU.adoc +++ /dev/null @@ -1,111 +0,0 @@ -= REEU: Project 4 -- 2023 - -=== Question 1 - -For the show The Handmaid's Tale (`title_id` is tt5834204), there are 4 seasons listed in the IMDB database. Find the average rating of each of the four seasons. Hint: Use `AVG` for find the average, and use `GROUP BY` the `season_number`. - -.Items to submit -==== -- SQL used to solve this problem. _(2 pts)_ -- Output from running SQL. _(1 pt)_ -==== - -=== Question 2 - -Identify the six most popular episodes of the show Grey's Anatomy (where "popular" denotes a high rating). - -.Items to submit -==== -- SQL used to solve this problem. _(2 pts)_ -- Output from running SQL. _(1 pt)_ -==== - -=== Question 3 - -Make a dotchart in R showing the results of the previous question. -Hint: You can use your work from SQL, and export the results to a dataframe called `myDF` in R. Then you can use something like: - -[source,R] ----- -# use a dbGetQuery here, to import the SQL results to R, and then -myresults <- myDF$rating -names(myresults) <- myDF$primary_title -dotchart(myresults) ----- - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 4 - -Make a dotchart showing the total amount of money donated in each of the top 10 states, during the 2000 federal election cycle. - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 5 - -Make a dotchart that shows how many movies premiered in each year. Now make another dotchart, which shows the same data (i.e., how many movies premiered each year) but only since the year 2000. - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 6 - -Among the three big New York City airports (`JFK`, `LGA`, `EWR`), which of these airports had the worst `DepDelay` (on average) in 2005? (Can you solve this with 1 line of R, using a `tapply` (rather than using 3 separate lines of R)? Hint: After you run the `tapply`, you can index your results using `[c("JFK", "LGA", "EWR")]` to lookup all 3 airports at once.) - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 7 - -`LIKE` is a very powerful tool. You can read about SQLite's version of `LIKE` https://www.w3resource.com/sqlite/core-functions-like.php[here]. Use `LIKE` to analyze the `primary_title` of all IMDB titles: First determine how many titles have `Batman` anywhere in the title, and then determine how many titles have `Superman` anywhere in the title? 
Which one occurs more often? - -.Items to submit -==== -- SQL used to solve this problem. _(2 pts)_ -- Output from running SQL. _(1 pt)_ -==== - -=== Question 8 - -How much money was donated during the 2000 federal election cycle by people who have `PURDUE` listed somewhere in their employer name? How much money was donated by people who have `MICROSOFT` listed somewhere in their employer name? Hint: You might use the `grep` or the `grepl` (which is a logical grep) to solve this one. - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 9 - -How much money was donated during the 2000 federal election cycle by people from your hometown? (Be sure to match the city and the state.) - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 10 - -Create your own interesting question based on the things you have learned during these two weeks with Dr Ward. What insights can you find? - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-04-think-summer.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-04-think-summer.adoc deleted file mode 100644 index ca591792c..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-04-think-summer.adoc +++ /dev/null @@ -1,126 +0,0 @@ -= Think Summer: Project 4 -- 2023 - -== Submission - -Students need to submit the following file **by 10:00PM EST** through Gradescope inside Brightspace. - -. A Jupyter notebook (a `.ipynb` file). - -We've provided you with a template notebook for you to use. Please carefully read xref:summer2023/summer-2023-project-template.adoc[this section] to get started. - -[CAUTION] -==== -When you are finished with the project, please make sure to run every cell in the notebook prior to submitting. To do this click menu:Run[Run All Cells]. Next, to export your notebook (your `.ipynb` file), click on menu:File[Download], and download your `.ipynb` file. -==== - -== Questions - -=== Question 1 - -For the show The Handmaid's Tale (`title_id` is tt5834204), there are 4 seasons listed in the IMDB database. Find the average rating of each of the four seasons. Hint: Use `AVG` for find the average, and use `GROUP BY` the `season_number`. - -.Items to submit -==== -- SQL used to solve this problem. _(2 pts)_ -- Output from running SQL. _(1 pt)_ -==== - -=== Question 2 - -Identify the six most popular episodes of the show Grey's Anatomy (where "popular" denotes a high rating). - -.Items to submit -==== -- SQL used to solve this problem. _(2 pts)_ -- Output from running SQL. _(1 pt)_ -==== - -=== Question 3 - -Make a dotchart in R showing the results of the previous question. -Hint: You can use your work from SQL, and export the results to a dataframe called `myDF` in R. Then you can use something like: - -[source,R] ----- -# use a dbGetQuery here, to import the SQL results to R, and then -myresults <- myDF$rating -names(myresults) <- myDF$primary_title -dotchart(myresults) ----- - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 4 - -Make a dotchart showing the total amount of money donated in each of the top 10 states, during the 2000 federal election cycle. - -.Items to submit -==== -- R used to solve this problem. 
_(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 5 - -Make a dotchart that shows how many movies premiered in each year. Now make another dotchart, which shows the same data (i.e., how many movies premiered each year) but only since the year 2000. - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 6 - -Among the three big New York City airports (`JFK`, `LGA`, `EWR`), which of these airports had the worst `DepDelay` (on average) in 2005? (Can you solve this with 1 line of R, using a `tapply` (rather than using 3 separate lines of R)? Hint: After you run the `tapply`, you can index your results using `[c("JFK", "LGA", "EWR")]` to lookup all 3 airports at once.) - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 7 - -`LIKE` is a very powerful tool. You can read about SQLite's version of `LIKE` https://www.w3resource.com/sqlite/core-functions-like.php[here]. Use `LIKE` to analyze the `primary_title` of all IMDB titles: First determine how many titles have `Batman` anywhere in the title, and then determine how many titles have `Superman` anywhere in the title? Which one occurs more often? - -.Items to submit -==== -- SQL used to solve this problem. _(2 pts)_ -- Output from running SQL. _(1 pt)_ -==== - -=== Question 8 - -How much money was donated during the 2000 federal election cycle by people who have `PURDUE` listed somewhere in their employer name? How much money was donated by people who have `MICROSOFT` listed somewhere in their employer name? Hint: You might use the `grep` or the `grepl` (which is a logical grep) to solve this one. - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 9 - -How much money was donated during the 2000 federal election cycle by people from your hometown? (Be sure to match the city and the state.) - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - -=== Question 10 - -Create your own interesting question based on the things you have learned this week. What insights can you find? - -.Items to submit -==== -- R used to solve this problem. _(2 pts)_ -- Output from running R. _(1 pt)_ -==== - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-introduction.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-introduction.adoc deleted file mode 100644 index 8493b7d9b..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-introduction.adoc +++ /dev/null @@ -1,137 +0,0 @@ -= Think Summer: Introduction -- 2023 - -== Submission - -Students need to submit the following file **by 10:00PM EST** through Gradescope inside Brightspace. - -. A Jupyter notebook (a `.ipynb` file). - -We've provided you with a template notebook for you to use. Please carefully read xref:summer2023/summer-2023-project-template.adoc[this section] to get started. - -[CAUTION] -==== -When you are finished with the project, please make sure to run every cell in the notebook prior to submitting. To do this click menu:Run[Run All Cells]. Next, to export your notebook (your `.ipynb` file), click on menu:File[Download], and download your `.ipynb` file. 
-==== - -== Project - -**Motivation:** SQL is an incredibly powerful tool that allows you to process and filter massive amounts of data -- amounts of data where tools like spreadsheets start to fail. You can perform SQL queries directly within the R environment, and doing so allows you to quickly perform ad-hoc analyses. - -**Context:** This project is specially designed for Purdue University's Think Summer program, and is coordinated by https://datamine.purdue.edu/[The Data Mine]. - -**Scope:** SQL, SQL in R - -.Learning Objectives -**** -- Demonstrate the ability to interact with popular database management systems within R. -- Solve data-driven problems using a combination of SQL and R. -- Use basic SQL commands: select, order by, limit, desc, asc, count, where, from. -- Perform grouping and aggregate data using group by and the following functions: count, max, sum, avg, like, having. -**** - -== Dataset - -The following questions will use the `imdb` database found in Anvil, our computing cluster. - -This database has 6 tables, namely: - -`akas`, `crew`, `episodes`, `people`, `ratings`, and `titles`. - -You have a variety of options to connect with, and run queries on our database: - -. Run SQL queries directly within a Jupyter Lab cell. -. Connect to and run queries from within R in a Jupyter Lab cell. -. From a terminal in Anvil. - -For consistency and simplicity, we will only cover how to do (1) and (2). - -First, for both (1) and (2) you must launch a new Jupyter Lab instance. To do so, please follow the instructions below. - -. Open a browser and navigate to https://ondemand.anvil.rcac.purdue.edu, and login using your ACCESS credentials. You should be presented with a screen similar to figure (1). -+ -image::figure08.webp[OnDemand, width=792, height=500, loading=lazy, title="OnDemand"] -+ -. Click on "My Interactive Sessions", and you should be presented with a screen similar to figure (2). -+ -image::figure09.webp[Your interactive Anvil sessions, width=792, height=500, loading=lazy, title="Your interactive Anvil sessions"] -+ -. Click on Jupyter Notebook in the left-hand menu **under "The Data Mine" section**. You should be presented with a screen similar to figure (3). Select the following settings: -+ -* Allocation: cis220051 -* Queue: shared -* Time in Hours: 3 -* Cores: 1 -* Use Jupyter Lab instead of Jupyter Notebook: Checked -+ -image::figure10.webp[Jupyter Lab settings, width=792, height=500, loading=lazy, title="Jupyter Lab settings"] -+ -. When satisfied, click btn:[Launch], and wait for a minute. In a few moments, you should be presented with a screen similar to figure (4). -+ -image::figure11.webp[Jupyter Lab ready to connect, width=792, height=500, loading=lazy, title="Jupyter Lab ready to connect"] -+ -. When you are ready, click btn:[Connect to Jupyter]. A new browser tab will launch and you will be presented with a screen similar to figure (5). -+ -image::figure12.webp[Kernel menu, width=792, height=500, loading=lazy, title="Kernel menu"] -+ -. Under the "Notebook" menu, please select the btn:[seminar] (look for the big "S"; we do not want btn:[seminar-r]). Finally, you will be presented with a screen similar to figure (6). -+ -image::figure13.webp[Ready Jupyter Lab notebook, width=792, height=500, loading=lazy, title="Ready-to-use Jupyter Lab notebook"] -+ -You now have a running Jupyter Lab notebook ready for you to use. This Jupyter Lab instance is running on the https://anvil.rcac.purdue.edu[Anvil cluster]. 
By using OnDemand, you've essentially carved out a small portion of the compute power to use. Congratulations! Now please follow along below depending on whether you'd like to do <<option-1,option (1)>> or <<option-2,option (2)>>. - -[#option-1] -To run queries directly in a Jupyter Lab cell (1), please do the following. - -. In the first cell, run the following code. This code establishes a connection to the `imdb.db` database, which allows you to directly run SQL queries in a cell as long as that cell has `%%sql` at the top of the cell. -+ -[source, ipynb] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- -+ -. After running that cell (for example, using kbd:[Ctrl+Enter]), you can directly run future queries in each cell by starting the cell with `%%sql` in the first line. For example. -+ -[source, sql] ----- -%%sql - -SELECT * FROM titles LIMIT 5; ----- -+ -While this method has its advantages, there are some advantages to having interop between R and SQL -- for example, you could quickly create cool graphics using data in the database and R. - -[#option-2] -To run queries from within R (2), please do the following. - -. You can directly run R code in any cell that starts with `%%R` in the first line. For example. -+ -[source,r] ----- -%%R - -my_vec <- c(1,2,3) -my_vec ----- -+ -Now, because we are able to run R code, we can connect to the database, make queries, and build plots, all in a single cell. For example. -+ -[source,r] ----- -%%R - -library(RSQLite) -library(ggplot2) - -conn <- dbConnect(RSQLite::SQLite(), "/anvil/projects/tdm/data/movies_and_tv/imdb.db") -myDF <- dbGetQuery(conn, "SELECT * FROM titles LIMIT 5;") - -ggplot(myDF) + - geom_point(aes(x=primary_title, y=runtime_minutes)) + - labs(x = 'Title', y= 'Minutes') ----- -+ -image::figure07.webp[R output, width=480, height=480, loading=lazy, title="R output"] - -[IMPORTANT] -It is perfectly acceptable to mix and match SQL cells and R cells in your project. - diff --git a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-template.adoc b/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-template.adoc deleted file mode 100644 index c9e16e73f..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2023/summer-2023-project-template.adoc +++ /dev/null @@ -1,100 +0,0 @@ -= Templates - -Our course project template can be found xref:attachment$think_summer_project_template.ipynb[here], or on Anvil: - -`/anvil/projects/tdm/etc/think_summer_project_template.ipynb` - -Students can use and modify this as a template as needed, for all project submissions. This template is a starting point for all projects. - -By default, the `seminar` kernel runs Python code. To run other types of code, see below. - -== Running `R` code using the `seminar` kernel - -[source,ipython] ----- -%%R - -my_vec <- c(1,2,3) -my_vec ----- - -As you can see, any cell that begins with `%%R` will run the R code in that cell. If a cell does not begin with `%%R`, it will be assumed that the code is Python code, and run accordingly. - -== Running SQL queries using the `seminar` kernel - -. First, you need to establish a connection with the database. If this is a sqlite database, you can use the following command. -+ -[source,ipython] ----- -%sql sqlite:///my_db.db -# or -%sql sqlite:////anvil/projects/tdm/data/path/to/my_db.db ----- -+ -Otherwise, if this is a mysql database, you can use the following command. 
-+ -[source,ipython] ----- -%sql mariadb+pymysql://username:password@my_url.com/my_database ----- -+ -. Next, to run SQL queries, in a new cell, run the following. -+ -[source,ipython] ----- -%%sql - -SELECT * FROM my_table; ----- - -As you can see, any cell that begins with `%%sql` will run the SQL query in that cell. If a cell does not begin with `%%sql`, it will be assumed that the code is Python code, and run accordingly. - -== Running `bash` code using the `seminar` kernel - -To run `bash` code, in a new cell, run the following. - -[source,bash] ----- -%%bash - -ls -la ----- - -As you can see, any cell that begins with `%%bash` will run the `bash` code in that cell. If a cell does not begin with `%%bash`, it will be assumed that the code is Python code, and run accordingly. - -[TIP] -==== -Code cells that start with `%` or `%%` are sometimes referred to as magic cells. To see a list of available magics, run `%lsmagic` in a cell. - -The commands listed in the "line" section are run with a single `%` and can be mixed with other code. For example, the following cell contains (in order) some Python code, uses a single line magic, followed by some more Python code. - -[source,ipython] ----- -import pandas as pd - -%time myDF = pd.read_parquet("/anvil/projects/tdm/data/whin/weather.parquet") - -myDF.head() ----- - -The commands listed in the "cell" section are run with a double `%%` and apply to the entire cell, rather than just a single line. For example, `%%bash` is an example of a cell magic. - -You can read more about some of the available magics in the https://ipython.readthedocs.io/en/stable/interactive/magics.html#[official documentation]. -==== - -== Including an image in your notebook - -To include an image in your notebook, use the following Python code. - -[source,python] ----- -from IPython import display -display.Image("./cloud.png") ----- - -Here, `./cloud.png` is the path to the image you would like to include. - -[IMPORTANT] -==== -If you choose to include an image using a Markdown cell, and the `![](...)` syntax, please note that while the notebook will render properly in our https://ondemand.anvil.rcac.purdue.edu environment, it will _not_ load properly in any other environment where that image is not available. For this reason it is critical to include images using the method shown here. -==== diff --git a/projects-appendix/modules/ROOT/pages/summer2024/old-summer-2024-project-01-backup.adoc b/projects-appendix/modules/ROOT/pages/summer2024/old-summer-2024-project-01-backup.adoc deleted file mode 100644 index 26415e652..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/old-summer-2024-project-01-backup.adoc +++ /dev/null @@ -1,125 +0,0 @@ -= Think Summer: Project 1 -- 2024 - - -== In what year was Nora Ephron born? How about Carrie Fisher? How about your favorite actor or actress? - -Nora Ephron was born in 1941. - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE person_id = 'nm0001188'; ----- - -Carrie Fisher was born in 1956. - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE person_id = 'nm0000402'; ----- - -Jennifer Aniston was born in 1969. - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE person_id = 'nm0000098'; ----- - - -== In which year was the movie The Matrix created? (Hint: you might want to limit your results to type "movie".) 
- -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE type = "movie" AND original_title = "The Matrix" LIMIT 5; ----- - - -== During the years 2000 to 2020, how many people (from the people table) died in each year? - -[source,sql] ----- -%%sql -SELECT COUNT(*), died FROM people WHERE died >= 2000 AND died <= 2020 GROUP BY died; ----- - - -== How many episodes did the show Gilmore Girls have? Pick another TV show; how many episodes did this show have? - -Gilmore Girls has 154 episodes. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM episodes WHERE show_title_id = "tt0238784"; ----- - -Friends has 235 episodes. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM episodes WHERE show_title_id = 'tt0108778'; ----- - -Downton Abbey has 52 episodes. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM episodes WHERE show_title_id = 'tt1606375'; ----- - - - - -== How many crews was Nora Ephron on? How many times was she a director? A writer? A cast member (playing herself)? Pick your own favorite director, actor, or actress: How many crews include that person? - -Nora Ephron has been a director 8 times, a writer 10 times, and has portrayed herself 53 times. - -[source,sql] ----- -%%sql -SELECT category, COUNT(*) FROM crew WHERE person_id = "nm0001188" GROUP BY category; ----- - -Tobey Maguire is a member of 154 crews. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM crew WHERE person_id = 'nm0001497'; ----- - -Jennifer Aniston is a member of 992 crews. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM crew WHERE person_id = 'nm0000098'; ----- - - - - - -== How many titles have Romance as one of the genres? In which year did the most titles appear, with Romance as one of the genres? - -There are 722613 titles with Romance as one of the genres. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM titles WHERE genres LIKE "%Romance%" LIMIT 5; ----- - -There were 26749 such titles in 2014. - -[source,sql] ----- -%%sql -SELECT premiered, COUNT(*) FROM titles WHERE genres LIKE "%Romance%" GROUP BY premiered HAVING COUNT(*) > 25000; ----- - diff --git a/projects-appendix/modules/ROOT/pages/summer2024/old-summer-2024-project-01.adoc b/projects-appendix/modules/ROOT/pages/summer2024/old-summer-2024-project-01.adoc deleted file mode 100644 index 7498c251b..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/old-summer-2024-project-01.adoc +++ /dev/null @@ -1,20 +0,0 @@ -= Think Summer: Project 1 -- 2024 - - -== In what year was Nora Ephron born? How about Carrie Fisher? How about your favorite actor or actress? - - -== In which year was the movie The Matrix created? (Hint: you might want to limit your results to type "movie".) - - -== During the years 2000 to 2020, how many people (from the people table) died in each year? - - -== How many episodes did the show Gilmore Girls have? Pick another TV show; how many episodes did this show have? - - -== How many crews was Nora Ephron on? How many times was she a director? A writer? A cast member (playing herself)? Pick your own favorite director, actor, or actress: How many crews include that person? - - -== How many titles have Romance as one of the genres? In which year did the most titles appear, with Romance as one of the genres? 
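For students who prefer working in R, this last question can also be explored using the same `RSQLite` connection pattern shown elsewhere in these pages. The sketch below is only one possible approach (the data frame name `romanceDF` is just an illustration), and it assumes the same `imdb.db` database path used throughout these projects.

[source,R]
----
%%R

library(RSQLite)

# Count how many titles with Romance among the genres premiered in each year,
# and show the five years with the largest counts.
conn <- dbConnect(RSQLite::SQLite(), "/anvil/projects/tdm/data/movies_and_tv/imdb.db")
romanceDF <- dbGetQuery(conn, "SELECT premiered, COUNT(*) AS n FROM titles
                               WHERE genres LIKE '%Romance%'
                               GROUP BY premiered
                               ORDER BY n DESC LIMIT 5;")
romanceDF
----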
- diff --git a/projects-appendix/modules/ROOT/pages/summer2024/old-summer-2024-project-02.adoc b/projects-appendix/modules/ROOT/pages/summer2024/old-summer-2024-project-02.adoc deleted file mode 100644 index 1fbcc126e..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/old-summer-2024-project-02.adoc +++ /dev/null @@ -1,24 +0,0 @@ -= Think Summer: Project 2 -- 2024 - -== How many crews include Meryl Streep in the role of `actress`? In how many crews does she have the role of `self`? - -== In which years did more than 300000 titles have their premieres? Hint: Use the `titles` table, and `GROUP BY` the year `premiered` and use the condition `HAVING COUNT(*) > 300000` at the end of the query. - -== How many `tvSeries` have rating 9 or higher and have at least 1 million votes? - -== For how many movies has Julia Roberts been on the `crew`? - -== What are the 4 most popular episodes of the Sopranos? Please include the title of each episode. Please verify your answer by double-checking with IMDB. - -(By "popularity", you can choose to either analyze the ratings or the number of votes; either way is OK with us! Just be sure to explain what you did in your solution.) - -Hint: When you join the `episodes` table and the `ratings` table, you might want to add the condition `e.episode_title_id = r.title_id` so that you can get the titles of the episodes. - -== Revisiting the question about Meryl Streep: What are the titles in which Meryl Streep had the role of `actress` and the rating was 8.5 or higher? - -== Revisiting the question about the `tvSeries` that have rating 9 or higher and have at least 1 million votes: please find the `primary_title` of these `tvSeries`. - -== Revisiting the question about how many movies has Julia Roberts been on the `crew`: Which movie(s) is/are her most popular? - -(By "popularity", you can choose to either analyze the ratings or the number of votes; either way is OK with us! Again, just be sure to explain what you did in your solution.) - diff --git a/projects-appendix/modules/ROOT/pages/summer2024/old-summer-2024-project-03.adoc b/projects-appendix/modules/ROOT/pages/summer2024/old-summer-2024-project-03.adoc deleted file mode 100644 index 19059ef13..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/old-summer-2024-project-03.adoc +++ /dev/null @@ -1,43 +0,0 @@ -= Think Summer: Project 3 -- 2024 - -=== Question 1 - -Consider only the flights that arrive to Indianapolis (airport code `IND`), i.e., for which Indianapolis is the destination. What are the 10 most popular origin airports? - -=== Question 2 - -Each airplane has a unique `TailNum`. Which airplane flew the most flights in 2005? Hint: If you strictly look at the `TailNum` values, you will need to ignore the top two results, because they are missing data. - -=== Question 3 - -What is the Distance (in miles) of any individual flight from Seattle (`SEA`) to Los Angeles (`LAX`)? Hint: You might paste together the origin and destination airports of the flights. - -=== Question 4 - -How many flights departed from Seattle (`SEA`) and landed in Los Angeles (`LAX`)? - -=== Question 5 - -Which airplane flew the most times from Seattle to Los Angeles? Hint: Again, if you strictly look at the `TailNum` values, you will need to ignore the top result, because it has missing data. - -=== Question 6 - -How many flights occurred on Independence Day in 2005? Hint: You might paste together the month and day of the flights. 
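If the idea of pasting the month and the day together is new, here is a minimal sketch of that technique, demonstrated on Christmas Day rather than on the day asked about in the question. It assumes the 2005 flight data has already been read into a data frame called `myDF`, as in the day 3 notes, and the vector name `monthday` is just an illustration.

[source,R]
----
%%R

# Paste the month and the day of the month together for every flight,
# and then look up how many flights occurred on December 25.
monthday <- paste(myDF$Month, myDF$DayofMonth, sep="-")
table(monthday)["12-25"]
----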
- -=== Question 7 - -In which month are the average departure delays the worst? Hint: You might use a tapply function. - -=== Question 8 - -Make a `dotchart` that illustrates the data from the previous question (about the average flight delays in each month). - -=== Question 9 - -Which airline carrier had the most flights from Cincinnati to Atlanta in 2005? (The `UniqueCarrier` is the airline carrier for the flight.) - -=== Question 10 - -Create your own interesting question about the 2005 flight data. What insights can you find? - - diff --git a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-account-setup.adoc b/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-account-setup.adoc deleted file mode 100644 index 30c35f092..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-account-setup.adoc +++ /dev/null @@ -1,56 +0,0 @@ -= Purdue Summer College for High School Students: Setting Up Accounts -- 2024 - -== Creating an ACCESS account - -Students need to create an account on the https://www.rcac.purdue.edu/compute/anvil[Anvil high performance computing cluster]. - -The starting point for setting up an account is https://identity.access-ci.org/new-user - -This video will walk you through the required steps to create your account: - -++++ -<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_0ejtddfn&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_aheik41m" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="TDM 10100 Project 13 Question 1"></iframe> -++++ - -== Share your username with The Data Mine team - -Our team will not know which username you were assigned, unless you tell us. 
Please put your name, the email address that you used to setup your ACCESS account, and the ACCESS username that you were just assigned, at this URL: https://purdue.ca1.qualtrics.com/jfe/form/SV_23G64aAAKNshTrE - -This video will walk you through the required steps to share your name, email address, and assigned username: - -++++ -<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_sdshw2u3&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_aheik41m" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="TDM 10100 Project 13 Question 1"></iframe> -++++ - -== Setup two-factor authentication for logging in - -To setup the two-factor authentication on your phone, log in to the ACCESS portal for the first time, using the username that you were just assigned, and the password that you selected, at this URL: https://ondemand.anvil.rcac.purdue.edu - -This video will walk you through the required steps to setup your two-factor authentication: - -++++ -<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_vgi5ms92&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_aheik41m" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="TDM 10100 Project 13 Question 1"></iframe> -++++ - -== Please bring a laptop to campus for the Purdue Summer College for High School Students - -The sessions at Purdue will be hands-on and immersive. Students will benefit from having a laptop to type on, rather than (say) an Apple iPad or Galaxy Tablet. The laptop does not need to be new or fancy. The laptop simply enables students to login to the high performance computing cluster that students will use throughout the sessions. - -Thanks for considering! 
- -Our team at The Data Mine looks forward to working with the students again this summer! - -Any questions? Please write to: mdw@purdue.edu - -Mark Daniel Ward, Ph.D. + -Professor of Statistics and + -(by courtesy) of Agricultural & Biological Engineering, + -Computer Science, Mathematics, and Public Health; + -Executive Director of The Data Mine + -Purdue University + -mdw@purdue.edu + - - - - - diff --git a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-day1-notes.adoc b/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-day1-notes.adoc deleted file mode 100644 index 62f4f2d1f..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-day1-notes.adoc +++ /dev/null @@ -1,279 +0,0 @@ -= Think Summer: Day 1 Notes -- 2024 - -== Loading the database - -[source,sql] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - -== Extracting a few rows from the each of the 6 tables - -[source,sql] ----- -%%sql -SELECT * FROM titles LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM episodes LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM people LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM ratings LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM crew LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM akas LIMIT 5; ----- - -== We can see how many rows were in each table, as follows: - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM titles LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM episodes LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM people LIMIT 5; ----- - -== We can also start to investigate individual people, for instance: - -We can look Jack Black up, by his username: - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE person_id = 'nm0085312' LIMIT 5; ----- - -Or we can lookup people by their name directly: - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE name = 'Jack Black' LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE name = 'Ryan Reynolds' LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE name = 'Hayden Christensen' LIMIT 5; ----- - -== Community is a show that ran from 2009 to 2015 - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE title_id = 'tt1439629' LIMIT 5; ----- - -== Friends is one of Dr Ward's favorite shows. We can find it here: - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE title_id = 'tt0108778' LIMIT 5; ----- - -or like this: - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE (primary_title = 'Friends') AND (premiered > 1992) LIMIT 5; ----- - -These are the episodes from Friends: - -[source,sql] ----- -%%sql -SELECT * FROM episodes WHERE show_title_id = 'tt0108778' LIMIT 5; ----- - -and one particular episode is called "The One Where Chandler Doesn't Like Dogs" - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE title_id = 'tt0583431' LIMIT 5; ----- - -That episode is in season 7, episode 8: - -[source,sql] ----- -%%sql -SELECT * FROM episodes WHERE episode_title_id = 'tt0583431' LIMIT 5; ----- - -Here is the breakdown of the number of episodes per season: - -[source,sql] ----- -%%sql -SELECT COUNT(*), season_number FROM episodes -WHERE show_title_id = 'tt0108778' GROUP BY season_number LIMIT 15; ----- - - -== Tobey Maguire was born in 1975 - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE person_id = 'nm0001497' LIMIT 5; ----- - -== There are a total of 8064259 titles in the titles table. 
- -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM titles LIMIT 5; ----- - -== These are the first 5 people in the people table. - -[source,sql] ----- -%%sql -SELECT * FROM people LIMIT 5; ----- - -== These are the first 5 episodes in the episodes table. - -[source,sql] ----- -%%sql -SELECT * FROM episodes LIMIT 5; ----- - -== These are the first 5 people in the crew table. - -[source,sql] ----- -%%sql -SELECT * FROM crew LIMIT 5; ----- - -== Only 3 movies have more than 2 million ratings - -[source,sql] ----- -%%sql -SELECT * FROM ratings WHERE votes > 2000000 LIMIT 5; ----- - -== Let's find how many people were born in each year (after 1850). - -[source,sql] ----- -%%sql -SELECT COUNT(*), born FROM people WHERE born > 1850 -GROUP BY born LIMIT 200; ----- - -== There are 487731 titles with rating 7.4 or higher. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM ratings WHERE rating >= 7.4 LIMIT 5; ----- - - -== The Family Guy has 374 episodes. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM episodes WHERE show_title_id = 'tt0182576' LIMIT 5; ----- - -== These are five of the films where George Lucas was on the crew. - -[source,sql] ----- -%%sql -SELECT * FROM crew WHERE person_id = 'nm0000184' LIMIT 5; ----- - -These are the number of times that he played each role in the crew: - -[source,sql] ----- -%%sql -SELECT COUNT(*), category FROM crew WHERE person_id = 'nm0000184' GROUP BY category LIMIT 50; ----- - - -== We can investigate how many titles premiered in each year, by grouping things together according to the year that the title premiered, and by ordering the results according to the year that the title premiered. The "desc" specifies that we want the results in descending order, i.e., with the largest result first (where "largest" means the "last year", because we are ordering by the years). - -[source,sql] ----- -%%sql -SELECT COUNT(*), premiered FROM titles -GROUP BY premiered ORDER BY premiered DESC LIMIT 20; ----- - -== The Family Guy premiered in 1999 and ended in 2022. - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE title_id = 'tt0182576' LIMIT 5; ----- - -== If you want to find the first five Comedies, you can find the ones where the genres are like Comedy, possibly with some other characters before and after: - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE genres LIKE '%Comedy%' LIMIT 5; ----- - -Similarly, you can find actors and actresses with Audrey in their name: - -[source,sql] ----- -%%sql -SELECT * FROM people WHERE name LIKE '%Audrey%' LIMIT 5; ----- - diff --git a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-day2-notes.adoc b/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-day2-notes.adoc deleted file mode 100644 index decf9ab37..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-day2-notes.adoc +++ /dev/null @@ -1,414 +0,0 @@ -= Think Summer: Day 2 Notes -- 2024 - -== Loading the database - -[source,sql] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- - - -== Find and print the `title_id`, `rating`, and number of votes (`votes`) for all movies that received at least 2 million votes. -In a second query (and new cell), use the information you found in the previous query to identify the `primary_title` of these movies. 
- -These are the movies with at least 2 million votes: - -[source,sql] ----- -%%sql -SELECT * FROM ratings WHERE votes >= 2000000 LIMIT 5; ----- - -and then we can lookup their titles: - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE title_id = 'tt0111161' OR title_id = 'tt0468569' OR title_id = 'tt1375666' LIMIT 5; ----- - -Later today, we will learn an easier way to find the titles of the movies, by learning how to `JOIN` the information in two or more tables. - - - -== How many actors have lived to be more than 115 years old? Find the names, birth years, and death years for all actors and actresses who lived more than 115 years. - -We use the condition that `died-born` is bigger than 115 - -[source,sql] ----- -%%sql -SELECT *, died-born FROM people WHERE died-born > 115 LIMIT 10; ----- - -Now we can use the `COUNT` function to see that there are 7 such actors who lived more than 115 years. - -[source,sql] ----- -%%sql -SELECT COUNT(died-born) FROM people WHERE died-born > 115 LIMIT 5; ----- - - -== Use the `ratings` table to discover how many films have a rating of at least 8 and at least 50000 votes. In a separate cell, show 15 rows with this property. - -We can use conditions to ensure that rating and votes are large enough, -and then we can display 15 such results. - -[source,sql] ----- -%%sql -SELECT * FROM ratings WHERE (rating >= 8) AND (votes >= 50000) LIMIT 15; ----- - -Then we can use the `COUNT` function to see that there are 670 such titles altogether. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM ratings WHERE (rating >= 8) AND (votes >= 50000) LIMIT 15; ----- - - - - -== Find the `primary_title` of every _movie_ that is over 2 hours long or that premiered after 1990. Order the result from newest premiered year to oldest, and limit the output to 15 movies. Make sure `premiered` and `runtime_minutes` are not `NULL`. After displaying these 15 movies, run the query again in a second cell, but this time only display the number of such movies. - -We just add the conditions to the query about the titles table. - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE (type == 'movie') AND (runtime_minutes IS NOT NULL) AND (premiered IS NOT NULL) AND ((runtime_minutes > 120) OR (premiered > 1990)) ORDER BY premiered DESC LIMIT 15; ----- - -Now we can find the total number of such movies, using the `COUNT`: - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM titles WHERE (type == 'movie') AND (runtime_minutes IS NOT NULL) AND (premiered IS NOT NULL) AND ((runtime_minutes > 120) OR (premiered > 1990)) ORDER BY premiered DESC LIMIT 15; ----- - -This can be a helpful time to mention the concept of https://stackoverflow.com/questions/45231487/order-of-operation-for-and-and-or-in-sql-server-queries[order of operations] - -== What movie has the longest primary title? Answer this question using just SQL. - -You can read more about https://www.w3resource.com/sqlite/core-functions-length.php[SQLite length] - -We can use the `length` function, as follows: - -[source,sql] ----- -%%sql -SELECT *, length(primary_title) FROM titles ORDER BY length(primary_title) DESC LIMIT 5; ----- - -== What actor has the longest name? Answer this question using just SQL. - -[source,sql] ----- -%%sql -SELECT *, length(name) FROM people ORDER BY length(name) DESC LIMIT 5; ----- - - - - - -== Avoiding `NULL` values, and making calculations within our SQL queries - -We can start by loading the `titles` table. 
- -[source,sql] ----- -%%sql -SELECT * FROM titles LIMIT 5; ----- - -and then making sure that we avoid rows in which `premiered` is `NULL` and the rows in which `ended` is `NULL`. - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE (premiered IS NOT NULL) - AND (ended IS NOT NULL) LIMIT 5; ----- - -Then we can calculate the difference between the year that the show `ended` and the year that the show `premiered`. - -[source,sql] ----- -%%sql -SELECT *, ended-premiered FROM titles WHERE (premiered IS NOT NULL) - AND (ended IS NOT NULL) LIMIT 5; ----- - -We can given this new variable a name. For instance, we might use `mylength` to refer to the show's run on TV (in years). Then we can order the results by `mylength` in years, given in `DESC` (descending) order. - -[source,sql] ----- -%%sql -SELECT *, ended-premiered AS mylength FROM titles WHERE (premiered IS NOT NULL) - AND (ended IS NOT NULL) ORDER BY mylength DESC LIMIT 5; ----- - -For instance, this allows us to see that the show `Allen and Kendal` was running from 1940 to 2015, for a total of 75 years. - -== How long was Friends on TV? - -We can use the query above as a starting point, just looking up `Friends` as the title, and seeing which shows with that title were on TV after 1993. We see that `Friends` was on TV for 10 years. - -[source,sql] ----- -%%sql -SELECT *, ended-premiered AS mylength FROM titles -WHERE (premiered IS NOT NULL) AND (ended IS NOT NULL) -AND (primary_title = 'Friends') AND (premiered > 1993) LIMIT 5; ----- - -== How many types of titles are there? - -Here are a few of the types of titles - -[source,sql] ----- -%%sql -SELECT type FROM titles LIMIT 5; ----- - -There are lots of repeats, so we ask for `DISTINCT` types, i.e., removing the repetitions. - -[source,sql] ----- -%%sql -SELECT DISTINCT type FROM titles LIMIT 5; ----- - -and now we can ask for a few more, i.e., we can increase the limit. - -[source,sql] ----- -%%sql -SELECT DISTINCT type FROM titles LIMIT 100; ----- - -Looks like there are 12 types altogether: `short`, `movie`, `tvShort`, `tvMovie`, `tvSeries`, `tvEpisode`, `tvMiniSeries`, `tvSpecial`, `video`, `videoGame` `radioSeries`, `radioEpisode` - -[source,sql] ----- -%%sql -SELECT COUNT(DISTINCT type) FROM titles LIMIT 100; ----- - -== How many times did each type occur? - -We can group the types and count each of them. For instance, there are 5897385 tvEpisodes and there are 581731 movies. - -[source,sql] ----- -%%sql -SELECT COUNT(*), type FROM titles GROUP BY type LIMIT 100; ----- - -== How many times did each genre occur? - -At first, we view the genres as tuples, for instance, `Action,Adult` is a genre (separated by commas). We can do this the same as we did above, just changing the variable type to the variable genres. - -[source,sql] ----- -%%sql -SELECT COUNT(*), genres FROM titles GROUP BY genres LIMIT 100; ----- - -Now we see that there are 2283 such genres: - -[source,sql] ----- -%%sql -SELECT COUNT(DISTINCT genres) FROM titles LIMIT 5; ----- - -[TIP] -==== -We will come back to the question above, about the total number of genres, when we learn how to import SQL queries into R dataframes. -==== - - -== How many times has The Awakening been used as a title? - -The Awakening has been used 131 times as a title - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM titles WHERE primary_title = 'The Awakening' LIMIT 5; ----- - - - - - -== Now we can learn about how to `JOIN` the results of queries from two or more tables. 
Using a `JOIN` is a powerful way to leverage lots of information from a database, but it takes a little time to set things up properly. First, we revisit a question from yesterday, about the movies that received at least 2 million votes. We want to find the titles of those movies. - -We will need the `titles` table and the `ratings` table. - -[source,sql] ----- -%%sql -SELECT * FROM titles LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM ratings LIMIT 5; ----- - -Now we join these two tables, and restrict the results to those movies with at least 2000000 votes. - -[source,sql] ----- -%%sql -SELECT * FROM titles AS t JOIN ratings AS r -ON t.title_id = r.title_id WHERE votes > 2000000 LIMIT 5; ----- - -== What was the most popular movie (highest rating) in the year your Mom or Dad or aunt, etc., was born? - -The most popular movie that premiered in 1940 was The Great Dictator, with a rating of 8.4. It is a Charlie Chaplin movie that criticizes the dictators of the time, who were becoming very powerful in Europe. - -[source,sql] ----- -%%sql -SELECT * FROM titles AS t JOIN ratings AS r ON t.title_id = r.title_id - WHERE (t.premiered = 1940) AND (t.type = 'movie') ORDER BY r.rating DESC LIMIT 5; ----- - - - - -== How many episodes of Friends were there? - -We start by finding the `title_id` for Friends. - -[source,sql] ----- -%%sql -SELECT * FROM titles WHERE (primary_title = 'Friends') AND (premiered > 1992) LIMIT 5; ----- - -So now we know that `tt0108778` is the `show_title_id` for Friends. - -Now we find the number of episodes per season. To do this, we first find the episodes for Friends. - -[source,sql] ----- -%%sql -SELECT * FROM episodes WHERE show_title_id = 'tt0108778' LIMIT 5; ----- - -and then we group them by `season_number`, to make sure that our results make sense. - -[source,sql] ----- -%%sql -SELECT COUNT(*), season_number FROM episodes WHERE show_title_id = 'tt0108778' GROUP BY season_number; ----- - -Season 10 differs from what I expected (I was guessing that there would be 18 episodes), so I checked further on this. - -[source,sql] ----- -%%sql -SELECT * FROM episodes AS e JOIN titles AS t ON e.episode_title_id = t.title_id WHERE show_title_id = 'tt0108778' AND season_number = 10 ORDER BY episode_number; ----- - -OK so they combined The Last One, which is two episodes, into just one listing. - -So there are 235 episodes listed, although there were actually 236 episodes in the show altogether! - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM episodes WHERE show_title_id = 'tt0108778'; ----- - - - - -== Who are the actors and actresses in the TV show Friends? - -We will need the `people` table and the `crew` table. - -[source,sql] ----- -%%sql -SELECT * FROM people LIMIT 5; ----- - -[source,sql] ----- -%%sql -SELECT * FROM crew LIMIT 5; ----- - -Now we join these two tables together. - -[source,sql] ----- -%%sql -SELECT * FROM crew AS c JOIN people AS p ON c.person_id = p.person_id LIMIT 5; ----- - -and now we also join with the `titles` table, and we focus on the `title_id` for Friends, which is `tt0108778`. There are 10 people listed, from the Friends TV show. 
- -[source,sql] ----- -%%sql -SELECT * FROM titles AS t JOIN crew AS c ON t.title_id = c.title_id -JOIN people AS p ON c.person_id = p.person_id -WHERE t.title_id = 'tt0108778' LIMIT 50; ----- - -and 8 of them are actors or actresses - -[source,sql] ----- -%%sql -SELECT * FROM titles AS t JOIN crew AS c ON t.title_id = c.title_id -JOIN people AS p ON c.person_id = p.person_id -WHERE (t.title_id = 'tt0108778') -AND ((c.category = 'actress') OR (c.category = 'actor')) LIMIT 50; ----- - -== How many movies has Emma Watson appeared in? - -She has appeared in a total of 18 movies. - -[source,sql] ----- -%%sql -SELECT COUNT(*) FROM titles AS t JOIN crew AS c ON t.title_id = c.title_id - JOIN people AS p ON c.person_id = p.person_id - WHERE (p.name = 'Emma Watson') AND (t.type = 'movie'); ----- - - - -== James Caan died in 2022. You can read his https://en.wikipedia.org/wiki/James_Caan[Wikipedia page] or his https://www.imdb.com/name/nm0001001/[IMDB page]. What was his highest rated movie? - -He appeared in The Godfather, which has a rating of 9.2 - -[source,sql] ----- -%%sql -SELECT * FROM titles AS t JOIN crew AS c ON t.title_id = c.title_id - JOIN people AS p ON c.person_id = p.person_id - JOIN ratings AS r ON t.title_id = r.title_id - WHERE (p.name = 'James Caan') AND (t.type = 'movie') ORDER BY r.rating DESC LIMIT 5; ----- - -== We also have xref:programming-languages:SQL:index.adoc[some additional information about SQL] posted in our book pages. - - diff --git a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-day3-notes.adoc b/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-day3-notes.adoc deleted file mode 100644 index aa96118bb..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-day3-notes.adoc +++ /dev/null @@ -1,486 +0,0 @@ -= Think Summer: Day 3 Notes -- 2024 - -Loading the R `data.table` library and loading the data from the 2005 airline data set - -[source,R] ----- -%%R -library(data.table) ----- - -[source,R] ----- -%%R -myDF <- fread("/anvil/projects/tdm/data/flights/subset/2005.csv") ----- - -A dataframe in R, by the way, is a lot like an Excel spreadsheet or a SQL table. A dataframe has columns of data, with one type of data in each column. The columns are usually long. In other words, there are usually many rows in the dataframe. - -These are the first few lines of the 2005 airline data set - -[source,R] ----- -%%R -head(myDF) ----- - -There are 7 million rows and 29 columns - -[source,R] ----- -%%R -dim(myDF) ----- - -The first few flights are departing from Boston or O'Hare - -[source,R] ----- -%%R -head(myDF$Origin) ----- - -The first few flights are arriving at Boston or O'Hare - -[source,R] ----- -%%R -head(myDF$Dest) ----- - -The last few flights are departing and arriving as follows: - -[source,R] ----- -%%R -tail(myDF$Origin) ----- - -[source,R] ----- -%%R -tail(myDF$Dest) ----- - -We can use `n=50` to get the destinations of the first 50 flights and the destinations of the last 50 flights. - -[source,R] ----- -%%R -head(myDF$Dest, n=50) ----- - -[source,R] ----- -%%R -tail(myDF$Dest, n=50) ----- - -If we find out how many times an airplane departed from each airport, we get these counts: - -[source,R] ----- -%%R -head(table(myDF$Origin), n=10) ----- - -Now we can sort those counts, in descending order (i.e., with the largest ones given first), and display the largest such 10 counts. 
- -[source,R] ----- -%%R -head(sort(table(myDF$Origin), decreasing=T), n=10) ----- - -Now we can display how many flights departed from each of the 10 most popular airports. - -[source,R] ----- -%%R -dotchart(head(sort(table(myDF$Origin), decreasing=T), n=10)) ----- - -We can extract the number of flights from specific airports, by looking the data up by the airports as indices. Note that we are only selecting from the 10 most popular airports here. - -[source,R] ----- -%%R -head(sort(table(myDF$Origin), decreasing=T), n=10)[c("ATL","CVG","ORD")] ----- - -Here is another example, in which we extract the number of flights from airports which may or may not be among the most popular 10 airports. - -[source,R] ----- -%%R -sort(table(myDF$Origin), decreasing=T)[c("EWR","IND","JFK","ORD")] ----- - -We can paste together the first 300 origin airports and the first 300 destination airports. - -[source,R] ----- -%%R -paste(head(myDF$Origin, n=300), head(myDF$Dest, n=300), sep="-") ----- - -Then we can tabulate how many times each such flight path was flown. - -[source,R] ----- -%%R -table(paste(head(myDF$Origin, n=300), head(myDF$Dest, n=300), sep="-")) ----- - -Now that this works, we can remove the heads on each of those data sets. Then we can tabulate the number of times that every flight path was used, and sort those results, and finally we can display the 100 most popular flight paths overall. - -[source,R] ----- -%%R -head(sort(table(paste(myDF$Origin, myDF$Dest, sep="-")), decreasing=T), n=100) ----- - -Now we can take these results and make a dotchart: - -[source,R] ----- -%%R -dotchart(head(sort(table(paste(myDF$Origin, myDF$Dest, sep="-")), decreasing=T), n=10)) ----- - -or alternatively: - -[source,R] ----- -%%R -dotchart(tail(sort(table(paste(myDF$Origin, myDF$Dest, sep="-"))), n=10)) ----- - -We can extract the number of specific flights as well, for instance: - -[source,R] ----- -%%R -tail(sort(table(paste(myDF$Origin, myDF$Dest, sep="-"))), n=20)["LAX-PHX"] ----- - -[source,R] ----- -%%R -tail(sort(table(paste(myDF$Origin, myDF$Dest, sep="-"))), n=20)["SAN-LAX"] ----- - -In the example below, if we are still using the `tail` then we will not see the `IND-ORD` results, because that flight path is not among the most popular 20 flight paths. So we had to (instead) look at all of the flight paths. Note that, in such a case, we want to keep an index at the end (otherwise, we will be asking to see the tens of thousands of flight paths all over the whole country). - -[source,R] ----- -%%R -sort(table(paste(myDF$Origin, myDF$Dest, sep="-")))[c("SAN-LAX", "LAX-PHX", "IND-ORD")] ----- - -When we use the `table` function in R, in the result, we have a row of names followed by a row of data. Then we have another row of names followed by a row of data, etc., etc. R always displays data from a table in this way, namely, by alternating a row of names and a row of data. You can think about how things would look different (and easier) if your screen was really, really wide, and there were only two rows displayed, namely, the names and the data. - -These are the airline carriers for the first 6 flights. 
- -[source,R] ----- -%%R -head(myDF$UniqueCarrier) ----- - -We can see how many flights were flown with each carrier, in either increasing or decreasing order: - -[source,R] ----- -%%R -sort(table(myDF$UniqueCarrier), decreasing=T) ----- - -[source,R] ----- -%%R -sort(table(myDF$UniqueCarrier), decreasing=F) ----- - -The first few departure delays are: - -[source,R] ----- -%%R -head(myDF$DepDelay, n=100) ----- - -The overall average departure delay, across all flights, is 8.67 minutes: - -[source,R] ----- -%%R -mean(myDF$DepDelay, na.rm=T) ----- - -We can just restrict attention to the average departure delay for flights departing from `IND` or from `JFK`. - -[source,R] ----- -%%R -mean(myDF$DepDelay[myDF$Origin=="IND"], na.rm=T) ----- - -[source,R] ----- -%%R -mean(myDF$DepDelay[myDF$Origin=="JFK"], na.rm=T) ----- - -These are the first 100 departure delays for flights from Indianapolis to Chicago. - -[source,R] ----- -%%R -head(myDF$DepDelay[(myDF$Origin=="IND") & (myDF$Dest=="ORD")], n=100) ----- - -and the mean (or average) of those departure delays are: - -[source,R] ----- -%%R -mean(myDF$DepDelay[(myDF$Origin=="IND") & (myDF$Dest=="ORD")]) ----- - -The average departure delay on Christmas Day is: - -[source,R] ----- -%%R -mean( myDF$DepDelay[myDF$Year == 2005 & myDF$Month == 12 & myDF$DayofMonth == 25], na.rm=T) ----- - - -The first 6 department delays for flights from Boston or flights from Indianapolis are: - -[source,R] ----- -%%R -head(myDF$DepDelay[myDF$Origin == "BOS"]) ----- - -[source,R] ----- -%%R -head(myDF$DepDelay[myDF$Origin == "IND"]) ----- - -We could make a table of departure delays for flights from Indianapolis: - -[source,R] ----- -%%R -table(myDF$DepDelay[myDF$Origin == "IND"]) ----- - -and we can plot the distribution of departure delays: - -[source,R] ----- -%%R -plot(table(myDF$DepDelay[myDF$Origin == "IND"])) ----- - -and we can add conditions to this. For instance, if we only want to see the distribution of delays that are less than 1 hour: - -[source,R] ----- -%%R -plot(table(myDF$DepDelay[(myDF$Origin == "IND") & (myDF$DepDelay < 60)])) ----- - -If we look at the raw data (without plotting it), we need to be mindful that the output shows two rows: the departure delay, and then (below) that how many flights with that delay. - -[source,R] ----- -%%R -table(myDF$DepDelay[(myDF$Origin == "IND") & (myDF$DepDelay < 60)]) ----- - - - - -Now we switch gears and load the donation data from federal election campaigns in 2000. 
This data is described here: -https://www.fec.gov/campaign-finance-data/contributions-individuals-file-description/[Contributions by individuals file description] - - -[source,R] ---- -%%R -library(data.table) ---- - -[source,R] ---- -%%R -myDF <- fread("/anvil/projects/tdm/data/election/itcont2000.txt", quote="") ---- - -We need to provide the names of the columns in the data frame: - -[source,R] ---- -%%R -names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID") ---- - - -The first several rows of election data are: - -[source,R] ---- -%%R -head(myDF) ---- - -There are 1.6 million rows and 21 columns - -[source,R] ---- -%%R -dim(myDF) ---- - -The states where the last 100 donations were made are: - -[source,R] ---- -%%R -tail(myDF$STATE, n=100) ---- - -Altogether, there were 1.8 billion dollars in contributions - -[source,R] ---- -%%R -sum(myDF$TRANSACTION_AMT) ---- - -The largest numbers of contributions (regardless of the size of the contributions) were made by residents of `CA`, `NY`, `TX`, etc. - -[source,R] ---- -%%R -sort(table(myDF$STATE), decreasing=T) ---- - -We can paste the first 6 cities and the first 6 states together, using the `paste` function: - -[source,R] ---- -%%R -head(myDF$CITY) ---- - -[source,R] ---- -%%R -head(myDF$STATE) ---- - -[source,R] ---- -%%R -paste(head(myDF$CITY), head(myDF$STATE)) ---- - -Then we can tabulate how many times those 6 city-state pairs occur, and sort the results, and display the head. - -[source,R] ---- -%%R -head(sort(table(paste(head(myDF$CITY), head(myDF$STATE))), decreasing=T)) ---- - -Now that this works for the first 6 city-state pairs, we can do this again for the entire data set. We see that the most donations were made from some typically large cities. There are also a lot of donations from unknown locations. - -[source,R] ---- -%%R -head(sort(table(paste(myDF$CITY, myDF$STATE)), decreasing=T), n=20) ---- - -or in the opposite order: - -[source,R] ---- -%%R -tail(sort(table(paste(myDF$CITY, myDF$STATE)), decreasing=F), n=20) ---- - - -Here are the names of the people who made the largest number of contributions (regardless of the size of the contributions themselves): - -[source,R] ---- -%%R -head(sort(table(myDF$NAME), decreasing=T)) ---- - -Now we can learn how to use the `tapply` function. - -The `tapply` function takes three things, namely, some data, some groups to sort the data, and a function to run on the data. - -For instance, we can take the data about the election transaction amounts, and split the data according to the state where the donation was made, and sum the dollar amounts of those election donations within each state. 
- -[source,R] ----- -%%R -head(sort(tapply(myDF$TRANSACTION_AMT, myDF$STATE, sum), decreasing=T)) ----- - -We can do something similar, now summing the amounts of the transactions in dollars, splitting the data according to the name of the donor: - -[source,R] ----- -%%R -head(sort(tapply(myDF$TRANSACTION_AMT, myDF$NAME, sum), decreasing=T), n=20) ----- - -Now we return to the airline data set from 2005: - -[source,R] ----- -%%R -myDF <- fread("/anvil/projects/tdm/data/flights/subset/2005.csv") ----- - -We can take an average of the departure delays, split according to the airline for the flights: - -[source,R] ----- -%%R -tapply( myDF$DepDelay, myDF$UniqueCarrier, mean, na.rm=T ) ----- - -We can sum the distances of the flights according to the airports where the flights departed: - -[source,R] ----- -%%R -head(sort( tapply( myDF$Distance, myDF$Origin, sum ), decreasing=T )) ----- - -We can take an average of the arrival delays according to the destination where the flights landed. - -[source,R] ----- -%%R -head(sort( tapply( myDF$ArrDelay, myDF$Dest, mean, na.rm=T ), decreasing=T )) ----- - - - - - - - - - diff --git a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-day4-notes.adoc b/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-day4-notes.adoc deleted file mode 100644 index 94da4cffc..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-day4-notes.adoc +++ /dev/null @@ -1,101 +0,0 @@ -= Think Summer: Day 4 Notes -- 2024 - -We always need to re-load the libraries, if our kernel dies or if we start an all-new session. - -Loading the R `data.table` library - -[source,R] ----- -%%R -library(data.table) ----- - -== Loading the R library for SQL, and loading the database - -We need to load this library, to make a connection to the database at the start. If something goes wrong with our database queries, we can always come back and run these two lines again. Ideally, we should only need to run these once per session, but sometimes we make mistakes, and our kernel dies, and we need to run these lines again. - -[source,R] ----- -%%R -library(RSQLite) -conn <- dbConnect(RSQLite::SQLite(), "/anvil/projects/tdm/data/movies_and_tv/imdb.db") ----- - -== Importing data from SQL to R - -For example, we can import the number of `titles` per year from SQL into R. (We are doing the work in SQL of finding out how many titles occurred in each year.) - -[source,R] ----- -%%R -myDF <- dbGetQuery(conn, "SELECT COUNT(*), premiered FROM titles GROUP BY premiered;") ----- - -Let's first look at the `head` of the result: - -[source,R] ----- -%%R -head(myDF) ----- - -We can assign names to the columns of the data frame: - -[source,R] ----- -%%R -names(myDF) <- c("mycounts", "myyears") ----- - -and now the `head` of the data frame looks like this: - -[source,R] ----- -%%R -head(myDF) ----- - -Finally, we are prepared to plot the number of titles per year. We plot the years on the x-axis and the counts on the y-axis: - -[source,R] ----- -%%R -plot(myDF$myyears, myDF$mycounts) ----- - -Another way to do this is to import all of the years that titles premiered, and then make a table in `R` and plot the table. (This time, we are doing the work in R of finding out how many titles occurred in each year.) 
- -[source,R] ----- -%%R -myDF <- dbGetQuery(conn, "SELECT premiered FROM titles;") ----- - -[source,R] ----- -%%R -head(myDF) ----- - -[source,R] ----- -%%R -tail(myDF) ----- - -Now we make a table of the results: - -[source,R] ----- -%%R -table(myDF$premiered) ----- - -and we can plot the results: - -[source,R] ----- -%%R -plot(table(myDF$premiered)) ----- - diff --git a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-day5-notes.adoc b/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-day5-notes.adoc deleted file mode 100644 index cd62d9ce8..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-day5-notes.adoc +++ /dev/null @@ -1,109 +0,0 @@ -= REEU: Day 5 Notes -- 2024 - -We are using the `seminar-r` kernel today - -Loading the R `leaflet` library and `sf` library - -[source,R] ----- -library(leaflet) -library(sf) ----- - -and setting options for the display in Jupyter Lab: - -[source,R] ----- -options(jupyter.rich_display = T) ----- - -Here are three sample points: - -[source,R] ----- -testDF <- data.frame(c(40.4259, 41.8781, 39.0792), c(-86.9081, -87.6298, -84.17704)) ----- - -Let's name the columns as `lat` and `long` - -[source,R] ----- -names(testDF) <- c("lat", "long") ----- - -Now we can define the points to plot: - -[source,R] ----- -points <- st_as_sf( testDF, coords=c("long", "lat"), crs=4326) ----- - -and render the map: - -[source,R] ----- -addCircleMarkers(addTiles(leaflet( testDF )), radius=1) ----- - -== Craigslist example - -Now we can try this with Craigslist data - -First we load the `data.table` library - -[source,R] ----- -library(data.table) ----- - -Now we read in some Craigslist data. This takes some time: - -[source,R] ----- -myDF <- fread("/anvil/projects/tdm/data/craigslist/vehicles.csv", - stringsAsFactors = TRUE) ----- - -We can look at the head of the data: - -[source,R] ----- -head(myDF) ----- - -and the names of the variables in the data: - -[source,R] ----- -names(myDF) ----- - -Here are the Craiglist listings from Indiana: - -[source,R] ----- -indyDF <- subset(myDF, state=="in") ----- - -and we want to make sure that the `long` and `lat` values are not missing: - -[source,R] ----- -testDF <- indyDF[ (!is.na(indyDF$long)) & - (!is.na(indyDF$lat))] ----- - -Now we set the points to be plotted: - -[source,R] ----- -points <- st_as_sf( testDF, coords=c("long", "lat"), crs=4326) ----- - -and we draw the map: - -[source,R] ----- -addCircleMarkers(addTiles(leaflet( testDF )), radius=1) ----- - diff --git a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-01.adoc b/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-01.adoc deleted file mode 100644 index 6abdaa6b0..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-01.adoc +++ /dev/null @@ -1,21 +0,0 @@ -= Think Summer: Project 1 -- 2024 - - -== How many people from the IMDB database were born in the same year as you? How many were born in the year that Dr Ward was born (1976)? Find your favorite actor or actress in the IMDB database and get their personID. What year were they born? - - -== In which years was the tvSeries House of Cards broadcast? Hint: You likely need to limit the results so that the `type` is `tvSeries`. - - -== Consider the tvSeries The West Wing. How many episodes of this tvSeries aired altogether? Now, consider your favorite tvSeries: how many episodes does your favorite show have? 
- - -== During the years 1980 to 1990, how many people (from the people table) were born in each year? - - -== Lookup the `personID` for Whoopi Goldberg on IMDB. Using her `personID`, how many times has she been a member of a `crew`? How many times was she an actress? A producer? A writer? Now pick your own favorite director, actor or actress, or writer, etc., and find how many crews include that person? - - -== How many titles have Adventure as one of the genres? In which year did the most titles appear, with Adventure as one of the genres? - - diff --git a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-02.adoc b/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-02.adoc deleted file mode 100644 index 9654b0d16..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-02.adoc +++ /dev/null @@ -1,29 +0,0 @@ -= Think Summer: Project 2 -- 2024 - -== On Monday, in Question 5, we started with Whoopi Goldberg's `person_id`. Revisit this question WITHOUT looking up her `person_id` from the Internet, as follows: Search in the `people` table for entries with `name` equal to `Whoopi Goldberg`. Then `join` the `people` table to the `crew` table, to discover the same things as you did on Monday, namely: how many times has she been a member of a `crew`? How many times was she an actress? A producer? A writer? Now pick your own favorite director, actor or actress, or writer, etc., and find how many crews include that person? Hint: When you choose your own favorite person, start with their name (instead of starting with their `person_id`). Be sure to check that there is only one person with that name in the `people` table. - - -== Who was the Director of the movie `Say Anything...`? Hint: Start by finding the movie title `Say Anything...` in the `titles` table. Then `join` the `titles` table to the `crew` table, to find out which person was the director of `Say Anything...`. (P.S. This is Dr Ward's all-time favorite movie.) - - -== Starting with only the `person_id` from Question 2, use (only one) SQL query to find their name and also the titles of all of the movies that they have directed in their career. - - -== Join the `titles` table and the `ratings` table, to see how many tvEpisode values have more than 200000 votes. - - -== How many tvEpisodes have more than 1000 votes and also have rating 8 or higher? - - -== For how many movies has George Clooney been on the `crew`? - - -== Which of George Clooney's movies are the most popular? (By "popularity", you can choose to either analyze the ratings or the number of votes; either way is OK with us! Just be sure to explain what you did in your solution.) - - -== How many episodes of The West Wing had rating 9 or higher? - - -== Bonus question (TOTALLY OPTIONAL! Not required!) In question 4, we found the ID numbers of two episodes that had more than 200000 votes each. It turns out that these two episodes are from the same tvSeries. Which tvSeries was this? To solve this question, first `join` the `episodes` table, linking your previous `title_id` values to the `episode_title_id`. After you check that this worked, then `join` the `titles` table again (using a new nickname for the titles table, of course~), linking your `show_title_id` to the `title_id`. 
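For reference, the double `join` described in this bonus question has roughly the following shape. This is only a sketch, not the solution itself: the `tt...` values below are placeholders (not the actual `title_id` values from Question 4), and the table nicknames are arbitrary.

[source,sql]
----
%%sql
-- placeholder title_id values; substitute the two IDs you found in Question 4
SELECT t2.primary_title
    FROM episodes AS e
    JOIN titles AS t2 ON e.show_title_id = t2.title_id
    WHERE e.episode_title_id IN ('tt0000001', 'tt0000002')
    LIMIT 5;
----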
- - diff --git a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-03.adoc b/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-03.adoc deleted file mode 100644 index e7988d2b4..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-03.adoc +++ /dev/null @@ -1,58 +0,0 @@ -= Think Summer: Project 3 -- 2024 - -=== Question 1 - -How many flights departed from Dallas and landed in Denver? (Please lookup the airport codes for both airports.) How many flights went the other way, from Denver back to Dallas? - -=== Question 2 - -First consider only the flights that arrive to Cincinnati (airport code `CVG`), i.e., for which Cincinnati is the destination. What are the 10 most popular origin airports, for travelers coming to Cincinnati from these other cities? Consider a major airport that is near where you grew up, and answer the same question for that airport. - -=== Question 3 - -Each airplane has a unique `TailNum`. Which airplane flew the most flights in 2005? Hint: If you strictly look at the `TailNum` values, you will need to ignore the top two results, because they are missing data. (Missing data is part of real life! Almost every data set in the real world has a ton of missing data!) - -=== Question 4 - -Which airplane flew the most times from Chicago to Indianapolis? Hint: Again, if you strictly look at the `TailNum` values, you may need to ignore the top result, because it has missing data. - -=== Question 5 - -What is the Distance (in miles) of any individual flight from Seattle (`SEA`) to Los Angeles (`LAX`)? Hint: You might paste together the origin and destination airports of the flights. - -=== Question 6 - -Which airline carrier had the most flights from Atlanta to New York City in 2005? (The `UniqueCarrier` is the airline carrier for the flight.) You can consider this question in several ways, since New York City has several airports, but it is recommended to focus on John F. Kennedy International Airport (JFK), LaGuardia Airport (LGA), and Newark International Airport (EWR). - -=== Question 7 - -Pick 5 holidays from 2005 (government, religious, etc., whichever you want), and see how many flights departed on each of those days. Hint: You might paste together the month and day of the flights. - -=== Question 8 - -What is the best month for traveling, i.e., the average departure delays are the best during that month? Hint: You might use a tapply function. - -=== Question 9 - -Make a `dotchart` that illustrates the data from the previous question (about the average flight delays in each month). - -=== Question 10 - -Create your own interesting question about the 2005 flight data. What insights can you find? - -=== Question 11 - -How much money was donated in the federal election campaigns (altogether) in 2000 from the state where you grew up? - -=== Question 12 - -From which zip code was the most money donated? (Where is the zip code in the USA?) Hint: Do not worry too much about the fact that people sometimes write 5-digit zip codes and sometimes write 9-digit zip codes. We are just getting familiar with the data! - -=== Question 13 - -Consider a profession that you are interested in (for instance, "professor" or "engineer" or "scientist"). Spend a little time analyzing how much money was donated by people from that profession, during the 2000 federal election campaigns. - -=== Question 14 - -Create your own interesting question about the 2000 federal election campaign data. What insights can you find? 
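As a quick reference for the hints in Questions 5, 7, and 8 above, these are the generic `paste` and `tapply` patterns from the Day 3 notes (a sketch only, not a worked solution; it assumes the 2005 flights data frame `myDF` is already loaded):

[source,R]
----
%%R
# paste two columns together element-wise, then tabulate the combined values
head(table(paste(myDF$Origin, myDF$Dest, sep="-")))

# split one column into groups defined by another column, and apply a function within each group
tapply(myDF$DepDelay, myDF$UniqueCarrier, mean, na.rm=TRUE)
----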
- - diff --git a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-04.adoc b/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-04.adoc deleted file mode 100644 index 737cef898..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-04.adoc +++ /dev/null @@ -1,59 +0,0 @@ -= Think Summer: Project 4 -- 2024 - -=== Question 1 - -For the show Gilmore Girls, there are 7 seasons listed in the IMDB database. Find the average rating of each of the seven seasons. Hint: Use `AVG` to find the average, and `GROUP BY` the `season_number`. Make a plot or dotchart to show the average rating for each season in R. - -=== Question 2 - -Identify the six most popular episodes of the show Grey's Anatomy (where "popular" denotes a high rating). - -=== Question 3 - -Make a dotchart in R showing the results of the previous question. -Hint: You can use your work from SQL, and export the results to a dataframe called `myDF` in R. Then you can use something like: - -[source,R] ---- -# use a dbGetQuery here, to import the SQL results to R, and then -myresults <- myDF$rating -names(myresults) <- myDF$primary_title -dotchart(myresults) ---- - -=== Question 4 - -Make a plot or dotchart showing the total amount of money donated in each of the top 10 states, during the 2000 federal election cycle. - -=== Question 5 - -Make a dotchart that shows how many movies premiered in each year. You do not need to show all of the years; there are too many years! Just show the number of movies premiered in each year since the year 2000. - -=== Question 6 - -Among the three big New York City airports (`JFK`, `LGA`, `EWR`), which of these airports had the worst `DepDelay` (on average) in 2005? (Can you solve this with 1 line of R, using a `tapply` (rather than using 3 separate lines of R)? Hint: After you run the `tapply`, you can index your results using `[c("JFK", "LGA", "EWR")]` to look up all 3 airports at once.) - -=== Question 7 - -Use `LIKE` to analyze the `primary_title` of all IMDB titles: First determine how many titles have `Batman` anywhere in the title, and then determine how many titles have `Superman` anywhere in the title. Which one occurs more often? - -=== Question 8 - -How much money was donated during the 2000 federal election cycle by people who have `PURDUE` listed somewhere in their employer name? How much money was donated by people who have `MICROSOFT` listed somewhere in their employer name? Hint: You might use the `grep` or the `grepl` function (which is a logical grep) to solve this one; a generic sketch of this pattern appears at the end of this project. - -=== Question 9 - -How much money was donated during the 2000 federal election cycle by people from your hometown? (Be sure to match the city and the state.) - -=== Question 10 - -During the years 2000 to 2020, how many people (from the people table) died in each year? Make a plot or dotchart to show the number of people who died in each year. - -=== Question 11 - -Consider only the flights that arrive to Indianapolis (airport code `IND`), i.e., for which Indianapolis is the destination. What are the 10 most popular origin airports? Make a plot or dotchart to show the number of flights from each of these 10 most popular origin airports (with Indianapolis as the destination airport). - -=== Question 12 - -Create your own interesting question based on the things you have learned this week. What insights can you find? 
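Following up on the hint for Question 8, here is a generic sketch of the `grepl` pattern. The pattern `"EXAMPLE"` is a placeholder (not one of the employers from the question), and the sketch assumes the 2000 election data frame `myDF` with the column names assigned in the Day 3 notes:

[source,R]
----
%%R
# total transaction amount, restricted to rows whose EMPLOYER field contains the placeholder pattern "EXAMPLE"
sum(myDF$TRANSACTION_AMT[grepl("EXAMPLE", myDF$EMPLOYER)], na.rm=TRUE)
----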
- diff --git a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-introduction.adoc b/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-introduction.adoc deleted file mode 100644 index 5ef9a9e74..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-introduction.adoc +++ /dev/null @@ -1,84 +0,0 @@ -= Think Summer: Introduction -- 2024 - -== Submission - -Students need to submit the following file **by 10:00PM EST** through Gradescope inside Brightspace. - -. A Jupyter notebook (a `.ipynb` file). - -We've provided you with a template notebook for you to use. Please carefully read xref:summer2024/summer-2024-project-template.adoc[this section] to get started. - -[CAUTION] -==== -When you are finished with the project, please make sure to run every cell in the notebook prior to submitting. To do this click menu:Run[Run All Cells]. Next, to export your notebook (your `.ipynb` file), click on menu:File[Download], and download your `.ipynb` file. -==== - -== Project - -**Motivation:** SQL is an incredibly powerful tool that allows you to process and filter massive amounts of data -- amounts of data where tools like spreadsheets start to fail. You can perform SQL queries directly within the R environment, and doing so allows you to quickly perform ad-hoc analyses. - -**Context:** This project is specially designed for Purdue University's Think Summer program, and is coordinated by https://datamine.purdue.edu/[The Data Mine]. - -**Scope:** SQL, SQL in R - -.Learning Objectives -**** -- Demonstrate the ability to interact with popular database management systems within R. -- Solve data-driven problems using a combination of SQL and R. -- Use basic SQL commands: select, order by, limit, desc, asc, count, where, from. -- Perform grouping and aggregate data using group by and the following functions: count, max, sum, avg, like, having. -**** - -== Dataset - -The following questions will use the `imdb` database found in Anvil, our computing cluster. - -This database has 6 tables, namely: - -`akas`, `crew`, `episodes`, `people`, `ratings`, and `titles`. - -You have a variety of options to connect with, and run queries on our database: - -. Run SQL queries directly within a Jupyter Lab cell. - -First, you must launch a new Jupyter Lab instance. To do so, please follow the instructions below. - -. Open a browser and navigate to https://ondemand.anvil.rcac.purdue.edu, and login using your ACCESS credentials. -+ -. Click on "My Interactive Sessions". -+ -. Click on Jupyter Notebook in the left-hand menu **under "The Data Mine" section** (near the bottom of the screen). Select the following settings: -+ -* Allocation: cis220051 -* Queue: shared -* Time in Hours: 3 -* Cores: 1 -* Use Jupyter Lab instead of Jupyter Notebook: Checked -+ -. When satisfied, click btn:[Launch], and wait for a minute. In a few moments, you should get a note indicating that your session is ready to run. -+ -. When you are ready, click btn:[Connect to Jupyter]. A new browser tab will launch. -+ -. Under the "Notebook" menu, please select the btn:[seminar] (look for the big "S"; we do not want btn:[seminar-r]). -+ -You now have a running Jupyter Lab notebook ready for you to use. This Jupyter Lab instance is running on the https://anvil.rcac.purdue.edu[Anvil cluster]. By using OnDemand, you've essentially carved out a small portion of the compute power to use. Congratulations! - -To run queries directly in a Jupyter Lab cell, please do the following. - -. 
In the first cell, run the following code. This code establishes a connection to the `imdb.db` database, which allows you to directly run SQL queries in a cell as long as that cell has `%%sql` at the top of the cell. -+ -[source, ipynb] ----- -%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db ----- -+ -. After running that cell (for example, using kbd:[Ctrl+Enter]), you can directly run future queries in each cell by starting the cell with `%%sql` in the first line. For example. -+ -[source, sql] ----- -%%sql - -SELECT * FROM titles LIMIT 5; ----- -+ - diff --git a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-template.adoc b/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-template.adoc deleted file mode 100644 index f4a501e68..000000000 --- a/projects-appendix/modules/ROOT/pages/summer2024/summer-2024-project-template.adoc +++ /dev/null @@ -1,106 +0,0 @@ -= Templates - -Our course project template can be copied into your home directory in two different ways: - -You can download it in your browser xref:attachment$think_summer_project_template.ipynb[here] and then upload it using the little up-arrow to your Anvil account, or you can use the following code in a Jupyter Lab cell: - -[source,bash] ----- -%%bash -cp /anvil/projects/tdm/etc/think_summer_project_template.ipynb $HOME ----- - -Students can use and modify this as a template as needed, for all project submissions. This template is a starting point for all projects. - -By default, the `seminar` kernel runs Python code. To run other types of code, see below. - -== Running `R` code using the `seminar` kernel - -[source,ipython] ----- -%%R - -my_vec <- c(1,2,3) -my_vec ----- - -As you can see, any cell that begins with `%%R` will run the R code in that cell. If a cell does not begin with `%%R`, it will be assumed that the code is Python code, and run accordingly. - -== Running SQL queries using the `seminar` kernel - -. First, you need to establish a connection with the database. If this is a sqlite database, you can use the following command. -+ -[source,ipython] ----- -%sql sqlite:///my_db.db -# or -%sql sqlite:////anvil/projects/tdm/data/path/to/my_db.db ----- -+ -Otherwise, if this is a mysql database, you can use the following command. -+ -[source,ipython] ----- -%sql mariadb+pymysql://username:password@my_url.com/my_database ----- -+ -. Next, to run SQL queries, in a new cell, run the following. -+ -[source,ipython] ----- -%%sql - -SELECT * FROM my_table; ----- - -As you can see, any cell that begins with `%%sql` will run the SQL query in that cell. If a cell does not begin with `%%sql`, it will be assumed that the code is Python code, and run accordingly. - -== Running `bash` code using the `seminar` kernel - -To run `bash` code, in a new cell, run the following. - -[source,bash] ----- -%%bash - -ls -la ----- - -As you can see, any cell that begins with `%%bash` will run the `bash` code in that cell. If a cell does not begin with `%%bash`, it will be assumed that the code is Python code, and run accordingly. - -[TIP] -==== -Code cells that start with `%` or `%%` are sometimes referred to as magic cells. To see a list of available magics, run `%lsmagic` in a cell. - -The commands listed in the "line" section are run with a single `%` and can be mixed with other code. For example, the following cell contains (in order) some Python code, uses a single line magic, followed by some more Python code. 
- -[source,ipython] ----- -import pandas as pd - -%time myDF = pd.read_parquet("/anvil/projects/tdm/data/whin/weather.parquet") - -myDF.head() ----- - -The commands listed in the "cell" section are run with a double `%%` and apply to the entire cell, rather than just a single line. For example, `%%bash` is an example of a cell magic. - -You can read more about some of the available magics in the https://ipython.readthedocs.io/en/stable/interactive/magics.html#[official documentation]. -==== - -== Including an image in your notebook - -To include an image in your notebook, use the following Python code. - -[source,python] ----- -from IPython import display -display.Image("./cloud.png") ----- - -Here, `./cloud.png` is the path to the image you would like to include. - -[IMPORTANT] -==== -If you choose to include an image using a Markdown cell, and the `![](...)` syntax, please note that while the notebook will render properly in our https://ondemand.anvil.rcac.purdue.edu environment, it will _not_ load properly in any other environment where that image is not available. For this reason it is critical to include images using the method shown here. -==== diff --git a/projects-appendix/modules/ROOT/pages/template.adoc b/projects-appendix/modules/ROOT/pages/template.adoc deleted file mode 100644 index 1baa84566..000000000 --- a/projects-appendix/modules/ROOT/pages/template.adoc +++ /dev/null @@ -1,87 +0,0 @@ -= TDM 10100: [LANGUAGE] Project X -- 2024 - -**Motivation:** Ipsum lorem - -**Context:** Ipsum lorem - -**Scope:** Ipsum lorem - -.Learning Objectives: -**** -- Ipsum lorem -- Ipsum lorem -- Ipsum lorem -**** - -Make sure to read about, and use the template found xref:templates.adoc[here], and the important information about project submissions xref:submissions.adoc[here]. - -== Dataset(s) - -This project will use the following dataset(s): - -- Ipsum lorem -- Ipsum lorem - -== Questions - -=== Question 1 (2 pts) - -Ipsum lorem dolor sit amet, consectetur adipiscing elit - -.Deliverables -==== -- Ipsum lorem -==== - -=== Question 2 (2 pts) - -Ipsum lorem dolor sit amet, consectetur adipiscing elit - -.Deliverables -==== -- Ipsum lorem -==== - -=== Question 3 (2 pts) - -Ipsum lorem dolor sit amet, consectetur adipiscing elit - -.Deliverables -==== -- Ipsum lorem -==== - -=== Question 4 (2 pts) - -Ipsum lorem dolor sit amet, consectetur adipiscing elit - -.Deliverables -==== -- Ipsum lorem -==== - -=== Question 5 (2 pts) - -Ipsum lorem dolor sit amet, consectetur adipiscing elit - -.Deliverables -==== -- Ipsum lorem -==== - -== Submitting your Work - -This is where we're going to say how to submit your work. Probably a bit of copypasta. - -.Items to submit -==== -- Ipsum lorem -- Ipsum lorem -==== - -[WARNING] -==== -You _must_ double check your `.ipynb` after submitting it in gradescope. A _very_ common mistake is to assume that your `.ipynb` file has been rendered properly and contains your code, markdown, and code output even though it may not. **Please** take the time to double check your work. See https://the-examples-book.com/projects/submissions[here] for instructions on how to double check this. - -You **will not** receive full credit if your `.ipynb` file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this. 
-==== \ No newline at end of file diff --git a/projects-appendix/modules/ROOT/pages/templates.adoc b/projects-appendix/modules/ROOT/pages/templates.adoc deleted file mode 100644 index 091c98820..000000000 --- a/projects-appendix/modules/ROOT/pages/templates.adoc +++ /dev/null @@ -1,34 +0,0 @@ -= Templates - -Any of these three options can be used, to get the project template into your account for the first time. You only need to do this once! Any of these three options is OK; you do *NOT* need to do all 3 options. - -== Option 1 - -=== How to download the template to your computer and then upload it to Jupyter Lab - -++++ -<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_c5k8i7jk&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_aheik41m" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="TDM 10100 Project 13 Question 1"></iframe> -++++ - -== Option 2 - -=== How to download the template using the `File / Open from URL` option - -++++ -<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_aswefkfa&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_aheik41m" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="TDM 10100 Project 13 Question 1"></iframe> -++++ - -== Option 3 - -=== How to download the template by copying it in the terminal - -++++ -<iframe id="kaltura_player" 
src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_7zlhwgi1&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_aheik41m" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="TDM 10100 Project 13 Question 1"></iframe> -++++ - -Our course project template can be found xref:attachment$project_template.ipynb[here], or on Anvil: - -`/anvil/projects/tdm/etc/project_template.ipynb` - -Students in TDM 101000, 20100, 30100, and 40100 can use and modify this as a template as needed, for all project submissions. This template is a starting point for all projects. -